Lj V. Miranda - Micro Blog

On building my personal LLM benchmark

Over the past week, I’ve experimented with building a personal LLM benchmark that tests models against my specific tastes and requirements. I’d definitely encourage anyone to build their own suite, but it’s quite involved. I’ll probably write about the experience soon in a proper blog post…

Anyway, while going through my test cases, I realized that most of the things I care about are a matter of taste and style. For example, I prefer matplotlib code written a certain way (using fig, ax = plt.subplots(...) rather than calling plt directly) and concise responses over verbose ones. I wonder if there’s a way we can incorporate these personal preferences during finetuning (assuming we have the resources to do so) with zero-to-few human annotation effort.
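To make that style preference concrete, here’s a minimal sketch of the two ways to write the same plot (the data and labels are just placeholders):

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# The style I prefer: explicit Figure and Axes objects
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("plot.png")

# The style I tend to mark down: implicit global state via plt
plt.plot(x, y, marker="o")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("plot.png")
```

Both produce the same figure, but the first makes it explicit which Figure and Axes you’re drawing on, which matters once you have subplots or multiple figures in flight.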

This reminds me of a slide from John Schulman’s talk on open problems in High-Quality Human Feedback (we actually took a stab at this problem in previous work):

Again, this is something I want to write more about in the future. I’m still organizing my thoughts on this. And by the way, it seems like Qwen/Qwen2.5-14B-Instruct-1M is the best model in my personal benchmark so far when accounting for performance and cost-efficiency.

An observation: it seems like tooling for LLM inference diverges between research/academia and industry in some ways. For the past year, I’ve been using a lot of vLLM for inference and (recently) curator for data generation, mostly for research. But I’ve seen a lot of my colleagues in industry use outlines, LangGraph, and PydanticAI.
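For reference, this is roughly what the research-side workflow looks like: a minimal offline-inference sketch with vLLM (the model name and sampling settings here are just examples):

```python
from vllm import LLM, SamplingParams

# Load the model once, then batch prompts through it locally
llm = LLM(model="Qwen/Qwen2.5-14B-Instruct-1M")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = ["Write a haiku about benchmarks."]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```

The industry-side tools mentioned above tend to sit a level higher, wrapping hosted APIs with structured outputs or agent graphs rather than managing local inference.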

Kingdom Come Deliverance II (KCD 2) is so immersive that it might as well be my Game of the Year (I know, it’s too early to say). The potion-making tutorial is so good, it feels like I’m doing it for real!

Blacksmithing is also relaxing! We made so many horseshoes and axes on my first day :)

This is definitely a “game you play during the holidays,” so I can’t wait for summer to sink hours into this gem. I’m still early in the story, but I already recommend it to everyone!

TIL: arXiv-ready LaTeX files from Overleaf

A problem I often run into is that the LaTeX source I download from Overleaf and upload to arXiv has compilation or compatibility errors. Today, I learned that instead of downloading the zip archive to get the LaTeX source, I should use the “Submit” button.

Don’t click the archive button:

Instead, go to your Overleaf project > Submit > Submit Your Papers to arXiv:

This creates a zip file that is packaged specifically for arXiv, so it compiles without the errors I described above!

On Filipino NLP

Over the holidays, I’ve been thinking a lot about what it means to do Filipino NLP now that we’re in the age of LLMs. Multilingual LLMs are getting better, and core NLP tasks such as NER and sentiment analysis can now be handled out of the box by models like GPT-4.

I’ve decided to bet on post-training and adaptation. I believe this unlocks several opportunities for the small, resource-constrained Filipino NLP research community to contribute to a larger whole. Here’s an excerpt from my blog post:

While I still believe in building artisanal Filipino NLP resources, I now see that we need to simultaneously support the development of multilingual LLMs by creating high-quality Filipino datasets and benchmarks. This way, we can actively push for the inclusion of Philippine languages in the next generation of multilingual LLMs, rather than just waiting for improvements to happen on their own.