LLMs as Judges: Automating Quality at Scale

Contributors
Andreas Granmo
CTO, Obvlo

In the current engineering landscape, "LLM as a Judge" has become a standard term. It refers to the practice of using one Large Language Model to evaluate the output of another (or of a human). While it is now a recognised design pattern, at Obvlo it has been the engine room of our architecture since we integrated the earliest commercial models.

Our mission is to empower travel businesses to sell destinations at scale. "Scale" is the engineering constraint: we cannot manually review hundreds of thousands of assets. "Hyper-relevant" is the quality constraint: automated content cannot look automated.

To bridge the gap between volume and quality, we rely on the judgment of our models.

The Visual Obstacle

Our first hurdle was ingestion. We ingest massive datasets of imagery from various partners and sources. Manually sorting good travel photography from bad is impossible at our volume.

Early on, we deployed multimodal models to act as gatekeepers. We don't just ask a model to tag an image; we ask it to critique it. Is the lighting poor? Is the composition cluttered? Does this image actually represent "luxury dining" or does it look like a standard cafeteria?
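
The critique step can be sketched as a structured verdict. Everything below is illustrative, not our production rubric: the prompt wording, the three axes, the 1-10 scale, and the threshold are assumptions, and the actual scoring would come from a multimodal model call rather than a canned JSON string.

```python
import json

# Illustrative critique prompt; field names and the 1-10 scale are
# assumptions, not the production rubric.
CRITIQUE_PROMPT = (
    "Critique this travel image for the theme '{theme}'. "
    "Return JSON with integer scores 1-10 for: lighting, composition, relevance."
)

def parse_verdict(raw_json: str, threshold: int = 7) -> dict:
    """Turn the judge's JSON critique into an accept/reject decision.

    An image is discarded if any axis falls below the threshold, so a
    technically proficient photo with low relevance is still rejected.
    """
    critique = json.loads(raw_json)
    scores = {axis: critique[axis] for axis in ("lighting", "composition", "relevance")}
    return {"accepted": min(scores.values()) >= threshold, "scores": scores}
```

A sharp, well-lit cafeteria photo submitted under "luxury dining" would score high on lighting and composition but low on relevance, and the minimum-score rule rejects it anyway.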

With every model iteration, from early vision implementations to the latest release, we see incremental improvements in nuance. The system is now capable of discarding technically proficient but contextually irrelevant images without human intervention.

Relevance Over Ratings

Once the visual ingestion pipeline was stable, we moved to curation. We quickly found that aggregating reviews and ratings is insufficient. A location perfect for a large family is often terrible for a couple, yet both users might give it five stars.

We call potential matches "candidates." For any given theme and location, we might find thousands of candidates. To sift through them, we implemented a judge that compares the candidate against the specific intent of the content we are building.

If we are generating a guide for couples, the system evaluates the candidate against that specific demographic context. It doesn't matter if an activity has perfect reviews; if it is better suited for a family, the judge rejects it. This semantic filtering allows us to maintain high relevance at a scale where manual curation would be impossible.
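
A minimal sketch of that filtering step. The names (`Candidate`, `select_candidates`) are illustrative, and the toy keyword judge stands in for what would be an LLM call in production:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    name: str
    description: str
    rating: float  # aggregate star rating, deliberately ignored by the judge

def select_candidates(
    candidates: List[Candidate],
    intent: str,
    judge: Callable[["Candidate", str], bool],
) -> List[Candidate]:
    """Keep only candidates the judge deems relevant to the intent.

    A five-star activity is still rejected if it does not fit the
    demographic context of the guide being built.
    """
    return [c for c in candidates if judge(c, intent)]

# Toy judge standing in for the LLM: rejects family-oriented venues
# when building content for couples.
def toy_judge(candidate: Candidate, intent: str) -> bool:
    if intent == "couples" and "family" in candidate.description.lower():
        return False
    return True
```

The key design point is that `rating` never enters the decision: relevance to the stated intent is the only filter.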

Judging Generative Media

As we moved from curating existing content to generating new images and video, the role of the "judge" became more critical.

Generative video is impressive, but prone to hallucinations and physical inconsistencies. We treat our generative pipelines as adversarial. One agent generates; a separate, distinct agent critiques. If the generated video of a hotel pool features water flowing uphill, the judge catches it. This automated feedback loop is the only way to safely deploy generative media in a commercial environment.
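
The generate/critique loop can be sketched as below. The function names are hypothetical; `generate` and `critique` stand in for the two separate agents, and in production each would be a model call:

```python
from typing import Callable, Tuple

def generate_with_judge(
    generate: Callable[[int], str],
    critique: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Adversarial loop: one agent generates, a distinct agent critiques.

    Regenerates until the judge approves or attempts are exhausted, so a
    clip with water flowing uphill never ships.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        output = generate(attempt)
        ok, feedback = critique(output)
        if ok:
            return output, attempt
    raise RuntimeError(f"Judge rejected all {max_attempts} attempts: {feedback}")
```

Raising on exhaustion, rather than shipping the last draft, is one reasonable policy for commercial media where a bad clip is worse than no clip.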

Localisation and Intent

I have lived in the UK for twenty years, but I still see the cracks in bad translations from Norwegian. Literal translation is easy; capturing intent is difficult.

In travel, "vibe" matters. A description of a night market needs to feel bustling, not just crowded. When we localise content, we do not simply pipe text through a translation API. We use a judge-evaluator pattern.

We generate the translation, and then a secondary process compares the intent of the source material against the meaning of the output. We ask the system: "Does this translated text evoke the same feeling as the original?" If the answer is no, it is regenerated. This ensures that meaning is not lost in syntax.
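
A sketch of that regeneration loop, assuming a pluggable translator and intent judge (the signatures and fallback policy are illustrative):

```python
from typing import Callable

def localise(
    source: str,
    translate: Callable[[str, int], str],
    same_intent: Callable[[str, str], bool],
    max_retries: int = 2,
) -> str:
    """Translate, then ask a secondary judge whether the draft evokes
    the same feeling as the source; regenerate if it does not."""
    draft = ""
    for attempt in range(max_retries + 1):
        draft = translate(source, attempt)
        if same_intent(source, draft):
            return draft
    # After exhausting retries, fall back to the last draft.
    return draft
```

The judge here compares source against output, not output against a reference translation, which is what lets it catch "crowded" where "bustling" was meant.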

Evaluating the Evaluators

The most common question I get from other engineers is: "Which model do you use?"

The answer is: "The one that wins the test this week."

We apply the same rigorous evaluation to the models themselves. We have built an internal framework to benchmark new models as soon as they are released. We don't rely on generic public benchmarks. We test them on our data: travel descriptions, hotel metadata, and destination imagery.

We measure:

  1. Quality: Is the prose natural?
  2. Correctness: Is the factual data (opening times, location) preserved?
  3. Intent: Does it meet the user's requirement?
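
A toy version of ranking models across those three axes. The score table, equal weights, and function name are made up for illustration; the real framework scores against our own travel data:

```python
from typing import Dict, List, Optional

def rank_models(
    scores: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Rank candidate models by weighted score across quality,
    correctness, and intent; the winner powers the task this week."""
    weights = weights or {"quality": 1.0, "correctness": 1.0, "intent": 1.0}

    def total(model: str) -> float:
        return sum(w * scores[model][axis] for axis, w in weights.items())

    return sorted(scores, key=total, reverse=True)
```

Having a single hard number per model per task is what makes the weekly "which model wins" answer mechanical rather than a matter of taste.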

Our architecture is modular. Because we have hard metrics on model performance, we can swap out the underlying engine for a specific task within weeks of a new release. We don't upgrade because of hype; we upgrade because our internal benchmarks prove it's necessary.

AI is not a wrapper for us; it is the product. And for the product to work, we have to trust the intelligence of the architecture to judge its own work.