I ran a partial benchmark against marker - https://github.com/VikParuchuri/marke...

lolinder · 2025-03-07T02:17:16 1741313836

> with LLM as a judge

For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.

Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.

I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?

[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...

themanmaran · 2025-03-07T05:52:57 1741326777

We also ran an OCR benchmark with LLM as judge using structured outputs. You can check out the full methodology on the repo [1]. But the general idea is:

- Every document has ground truth text, a JSON schema, and the ground truth JSON.

- Run OCR on each document and pass the result to GPT-4o along with the JSON Schema

- Compare the predicted JSON against the ground truth JSON for accuracy.

In our benchmark, the ground truth text => gpt-4o was 99.7%+ accuracy. Meaning whenever gpt-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, that means the inaccuracies are isolated to OCR errors.

https://github.com/getomni-ai/benchmark

cdolan · 2025-03-07T05:54:52 1741326892

were you guys able to finish running the benchmark with mistral and got a 70% score? Missed that

Edit - I see it on the Benchmark page now. Woof, low 70% scores in some areas!

https://getomni.ai/ocr-benchmark

themanmaran · 2025-03-07T07:27:18 1741332438

Yup, surprising results! We were able to dig in a bit more. Main culprit is the overzealous "image extraction". Where if Mistral classifies something as an image, it will replace the entire section with (image)[image_002).

And it happened with a lot of full documents as well. Ex: most receipts got classified as images, and so it didn't extract any text.

cdolan · 2025-03-07T14:04:15 1741356255

This sounds like a real problem and hurdle for North American (US/CAN in particular) invoice and receipt processing?

lingjiekong · 2025-03-07T11:47:03 1741348023

where do you find this regarding "Where if Mistral classifies something as an image, it will replace the entire section with (image)[image_002)."?

culi · 2025-03-07T15:35:55 1741361755

themanmaran works at Omni so presumably they have access to the actual resulting data from this study

someothherguyy · 2025-03-07T10:47:03 1741344423

Wouldn't that just bias itself to the shape of the text extracted from the OCR against the shape of the raw text alone? It doesn't seem like it would be a great benchmark for estimating semantic accuracy?

vikp · 2025-03-07T02:41:23 1741315283

Benchmarking is hard for markdown because of the slight formatting variations between different providers. With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.

I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which doesn't make it fair.

There are a few different benchmark types in the marker repo:

  - Heuristic (edit distance by block with an ordering score)
  - LLM judging against a rubric
  - LLM win rate (compare two samples from different providers)

None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.

I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.

arthurcolle · 2025-03-07T03:18:53 1741317533

You can use structured outputs, or something like my https://arthurcolle--dynamic-schema.modal.run/

to extract real data from unstructured text (like that producted from an LLM) to make benchmarks slightly easier if you have a schema

cdolan · 2025-03-07T14:12:18 1741356738

What is the project? It just returns a vanilla html page saying:

Dynamic Schema API API is running. See documentation for available endpoints.

arthurcolle · 2025-03-11T20:49:04 1741726144

It's just a FastAPI app with endpoints that I developed and deployed before OpenAI released structured outputs that used a custom grammar to enforce a pydantic-like schema for Chain of Thought rollouts / structured data extraction from unstructured text. I also use it for a video transcription knowledge base generation API

https://arthurcolle--dynamic-schema.modal.run/docs

carlgreene · 2025-03-07T00:22:38 1741306958

Thank you for your work on Marker. It is the best OCR for PDFs I’ve found. The markdown conversion can get wonky with tables, but it still does better than anything else I’ve tried

vikp · 2025-03-07T02:42:46 1741315366

Thanks for sharing! I'm training some models now that will hopefully improve this and more :)

netdevphoenix · 2025-03-07T11:45:44 1741347944

LLM as a judge?

Isn't that a potential issue? You are assuming the LLM judge is reliable. What evidence do you have to assure yourself or/and others that it is reasonable assumption

bfors · 2025-03-07T19:03:39 1741374219

Perhaps they already evaluated their LLM judge model (with another LLM)

ntkris · 2025-03-07T07:21:54 1741332114

This is awesome. Have you seen / heard of any benchmarks where the data is actually a structured JSON vs. markdown?

ChrisRob · 2025-03-07T14:44:01 1741358641

Thanks for the tip. Marker solved a table conversion without LLM that docling wasn't able to solve.

codelion · 2025-03-07T03:53:52 1741319632

Really interesting benchmark, thanks for sharing! It's good to see some real-world comparisons. The hallucinations issue is definitely a key concern with LLM-based OCR, and it's important to quantify that risk. Looking forward to seeing the full benchmark results.

DeathArrow · 2025-03-07T05:33:05 1741325585

>Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?

boxed · 2025-03-07T05:52:59 1741326779

Why wouldn't hallucinations be agreed upon if they have roughly the same training data?

TJSomething · 2025-03-07T06:52:57 1741330377

A hallucination is often an indication that the model doesn't know something. Then, the internal signal gets dominated by noise from the seeded training weights. Efforts to eliminate hallucinations with a single model have found success by asking the same question in different ways and only taking answers that agree. Logically, you could get more durable results from multiple models on the same prompt.

supriyo-biswas · 2025-03-07T08:20:10 1741335610

We had this article the other day[1] about how multiple LLMs can hallucinate about the same thing, so this is not guaranteed to remove hallucinations that are caused by poor or insufficient training data.

[1] https://news.ycombinator.com/item?id=43222027

boxed · 2025-03-07T07:16:42 1741331802

I don't see why any of that makes logical sense. These models require such enormous training data that they pretty much MUST use the same training data to a very large degree. The training data itself is what they spit out. So "hallucinations" are just the training data you get out, which is the entire point of the models in the first place. There is no difference between an hallucination and a correct answer from the perspective of the math.

neuronic · 2025-03-07T07:45:49 1741333549

Isn' it just statistical word pattern prediction based on training data? These models likely don't "know" something anyway and cannot verify "truth" and facts. Reasoning attempts seem to me basically just like looping until the model finds a self-satisfying equilibrium state with different output.

In that way, LLMs are more human than, say, a database or a book containing agreed-upon factual information which can be directly queried on demand.

Imagine if there was just ONE human with human limitations on the entire planet who was taught everything for a long time - how reliable do you think they are with information retrieval? Even highly trained individuals (e.g. professors) can get stuff wrong on their specific topics at times. But this is not what we expect and demand from computers.

stavros · 2025-03-07T09:17:17 1741339037

I like the licensing options! Hopefully they make enough money to fund development.