AIveilix · Industrial Benchmark v2
Industrial benchmark: head-to-head

Benchmark Report · v2 (industrial)

AIveilix vs AnythingLLM
22 documents · 80 questions · 6 formats

An industrial-grade head-to-head: 22 real-world documents (financial filings, scientific papers, government reports, whitepapers, RFCs, spreadsheets, images), 80 questions across 11 categories and 7 question types, scored by an independent LLM judge with a fixed rubric.

Section 01

Summary scoreboard

All metrics on the same 80-question eval set with the same 22 documents on both sides.

Metric | AIveilix | AnythingLLM | Winner

Section 02

Visual breakdown

Legend: AIveilix, AnythingLLM

Average correctness

LLM judge score, 1–5. Higher is better.

Answer-quality distribution

Question counts in each score bucket.
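A distribution like this can be computed by counting how many questions received each judge score. The sketch below assumes integer scores on the 1–5 scale; the function name `bucket_counts` and the sample scores are hypothetical and are not results from this benchmark.

```python
from collections import Counter


def bucket_counts(scores, lo=1, hi=5):
    """Count how many questions fall into each integer score bucket from lo to hi."""
    counts = Counter(scores)
    # Include empty buckets so both systems can be plotted on the same axis.
    return {score: counts.get(score, 0) for score in range(lo, hi + 1)}


# Made-up illustrative scores, NOT data from this report.
example_scores = [5, 4, 5, 2, 5, 3, 4]
example_buckets = bucket_counts(example_scores)
print(example_buckets)
```

Running the same function over each system's scores gives two bucket dictionaries with identical keys, which is what a side-by-side bar chart needs.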

Latency (ms)

Lower is better.

Per-question latency

Time-to-answer per question, in order.

Per-question correctness

Score for every question, side by side.

Section 03

Where each system wins

Average correctness sliced by document category, question type, and file format.
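One way to produce these slices is to group the per-question scores by a metadata field and average each group. This is a minimal sketch, not the report's actual pipeline: the record fields (`category`, `qtype`, `format`, `score`) and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by(records, key):
    """Return the mean judge score for each distinct value of records[i][key]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {value: mean(scores) for value, scores in groups.items()}


# Made-up illustrative rows, NOT data from this report.
example_records = [
    {"category": "finance", "qtype": "lookup", "format": "pdf", "score": 5},
    {"category": "finance", "qtype": "multi_hop", "format": "csv", "score": 3},
    {"category": "science", "qtype": "lookup", "format": "pdf", "score": 4},
]

for field in ("category", "qtype", "format"):
    print(field, average_by(example_records, field))
```

The same per-question records feed all three views; only the grouping key changes.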

By document category

Average correctness per topical domain.

By question type

Average correctness per question type.

By file format

Does it matter whether the document is PDF / TXT / MD / CSV / image?

Section 04

How the benchmark works

The corpus (22 documents, 6 formats)

Question types (80 total)

Judge

An independent LLM judge (temperature=0, to make scoring as repeatable as possible) scored each answer with a fixed rubric prompt. The same judge scored both systems, so any systematic judge bias applies equally to each side.
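A judging step of this kind can be sketched as a fixed prompt template plus a strict score parser. Everything named here is an assumption for illustration: the rubric wording, the use of a reference answer, the `SCORE: <n>` reply format, and the `call_model(prompt, temperature)` callable, which stands in for whatever LLM client is used. The report does not publish its actual rubric prompt.

```python
import re
from typing import Callable

# Hypothetical rubric text; the report's real rubric prompt is not shown.
RUBRIC = (
    "Score the answer from 1 (wrong) to 5 (fully correct) against the reference.\n"
    "Reply with 'SCORE: <n>' followed by one sentence of reasoning."
)


def build_prompt(question: str, reference: str, answer: str) -> str:
    """Fill the fixed rubric template so every answer is judged with the same prompt."""
    return f"{RUBRIC}\n\nQuestion: {question}\nReference: {reference}\nAnswer: {answer}"


def parse_score(reply: str) -> int:
    """Extract a 1-5 score; fail loudly rather than guess on malformed replies."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no valid score in judge reply: {reply!r}")
    return int(match.group(1))


def judge(call_model: Callable[[str, float], str],
          question: str, reference: str, answer: str) -> int:
    """Score one answer; temperature is pinned to 0.0 for repeatability."""
    reply = call_model(build_prompt(question, reference, answer), 0.0)
    return parse_score(reply)


def fake_model(prompt: str, temperature: float) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "SCORE: 4 Mostly correct but omits one figure."


print(judge(fake_model, "What was revenue?", "10M", "About 10M"))
```

Passing the model call in as a parameter keeps the template and parser identical for both systems, so only the answers differ between runs.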

Section 05

Per-question deep dive

For each question: both systems' answers and the judge's reasoning.

Section 06

Takeaways