Benchmark Report · v2 (industrial)

AIveilix vs AnythingLLM
22 documents · 80 questions · 6 formats

An industrial-grade head-to-head: 22 real-world documents (financial filings, scientific papers, government reports, whitepapers, RFCs, spreadsheets, images), 80 questions across 11 categories and 7 question types, scored by an independent LLM judge with a fixed rubric.

Section 02

Visual breakdown

Legend: AIveilix, AnythingLLM

Average correctness

LLM judge score, 1–5. Higher is better.

Answer-quality distribution

Question counts in each score bucket.
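As a rough sketch of how the average and the score buckets could be derived from per-question judge scores, the snippet below computes both in one pass. The function name `summarize_scores` is hypothetical and the example scores are made-up illustrative values, not results from this benchmark.

```python
from collections import Counter
from statistics import mean

def summarize_scores(scores):
    """Return (average, bucket counts) for integer judge scores in 1..5."""
    if not scores:
        raise ValueError("no scores to summarize")
    if any(s not in range(1, 6) for s in scores):
        raise ValueError("judge scores must be integers from 1 to 5")
    counts = Counter(scores)
    # Include empty buckets so both systems share the same set of bars.
    buckets = {score: counts.get(score, 0) for score in range(1, 6)}
    return mean(scores), buckets

# Illustrative values only, NOT measured results.
example_scores = [5, 4, 4, 2, 5, 1]
average, buckets = summarize_scores(example_scores)
print(f"average={average:.2f}", buckets)
```

Keeping the empty buckets explicit makes the two systems' distributions directly comparable even when one of them never receives a given score.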

Latency (ms)

Lower is better.

Per-question latency

Time-to-answer per question, in order.
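The report does not say how latency was captured. One plausible approach, sketched here under that assumption, is to wrap each query in a monotonic timer. `ask` is a hypothetical callable standing in for either system's query interface, and `fake_ask` is a stand-in used only to make the example runnable.

```python
import time
from statistics import mean, median

def timed_ask(ask, question):
    """Call ask(question) and return (answer, elapsed milliseconds)."""
    start = time.perf_counter()
    answer = ask(question)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return answer, elapsed_ms

def latency_summary(latencies_ms):
    """Mean, median, and max of per-question latencies in milliseconds."""
    return {
        "mean_ms": mean(latencies_ms),
        "median_ms": median(latencies_ms),
        "max_ms": max(latencies_ms),
    }

# Stand-in for a real system under test (hypothetical).
def fake_ask(question):
    return f"echo: {question}"

answer, ms = timed_ask(fake_ask, "What is RFC 8259 about?")
print(answer, f"{ms:.3f} ms")
# Illustrative latencies only, NOT measured results.
print(latency_summary([120.0, 80.0, 100.0]))
```

`time.perf_counter` is used instead of wall-clock time because it is monotonic and unaffected by system clock adjustments.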

Per-question correctness

Score for every question, side by side.

Section 03

Where each system wins

Average correctness sliced by document category, question type, and file format.

By document category

Average correctness per topical domain.

By question type

Average correctness per question type.

By file format

Does it matter whether the doc is PDF / TXT / MD / CSV / image?
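The three slices in this section are the same group-and-average operation applied to different keys. A minimal sketch, assuming each result is stored as a record with `category`, `qtype`, `format`, and `score` fields (these field names and the sample records are hypothetical, not taken from the benchmark data):

```python
from collections import defaultdict
from statistics import mean

def average_by(results, key):
    """Group result records by `key` and average their judge scores."""
    groups = defaultdict(list)
    for record in results:
        groups[record[key]].append(record["score"])
    return {name: mean(scores) for name, scores in groups.items()}

# Illustrative records only, NOT measured results.
results = [
    {"category": "Financial", "qtype": "Numeric", "format": "PDF", "score": 4},
    {"category": "Financial", "qtype": "Factual", "format": "PDF", "score": 5},
    {"category": "Tabular data", "qtype": "Numeric", "format": "CSV", "score": 2},
]
for key in ("category", "qtype", "format"):
    print(key, average_by(results, key))
```

Running the same function once per system yields the side-by-side bars for each slice.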

Section 04

How the benchmark works

The corpus (22 documents, 6 formats)

Scientific / arXiv (5): Attention Is All You Need, BERT, GPT-3, Llama 2, DALL-E
Financial (3): Apple Q4 2023 results, Apple 2023 Environmental Report, Berkshire 2022 letter
Crypto whitepapers (2): Bitcoin (Nakamoto), Ethereum (Buterin)
Government / global (3): NASA Artemis Plan 2020, IMF WEO Oct 2023, UN SDG Report 2023
Legal / standards (3): US Constitution, GPLv3, CDC MMWR Vol 72/4
Technical RFCs (2): RFC 9110 (HTTP), RFC 8259 (JSON)
Tabular data (2): iris.csv, titanic.csv
Markdown / plain text (1): python_tutorial.md
Images (1): chart_us_gdp.png (US GDP line chart)
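A corpus manifest like the one below lets a harness check the stated document count before any questions run. The titles are copied from the list above; the name `CORPUS` and the dictionary layout are assumptions made for illustration.

```python
# Document titles copied from the corpus list, grouped by category.
CORPUS = {
    "Scientific / arXiv": [
        "Attention Is All You Need", "BERT", "GPT-3", "Llama 2", "DALL-E",
    ],
    "Financial": [
        "Apple Q4 2023 results",
        "Apple 2023 Environmental Report",
        "Berkshire 2022 letter",
    ],
    "Crypto whitepapers": ["Bitcoin (Nakamoto)", "Ethereum (Buterin)"],
    "Government / global": [
        "NASA Artemis Plan 2020", "IMF WEO Oct 2023", "UN SDG Report 2023",
    ],
    "Legal / standards": ["US Constitution", "GPLv3", "CDC MMWR Vol 72/4"],
    "Technical RFCs": ["RFC 9110 (HTTP)", "RFC 8259 (JSON)"],
    "Tabular data": ["iris.csv", "titanic.csv"],
    "Markdown / plain text": ["python_tutorial.md"],
    "Images": ["chart_us_gdp.png"],
}

total_documents = sum(len(docs) for docs in CORPUS.values())
print(f"{len(CORPUS)} groups, {total_documents} documents")
```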

Question types (80 total)

Factual — single-fact retrieval
Conceptual — definitions and explanations
Multi-hop — needs ≥2 chunks combined
Numeric — find a number in a doc/table
Summarization — gist of a section
Cross-doc — compare facts across multiple docs
Visual — read a chart / image
Hallucination-trap — the answer isn't in the docs. Should refuse, not invent.
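One way to encode the question types listed above is an enum plus a small record type. In this sketch a hallucination-trap question is modeled as having no reference answer, so a refusal is the expected output. The class names, the `reference_answer` field, and the two sample questions are hypothetical, not the benchmark's actual schema or question set.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class QuestionType(Enum):
    FACTUAL = "factual"
    CONCEPTUAL = "conceptual"
    MULTI_HOP = "multi-hop"
    NUMERIC = "numeric"
    SUMMARIZATION = "summarization"
    CROSS_DOC = "cross-doc"
    VISUAL = "visual"
    HALLUCINATION_TRAP = "hallucination-trap"

@dataclass(frozen=True)
class Question:
    text: str
    qtype: QuestionType
    # None means the corpus contains no answer and a refusal is expected.
    reference_answer: Optional[str] = None

    def __post_init__(self):
        is_trap = self.qtype is QuestionType.HALLUCINATION_TRAP
        if is_trap and self.reference_answer is not None:
            raise ValueError("trap questions must not carry a reference answer")
        if not is_trap and self.reference_answer is None:
            raise ValueError("answerable questions need a reference answer")

# Hypothetical sample questions for illustration.
factual = Question(
    "Who wrote the Bitcoin whitepaper?", QuestionType.FACTUAL, "Satoshi Nakamoto"
)
trap = Question("What was Apple's Q4 2031 revenue?", QuestionType.HALLUCINATION_TRAP)
print(factual.qtype.value, trap.reference_answer)
```

Validating at construction time catches a mislabeled trap question before it can distort the scores.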

Judge

An independent LLM judge (temperature=0, to keep scoring as repeatable as possible) scored each answer with a fixed rubric prompt. The same judge scored both systems, so any systematic bias in the judge applies equally to both.
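The judging step can be sketched as a fixed prompt template plus strict parsing of the reply. The report does not publish its rubric, so the prompt text below is an invented placeholder. `call_llm` is a hypothetical callable taking `(prompt, temperature)` rather than any real client library's API, and `fake_judge` exists only to make the example runnable.

```python
import re

# Placeholder rubric, NOT the benchmark's actual prompt.
RUBRIC_PROMPT = (
    "You are grading an answer against a reference.\n"
    "Reply with 'SCORE: <n>' where n is an integer from 1 (wrong) "
    "to 5 (fully correct).\n\n"
    "Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
)

def judge_answer(call_llm, question, reference, answer):
    """Ask the judge model for a 1-5 score and parse it strictly."""
    prompt = RUBRIC_PROMPT.format(
        question=question, reference=reference, answer=answer
    )
    reply = call_llm(prompt, temperature=0)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        # Fail loudly instead of guessing a score from free-form text.
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))

# Stand-in for a real model call (hypothetical).
def fake_judge(prompt, temperature):
    return "SCORE: 4"

score = judge_answer(
    fake_judge,
    question="Which RFC defines JSON?",
    reference="RFC 8259",
    answer="JSON is specified in RFC 8259.",
)
print(score)
```

Using the identical template and the same `judge_answer` call for both systems is what keeps the comparison even-handed.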
