Generate QA test datasets from your docs. Score your RAG with LLM-as-judge. Privacy-first: works fully offline with Ollama.
# Step 1: Generate QA pairs from your docs
$ ragscore generate docs/
✓ Generated 50 QA pairs → output/generated_qas.jsonl
# Step 2: Evaluate your RAG system
$ ragscore evaluate http://localhost:8000/query
============================================================
✓ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
Two simple commands
Point RAGScore at your PDF, TXT, or Markdown files. It reads, chunks, and understands them.
ragscore generate docs/

An LLM creates diverse question-answer pairs with rationale and evidence spans (see the sketch after these steps).
→ output/generated_qas.jsonl

Each question is sent to your RAG. An LLM-as-judge scores the answers 1-5 across 5 metrics.
ragscore evaluate http://rag/query

The fastest way to validate your RAG pipeline.
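If you want to sanity-check the generated dataset before evaluating, it is plain JSONL you can read directly. A minimal sketch: the field names used below (question, answer) are assumptions based on the description above, so open a line of your own generated_qas.jsonl to confirm the exact schema.

import json

# Peek at the generated dataset; adjust the field names to the actual schema.
with open("output/generated_qas.jsonl", encoding="utf-8") as f:
    for line in f:
        qa = json.loads(line)
        print(qa.get("question"), "->", qa.get("answer"))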
Works with Ollama: no API keys, no cloud, your data never leaves your machine.
Async pipeline generates 100 QA pairs in minutes. Evaluate in seconds.
OpenAI, Anthropic, Ollama, DeepSeek, Groq, Mistral, DashScope: auto-detected from your env vars.
5-metric evaluation: correctness, completeness, relevance, conciseness, faithfulness
Auto-detects English, Chinese, Japanese, and German. Prompts adapt to your language.
MCP server for Claude Desktop, Cursor, and other AI assistants
Rich result objects with .plot() and .df: perfect for Jupyter and Colab.
Use quick_test() in pytest. Set accuracy thresholds, get pass/fail, catch regressions automatically.
Get a list of incorrect answers with corrections. Inject them into your RAG to improve accuracy.
# One-liner RAG evaluation
from ragscore import quick_test

result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    detailed=True,
)

result.plot()        # Radar chart
result.df            # pandas DataFrame
result.corrections   # Items to fix
✓ PASSED: 8/10 correct (80%)
Average Score: 4.2/5.0
──────────────────────────────
Correctness: 4.5/5.0
Completeness: 4.2/5.0
Relevance: 4.8/5.0
Conciseness: 3.9/5.0
Faithfulness: 4.6/5.0
──────────────────────────────
2 corrections available.
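The same result object carries everything printed above. A minimal sketch of digging into it, assuming the detailed run from the snippet; the structure of each correction item is not documented here, so print one before wiring this into your pipeline.

# Per-metric scores as a pandas DataFrame, as advertised above.
print(result.df)

# Walk through what the RAG got wrong and feed the fixes back into your pipeline.
for item in result.corrections:
    print(item)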
MCP server for Claude Desktop, Cursor, and other AI assistants
pip install ragscore[mcp]

{
  "mcpServers": {
    "ragscore": {
      "command": "ragscore",
      "args": ["serve"]
    }
  }
}
"Generate QA pairs from my docs/ folder and evaluate my RAG at http://localhost:8000/query"
generate_qa_dataset: Generate QA pairs from PDFs, TXT, or Markdown files
evaluate_rag: Score your RAG endpoint against a QA dataset
quick_test_rag: Generate + evaluate in one call with pass/fail
get_corrections: Get incorrect answers with suggested fixes
LLM providers are auto-detected from your environment variables. Zero config.
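As a rough illustration of the zero-config flow: export the usual key for whichever provider you use, or run a local Ollama and set nothing at all. OPENAI_API_KEY below is the standard OpenAI variable; exactly which variables RAGScore checks per provider is an assumption here, not something this page documents.

import os
from ragscore import quick_test

# Assumption: RAGScore picks up standard provider keys such as OPENAI_API_KEY.
# With a local Ollama server, skip the key entirely; nothing leaves your machine.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, not a real key

result = quick_test(endpoint="http://localhost:8000/query", docs="docs/")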
From solo developers to enterprise AI teams.
Test your RAG pipeline before deploying. Catch hallucinations and missing context early.
"We caught 15 hallucinations in our legal RAG before going live."
Add quick_test() to your test suite. Fail the build if accuracy drops below threshold.
assert quick_test(endpoint, docs).passed
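Expanded into a test file, that one-liner might look like the sketch below. It relies only on the endpoint/docs arguments and the .passed flag shown earlier; how the pass threshold is tuned is not covered here, so check the docs before hard-coding one.

# tests/test_rag_quality.py
from ragscore import quick_test

def test_rag_answers_docs_questions():
    result = quick_test(
        endpoint="http://localhost:8000/query",
        docs="docs/",
    )
    # Fail the build when the RAG regresses below the pass threshold.
    assert result.passed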
Evaluate RAG accuracy across departments. Finance, HR, Legal: each with different accuracy requirements.
Run offline with Ollama. No data leaves your VPC.
Deliver quantified RAG quality reports to clients. Show before/after improvement metrics.
"Accuracy improved from 62% to 91% after 3 rounds."
Join developers using RAGScore for production RAG evaluation