Generate QA test datasets from your docs. Score your RAG with LLM-as-judge. Privacy-first: works fully offline with Ollama.
# Step 1: Generate QA pairs from your docs
$ ragscore generate docs/
✓ Generated 50 QA pairs → output/generated_qas.jsonl
# Step 2: Evaluate your RAG system
$ ragscore evaluate http://localhost:8000/query
============================================================
✓ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
Two simple commands
Point RAGScore at your PDF, TXT, or Markdown files. It reads, chunks, and understands them.
ragscore generate docs/

An LLM creates diverse question-answer pairs with rationale and evidence spans (see the sketch after these steps).
→ output/generated_qas.jsonl

Each question is sent to your RAG. An LLM-as-judge scores the answers 1-5 across 5 metrics.
ragscore evaluate http://rag/query

The fastest way to validate your RAG pipeline.
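If you want to sanity-check the generated dataset before evaluating, it is plain JSONL you can read directly. A minimal sketch: the field names used below (question, answer) are assumptions based on the description above, so open a line of your own generated_qas.jsonl to confirm the exact schema.

import json

# Peek at the generated dataset; adjust the field names to the actual schema.
with open("output/generated_qas.jsonl", encoding="utf-8") as f:
    for line in f:
        qa = json.loads(line)
        print(qa.get("question"), "->", qa.get("answer"))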
Works with Ollama: no API keys, no cloud, your data never leaves your machine.
Async pipeline generates 100 QA pairs in minutes. Evaluate in seconds.
OpenAI, Anthropic, Ollama, DeepSeek, Groq, Mistral, DashScope: auto-detected from your env vars.
5-metric evaluation: correctness, completeness, relevance, conciseness, faithfulness
Auto-detects English, Chinese, Japanese, and German. Prompts adapt to your language.
MCP server for Claude Desktop, Cursor, and other AI assistants
Rich result objects with .plot() and .df: perfect for Jupyter and Colab.
Use quick_test() in pytest. Set accuracy thresholds, get pass/fail, catch regressions automatically.
Get a list of incorrect answers with corrections. Inject them into your RAG to improve accuracy.
# One-liner RAG evaluation
from ragscore import quick_test

result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    detailed=True,
)

result.plot()        # Radar chart
result.df            # pandas DataFrame
result.corrections   # Items to fix
✓ PASSED: 8/10 correct (80%)
Average Score: 4.2/5.0
──────────────────────────────
Correctness: 4.5/5.0
Completeness: 4.2/5.0
Relevance: 4.8/5.0
Conciseness: 3.9/5.0
Faithfulness: 4.6/5.0
──────────────────────────────
2 corrections available.
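The same result object carries everything printed above. A minimal sketch of digging into it, assuming the detailed run from the snippet; the structure of each correction item is not documented here, so print one before wiring this into your pipeline.

# Per-metric scores as a pandas DataFrame, as advertised above.
print(result.df)

# Walk through what the RAG got wrong and feed the fixes back into your pipeline.
for item in result.corrections:
    print(item)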
MCP server for Claude Desktop, Cursor, and other AI assistants
pip install ragscore[mcp]

{
  "mcpServers": {
    "ragscore": {
      "command": "ragscore",
      "args": ["serve"]
    }
  }
}
"Generate QA pairs from my docs/ folder and evaluate my RAG at http://localhost:8000/query"
generate_qa_dataset: Generate QA pairs from PDFs, TXT, or Markdown files
evaluate_rag: Score your RAG endpoint against a QA dataset
quick_test_rag: Generate + evaluate in one call with pass/fail
get_corrections: Get incorrect answers with suggested fixes
LLM providers are auto-detected from your environment variables. Zero config.
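As a rough illustration of the zero-config flow: export the usual key for whichever provider you use, or run a local Ollama and set nothing at all. OPENAI_API_KEY below is the standard OpenAI variable; exactly which variables RAGScore checks per provider is an assumption here, not something this page documents.

import os
from ragscore import quick_test

# Assumption: RAGScore picks up standard provider keys such as OPENAI_API_KEY.
# With a local Ollama server, skip the key entirely; nothing leaves your machine.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, not a real key

result = quick_test(endpoint="http://localhost:8000/query", docs="docs/")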
From solo developers to enterprise AI teams.
Test your RAG pipeline before deploying. Catch hallucinations and missing context early.
"We caught 15 hallucinations in our legal RAG before going live."
Add quick_test() to your test suite. Fail the build if accuracy drops below threshold.
assert quick_test(endpoint, docs).passed
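Expanded into a test file, that one-liner might look like the sketch below. It relies only on the endpoint/docs arguments and the .passed flag shown earlier; how the pass threshold is tuned is not covered here, so check the docs before hard-coding one.

# tests/test_rag_quality.py
from ragscore import quick_test

def test_rag_answers_docs_questions():
    result = quick_test(
        endpoint="http://localhost:8000/query",
        docs="docs/",
    )
    # Fail the build when the RAG regresses below the pass threshold.
    assert result.passed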
Evaluate RAG accuracy across departments. Finance, HR, Legal: each with different accuracy requirements.
Run offline with Ollama. No data leaves your VPC.
Deliver quantified RAG quality reports to clients. Show before/after improvement metrics.
"Accuracy improved from 62% to 91% after 3 rounds."
Join developers using RAGScore for production RAG evaluation