Privacy-First · Works with Ollama · MCP Server · Multilingual

The pytest for RAG
Evaluate in 2 commands

Generate QA test datasets from your docs. Score your RAG with LLM-as-judge. Privacy-first: works fully offline with Ollama.

$ pip install ragscore

# Step 1: Generate QA pairs from your docs

$ ragscore generate docs/

✅ Generated 50 QA pairs → output/generated_qas.jsonl

# Step 2: Evaluate your RAG system

$ ragscore evaluate http://localhost:8000/query

============================================================

✅ EXCELLENT: 85/100 correct (85.0%)

Average Score: 4.20/5.0

============================================================
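The generated file is plain JSONL, one QA pair per line. A minimal sketch of inspecting it with the standard library; the field names (`question`, `answer`) are assumptions based on the output above, not a documented schema:

```python
import json

# One line of the (assumed) generated_qas.jsonl format:
# each record is a JSON object with at least a question and an answer.
sample_line = '{"question": "What is RAGScore?", "answer": "A RAG evaluation tool."}'

record = json.loads(sample_line)
print(record["question"])  # -> What is RAGScore?
```

In practice you would iterate over the file with `open("output/generated_qas.jsonl")` and `json.loads` each line.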

How It Works

Two simple commands

1

Feed your docs

Point RAGScore at your PDF, TXT, or Markdown files. It reads, chunks, and understands them.

ragscore generate docs/
2

Generate golden QA

LLM creates diverse question-answer pairs with rationale and evidence spans.

→ output/generated_qas.jsonl
3

Score your RAG

Each question is sent to your RAG. LLM-as-judge scores the answers 1-5 across 5 metrics.

ragscore evaluate http://rag/query
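During evaluation, each generated question is sent to the endpoint you supply. The exact request/response schema isn't shown on this page, so the handler below is only a sketch of one plausible contract (a JSON body with a `question` field, answered with a JSON `answer` field):

```python
import json

def handle_query(request_body: str) -> str:
    """Assumed contract: receive {"question": ...}, return {"answer": ...}.

    In a real service this is where your retriever and generator run;
    here a canned answer illustrates the shape only.
    """
    question = json.loads(request_body)["question"]
    answer = f"Stub answer for: {question}"  # replace with your RAG pipeline
    return json.dumps({"answer": answer})

print(handle_query('{"question": "What is RAG?"}'))
```

Check your RAG framework's docs for the actual payload RAGScore expects before wiring this up.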

Why RAGScore?

The fastest way to validate your RAG pipeline

🔒

100% Private

Works with Ollama: no API keys, no cloud, your data never leaves your machine

⚡

Lightning Fast

Async pipeline generates 100 QA pairs in minutes. Evaluate in seconds.

🤖

Any LLM

OpenAI, Anthropic, Ollama, DeepSeek, Groq, Mistral, DashScope: auto-detected from your env vars.

📊

Detailed Metrics

5-metric evaluation: correctness, completeness, relevance, conciseness, faithfulness

🌐

Multilingual

Auto-detects English, Chinese, Japanese, and German. Prompts adapt to your language.

🧩

AI Agent Ready

MCP server for Claude Desktop, Cursor, and other AI assistants

📓

Notebook-Friendly

Rich objects with .plot() and .df, perfect for Jupyter and Colab

🧪

CI/CD Ready

Use quick_test() in pytest. Set accuracy thresholds, get pass/fail, catch regressions automatically.

🔧

Auto-Corrections

Get a list of incorrect answers with corrections. Inject them into your RAG to improve accuracy.


Python API

# One-liner RAG evaluation
from ragscore import quick_test

result = quick_test(
    endpoint="http://localhost:8000/query",
    docs="docs/",
    detailed=True,
)

result.plot()        # Radar chart
result.df            # pandas DataFrame
result.corrections   # Items to fix

Output (detailed=True)

✅ PASSED: 8/10 correct (80%)

Average Score: 4.2/5.0

──────────────────────────────

Correctness: 4.5/5.0

Completeness: 4.2/5.0

Relevance: 4.8/5.0

Conciseness: 3.9/5.0

Faithfulness: 4.6/5.0

โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

2 corrections available.
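The pass/fail summary above can be reproduced in spirit with simple arithmetic: average the per-answer 1-5 judge scores, and count an answer as correct when its score clears a threshold. The threshold and scores below are illustrative, not RAGScore's documented defaults:

```python
# Per-answer judge scores on a 1-5 scale (illustrative data).
scores = [4.5, 4.8, 3.0, 4.2, 2.5, 4.9, 4.0, 4.6, 4.1, 4.7]
THRESHOLD = 4.0  # assumed cutoff for "correct"; not a documented default

average = sum(scores) / len(scores)
correct = sum(1 for s in scores if s >= THRESHOLD)
accuracy = correct / len(scores)

print(f"{'PASSED' if accuracy >= 0.8 else 'FAILED'}: "
      f"{correct}/{len(scores)} correct ({accuracy:.0%})")
print(f"Average Score: {average:.2f}/5.0")
```
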

Use from Claude Desktop

MCP server for Claude Desktop, Cursor, and other AI assistants

1

Install with MCP support

pip install ragscore[mcp]
2

Add to Claude Desktop config

{
  "mcpServers": {
    "ragscore": {
      "command": "ragscore",
      "args": ["serve"]
    }
  }
}

3

Ask Claude to evaluate your RAG

"Generate QA pairs from my docs/ folder and evaluate my RAG at http://localhost:8000/query"
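The config block above can also be merged into an existing claude_desktop_config.json programmatically. This sketch only manipulates the JSON structure and assumes nothing about where the file lives on your machine:

```python
import json

# Existing config (possibly with other servers already registered).
config = {"mcpServers": {}}

# Register the ragscore MCP server, matching the snippet above.
config["mcpServers"]["ragscore"] = {
    "command": "ragscore",
    "args": ["serve"],
}

print(json.dumps(config, indent=2))
```
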

Available MCP Tools
generate_qa_dataset

Generate QA pairs from PDFs, TXT, or Markdown files

evaluate_rag

Score your RAG endpoint against a QA dataset

quick_test_rag

Generate + evaluate in one call with pass/fail

get_corrections

Get incorrect answers with suggested fixes

Works with every LLM

Auto-detected from your environment variables. Zero config.

Ollama · OpenAI · Anthropic · DeepSeek · Groq · Mistral · DashScope · Grok · Together AI · vLLM · Any OpenAI-compatible
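Provider auto-detection presumably keys off well-known environment variables. The mapping below is a guess at how such detection could work, not RAGScore's actual logic; apart from the standard OPENAI_API_KEY and ANTHROPIC_API_KEY, the variable names and the detection order are assumptions:

```python
import os

# Hypothetical detection order; RAGScore's real logic may differ.
PROVIDER_ENV_VARS = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("groq", "GROQ_API_KEY"),
    ("mistral", "MISTRAL_API_KEY"),
]

def detect_provider(env=os.environ) -> str:
    for provider, var in PROVIDER_ENV_VARS:
        if env.get(var):
            return provider
    return "ollama"  # offline fallback; no key required

print(detect_provider({"GROQ_API_KEY": "gsk-..."}))  # -> groq
```
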

Who uses RAGScore?

From solo developers to enterprise AI teams.

AI Engineers

Test your RAG pipeline before deploying. Catch hallucinations and missing context early.

"We caught 15 hallucinations in our legal RAG before going live."

MLOps / CI/CD

Add quick_test() to your test suite. Fail the build if accuracy drops below threshold.

assert quick_test(endpoint, docs).passed

Enterprise Teams

Evaluate RAG accuracy across departments. Finance, HR, Legal: each with different accuracy requirements.

Run offline with Ollama. No data leaves your VPC.

AI Consultants

Deliver quantified RAG quality reports to clients. Show before/after improvement metrics.

"Accuracy improved from 62% to 91% after 3 rounds."

Ready to test your RAG?

Join developers using RAGScore for production RAG evaluation