
LLM Response Judge

Auto-grade LLM responses with customizable rubrics, multi-provider evaluation, AI-powered rewrites, and a real-time React dashboard — no manual review required


Problem Statement

We asked NEO to build a full-stack LLM evaluation web app with:


Solution Overview

NEO built a production-ready LLM evaluation app that replaces manual review with automated rubric-driven grading:

  1. Multi-Provider Evaluation Engine routes to Claude, GPT-4, Gemini, OpenRouter, or Ollama — same rubric, same scoring, any provider
  2. Customizable Rubric System offers 3 built-in presets plus a full rubric editor with weighted criteria
  3. Auto-Improvement Pipeline rewrites bottom-performing responses with predicted score improvements
  4. Real-Time React Dashboard shows live progress, per-criterion breakdowns, critical issue flags (bottom 10%), and Chart.js visualizations
  5. Demo Mode loads 20 pre-evaluated responses instantly — no API key, full feature exploration in under 60 seconds
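
To make "weighted criteria" concrete, here is a minimal sketch of how a rubric could be modeled and its weights normalized. The `Criterion` and `Rubric` classes, the preset name, and the example weights are illustrative assumptions, not the repository's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric criterion (hypothetical shape, not the repo's schema)."""
    name: str
    weight: float      # relative importance; weights need not sum to 1
    max_score: int = 10


@dataclass(frozen=True)
class Rubric:
    """A named set of weighted criteria (hypothetical)."""
    name: str
    criteria: tuple[Criterion, ...]

    def normalized_weights(self) -> dict[str, float]:
        """Scale weights so they sum to 1.0, rejecting invalid rubrics."""
        if not self.criteria:
            raise ValueError("rubric needs at least one criterion")
        names = [c.name for c in self.criteria]
        if len(set(names)) != len(names):
            raise ValueError("criterion names must be unique")
        if any(c.weight <= 0 for c in self.criteria):
            raise ValueError("weights must be positive")
        total = sum(c.weight for c in self.criteria)
        return {c.name: c.weight / total for c in self.criteria}


# Illustrative rubric; the name and weights are made up for this sketch.
support_rubric = Rubric(
    name="customer-support",
    criteria=(
        Criterion("Accuracy", weight=3),
        Criterion("Clarity", weight=2),
        Criterion("Empathy", weight=1),
    ),
)
weights = support_rubric.normalized_weights()
```

Normalizing inside the rubric lets an editor accept any positive weights (3, 2, 1 here) while the scoring step still works with fractions that sum to 1.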

Workflow / Pipeline

| Step | Description |
| --- | --- |
| 1. File Upload & Parsing | FileUpload.jsx validates CSV/JSON structure and checks required fields (question, response) before evaluation begins |
| 2. Provider & Rubric Selection | Choose any LLM provider and an evaluation rubric: 3 built-in presets or a custom rubric with weighted criteria via RubricEditor.jsx |
| 3. API Key Validation | Key validated via /validate-key before evaluation starts; stored in browser localStorage only, never persisted server-side |
| 4. Batch Evaluation | FastAPI routes each response through the evaluation prompt (backend/prompts/judge.py) and scores each criterion with weighted aggregation |
| 5. Per-Criterion Scoring | LLM returns structured scores per rubric criterion with justification text, giving granular feedback beyond a single composite score |
| 6. Critical Issue Detection | Dashboard automatically flags the bottom 10% of responses for priority review; no manual sorting needed |
| 7. Auto-Improvement | /improve endpoint sends the original Q&A + rubric context to the LLM and returns a rewrite with predicted score gain |
| 8. Dashboard & Export | Real-time Chart.js visualizations, sortable results table, expandable per-response breakdowns; exportable to PDF or Markdown |
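
The upload check in step 1 runs in the browser (FileUpload.jsx). As a rough illustration of the same idea, the Python sketch below parses CSV or JSON text and rejects rows that lack a non-empty `question` or `response`. The function name `parse_upload` and the error messages are assumptions made for this example; the project's real validation may differ.

```python
import csv
import io
import json

REQUIRED_FIELDS = ("question", "response")


def parse_upload(text: str, kind: str) -> list[dict]:
    """Parse CSV or JSON text into rows; raise ValueError on invalid input.

    Hypothetical helper: mirrors the required-field check described above.
    """
    if kind == "csv":
        rows = list(csv.DictReader(io.StringIO(text)))
    elif kind == "json":
        rows = json.loads(text)  # JSONDecodeError is a ValueError subclass
        if not isinstance(rows, list):
            raise ValueError("JSON upload must be a list of objects")
    else:
        raise ValueError(f"unsupported file type: {kind}")

    for i, row in enumerate(rows, start=1):
        if not isinstance(row, dict):
            raise ValueError(f"row {i}: expected an object")
        # A field counts as missing when absent, None, or blank.
        missing = [f for f in REQUIRED_FIELDS if not str(row.get(f) or "").strip()]
        if missing:
            raise ValueError(f"row {i}: missing {', '.join(missing)}")
    return rows


rows = parse_upload(
    "question,response\nHow do I reset my password?,Click Forgot password.\n",
    "csv",
)
```

Failing fast here means a malformed file is caught before any provider calls are made.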

Repository & Artifacts

Dakshjain1604/LLM-response-Judge (View on GitHub)

Generated Artifacts:


Technical Details


Results

Example Evaluation Output (Single Response)

Question: "How do I reset my password?"

Evaluation Results:
┌──────────────────┬───────┬────────────────────────────────────┐
│ Criterion        │ Score │ Justification                      │
├──────────────────┼───────┼────────────────────────────────────┤
│ Accuracy         │  9/10 │ Correct steps, no misleading info  │
│ Clarity          │  8/10 │ Clear and concise, minor gaps      │
│ Empathy          │  6/10 │ Lacks warm acknowledgement         │
│ Completeness     │  7/10 │ Missing 2FA recovery mention       │
│ Actionability    │  9/10 │ User can act immediately           │
└──────────────────┴───────┴────────────────────────────────────┘

Composite Score: 78/100  |  ⚠️ Needs Improvement
Predicted Score After Auto-Improvement: 91/100  (+13 points)
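
The composite above is consistent with an equally weighted average of the five criterion scores: 9 + 8 + 6 + 7 + 9 = 39 out of 50, which is 78/100. The sketch below shows one way such a weighted aggregation could be computed. The function name `composite_score` and the equal weights are assumptions; the weights this example actually used are not stated here.

```python
def composite_score(
    scores: dict[str, float],
    weights: dict[str, float],
    max_score: float = 10.0,
) -> float:
    """Weighted mean of per-criterion scores, rescaled to 0-100.

    Hypothetical helper for illustration, not the project's scoring code.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(scores[name] * weights[name] for name in scores)
    return 100.0 * weighted / (total_weight * max_score)


# Per-criterion scores taken from the table above.
scores = {
    "Accuracy": 9,
    "Clarity": 8,
    "Empathy": 6,
    "Completeness": 7,
    "Actionability": 9,
}
equal_weights = {name: 1.0 for name in scores}  # assumption: equal weighting
print(round(composite_score(scores, equal_weights)))  # 78
```

With unequal weights, the same function shifts the composite toward the criteria that matter most. For instance, a heavier weight on Empathy would pull this response's score down.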

Batch Evaluation Summary (100 Responses)

Batch Evaluation Complete
─────────────────────────────────────────────
Total Responses:    100   ✓
Average Score:       74.3 / 100

Score Distribution:
  Excellent (90+):  18   ██████
  Good (75-89):     41   █████████████
  Fair (60-74):     29   ██████████
  Poor (<60):       12   ████

Critical Issues Flagged: 10  (bottom 10%)
Evaluation Time: 2m 47s  |  Provider: Claude 3.5 Sonnet
─────────────────────────────────────────────
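
A summary like the one above can be derived from the list of composite scores alone. The sketch below buckets scores into the four bands shown and flags the lowest-scoring 10%. The names `band` and `summarize`, and the choice to round the flag count up, are assumptions for illustration. The sample scores are synthetic, chosen only to reproduce the band counts above.

```python
import math
from collections import Counter


def band(score: float) -> str:
    """Map a 0-100 composite score to the bands used in the summary."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Fair"
    return "Poor"


def summarize(scores: list[float], flag_percent: int = 10):
    """Return (band counts, indices of the lowest-scoring responses)."""
    counts = Counter(band(s) for s in scores)
    # Assumption: round the flag count up so small batches still flag one item.
    n_flag = math.ceil(len(scores) * flag_percent / 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return counts, ranked[:n_flag]


# Synthetic scores that reproduce the 18 / 41 / 29 / 12 split above.
sample = [95] * 18 + [80] * 41 + [65] * 29 + [40] * 12
counts, flagged = summarize(sample)
print(dict(counts), len(flagged))
```

Flagging by rank rather than by a fixed threshold means every batch surfaces some responses for review, even when all scores are high.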

Best Practices & Lessons Learned


Next Steps


References

View source on GitHub


Learn More