Generates 200 synthetic Q&A pairs, runs each question through the RAG pipeline, then uses an LLM-as-judge to score every answer for groundedness and completeness.
Click "Run Answer Benchmarks" to evaluate answer quality.
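
The benchmark flow described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual implementation: `run_answer_benchmark`, `JudgedAnswer`, the `rag_answer` and `judge` callables, and the 0 to 1 score scale are all hypothetical names and assumptions introduced here. The pipeline and the judge are passed in as plain functions so that any RAG backend or LLM client could be plugged in.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Tuple


@dataclass
class JudgedAnswer:
    """One benchmark row: the question, the RAG answer, and the judge's scores."""
    question: str
    answer: str
    groundedness: float   # assumed scale: 0.0 (unsupported) to 1.0 (fully supported)
    completeness: float   # assumed scale: 0.0 (misses the reference) to 1.0 (covers it)


def run_answer_benchmark(
    qa_pairs: List[Tuple[str, str]],
    rag_answer: Callable[[str], Tuple[str, List[str]]],
    judge: Callable[[str, str, str, List[str]], Dict[str, float]],
) -> Dict[str, float]:
    """Run every synthetic question through the pipeline and average the judge scores.

    qa_pairs   -- (question, reference_answer) tuples, e.g. the 200 synthetic pairs
    rag_answer -- hypothetical hook: question -> (answer, retrieved_contexts)
    judge      -- hypothetical hook returning {"groundedness": x, "completeness": y}
    """
    if not qa_pairs:
        raise ValueError("qa_pairs must contain at least one Q&A pair")

    rows: List[JudgedAnswer] = []
    for question, reference in qa_pairs:
        answer, contexts = rag_answer(question)
        scores = judge(question, reference, answer, contexts)
        rows.append(
            JudgedAnswer(
                question=question,
                answer=answer,
                groundedness=scores["groundedness"],
                completeness=scores["completeness"],
            )
        )

    return {
        "n": len(rows),
        "groundedness": mean(r.groundedness for r in rows),
        "completeness": mean(r.completeness for r in rows),
    }


if __name__ == "__main__":
    # Stand-ins for demonstration only: a real run would call the RAG pipeline
    # and an LLM judge instead of these toy functions.
    def toy_rag(question: str) -> Tuple[str, List[str]]:
        return "Paris is the capital of France.", ["France's capital is Paris."]

    def toy_judge(question: str, reference: str, answer: str,
                  contexts: List[str]) -> Dict[str, float]:
        grounded = 1.0 if any("Paris" in c for c in contexts) else 0.0
        complete = 1.0 if reference.lower() in answer.lower() else 0.0
        return {"groundedness": grounded, "completeness": complete}

    print(run_answer_benchmark([("What is the capital of France?", "Paris")],
                               toy_rag, toy_judge))
```

Injecting `rag_answer` and `judge` as callables keeps the loop testable without network access. The toy judge here is a string check standing in for an LLM call, so its scores say nothing about real answer quality.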