Inferential Question Answering (Inferential QA) introduces a new class of reasoning QA tasks that challenge models to infer answers from indirect textual evidence rather than extracting them directly from answer-containing passages.
We present QUIT (QUestions requiring Inference from Texts), a large-scale benchmark of 7,401 questions and 2.4 million passages, designed to evaluate how well modern retrieval-augmented systems and large language models (LLMs) can perform inference-based reasoning.
Most existing QA datasets assume answer containment: that the answer explicitly appears in a retrieved passage.
However, many real-world questions (e.g., educational reasoning, knowledge-based inference) require deriving answers from clues and context instead.
Inferential QA bridges this gap by focusing on answer-supporting passages: passages that provide evidence for inference, not the answer itself.
QUIT (QUestions requiring Inference from Texts) is a large-scale benchmark designed to test whether modern QA systems can solve questions where:
- the evidence is present
- but the answer is not explicitly stated
Unlike traditional QA datasets, QUIT focuses on answer-supporting passages: passages that contain clues, not spans.
- 7,401 inference-heavy questions
- 2.4M passages built from compositional hint combinations
- Each question has 325 candidate passages
- Multi-level relevance labels (an illustrative record follows this list):
  - 2: fully relevant (enables inference)
  - 1: partially relevant (weak or indirect evidence)
  - 0: irrelevant
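For orientation, a labeled question–passage entry can be pictured as below. This is a hypothetical sketch in Python: the field names and example content are illustrative only; consult the HuggingFace files for the exact schema.

```python
# Hypothetical record sketch (field names and content are illustrative only;
# see the HuggingFace files for the real schema).
# Relevance labels: 2 = fully relevant, 1 = partially relevant, 0 = irrelevant.
example = {
    "question": "Which planet has the shortest year?",
    "answer": "Mercury",
    "candidate_passages": [
        {"text": "It is the closest planet to the Sun ...", "label": 2},
        {"text": "It has no natural satellites ...", "label": 1},
        {"text": "The Great Red Spot is a giant storm ...", "label": 0},
    ],
}
```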
| Split | # Questions | # Passages |
|---|---|---|
| Train | 4,811 | 1,563,575 |
| Dev | 862 | 280,150 |
| Test | 1,728 | 561,600 |
| Total | 7,401 | 2,405,325 |
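The split sizes are internally consistent with the 325 candidate passages per question; a quick check:

```python
# Each question has 325 candidate passages, so # Passages = # Questions * 325.
splits = {"Train": 4811, "Dev": 862, "Test": 1728}
for name, n_questions in splits.items():
    print(name, n_questions * 325)   # Train 1563575, Dev 280150, Test 561600
print("Total", sum(splits.values()) * 325)  # Total 2405325
```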
The full QUIT benchmark is publicly available on HuggingFace:
HuggingFace Dataset: https://huggingface.co/datasets/JamshidJDMY/InferentialQA
- Corpus (2.4M passages): https://huggingface.co/datasets/JamshidJDMY/InferentialQA/resolve/main/corpus/corpus.jsonl?download=true
- Train Set (4,811 questions): https://huggingface.co/datasets/JamshidJDMY/InferentialQA/resolve/main/train.json?download=true
- Dev Set (862 questions): https://huggingface.co/datasets/JamshidJDMY/InferentialQA/resolve/main/dev.json?download=true
- Test Set (1,728 questions): https://huggingface.co/datasets/JamshidJDMY/InferentialQA/resolve/main/test.json?download=true
- Use the Corpus for indexing (retrievers / rerankers)
- Use Train for fine-tuning retrievers/rerankers
- Use Dev/Test for fair comparison and reporting benchmark numbers
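As a quick start, the files can be pulled programmatically. The snippet below is a minimal sketch (not an official loader) using the huggingface_hub and datasets libraries, assuming the file layout shown in the download links above and that train.json is standard JSON.

```python
# Minimal loading sketch; assumes the file layout from the download links above.
import json
from huggingface_hub import hf_hub_download
from datasets import load_dataset

REPO = "JamshidJDMY/InferentialQA"

# Question split (JSON file at the repository root; adjust if it is JSON Lines).
train_path = hf_hub_download(REPO, "train.json", repo_type="dataset")
with open(train_path) as f:
    train_questions = json.load(f)

# 2.4M-passage corpus (JSONL); streaming avoids loading it all into memory.
corpus = load_dataset(
    "json",
    data_files=f"https://huggingface.co/datasets/{REPO}/resolve/main/corpus/corpus.jsonl",
    split="train",
    streaming=True,
)
print(len(train_questions), next(iter(corpus)))
```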
QUIT is constructed in two stages:
- Source datasets: TriviaHG (machine-authored hints) & WikiHint (human-authored hints)
- Filtered using BEM to remove answer leakage
- Question type and difficulty estimated via HintEval
- Removed questions that LLMs could answer parametrically (without context)
- Generated all permutations of every non-empty subset of the top-5 hints per question, yielding 325 passages per question (see the sanity check after this list)
- Labeled using Gemma 3 1B, Qwen 3 4B, and LLaMA 3.1 8B with GPT-Eval
- Dev/Test verified by human annotators and relabeled for leakage
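As a quick sanity check on the 325 figure: 325 is exactly the number of non-empty ordered subsets of 5 hints, i.e. P(5,1) + P(5,2) + P(5,3) + P(5,4) + P(5,5) = 5 + 20 + 60 + 120 + 120.

```python
from itertools import permutations

# Count all ordered, non-empty subsets (permutations of every subset size)
# of the top-5 hints of a question.
hints = ["h1", "h2", "h3", "h4", "h5"]
count = sum(1 for k in range(1, len(hints) + 1) for _ in permutations(hints, k))
print(count)  # 325
```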
We evaluate a Retriever → Reranker → Reader pipeline across multiple models:
| Component | Models |
|---|---|
| Retrievers | BM25, DPR, ColBERT, Contriever, BGE |
| Rerankers | LiT5, MonoT5, RankGPT, RankT5, UPR |
| Readers (LLMs) | LLaMA 3.2 1B, Gemma 3 4B, Qwen 3 8B |
Evaluation metrics: Hit@K, Recall@K, MRR, NDCG@K, and Exact Match (EM).
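For reference, here is a minimal sketch of two of these ranking metrics on graded labels. It is illustrative only, not the official evaluation code; treating only label 2 as relevant is an assumption.

```python
# Illustrative ranking metrics over a ranked list of graded relevance labels
# (2 = fully relevant, 1 = partial, 0 = irrelevant). Not the official script;
# counting only label 2 as "relevant" is an assumption.
def hit_at_k(labels, k):
    return float(any(label == 2 for label in labels[:k]))

def mrr(labels):
    for rank, label in enumerate(labels, start=1):
        if label == 2:
            return 1.0 / rank
    return 0.0

ranked = [0, 1, 2, 0, 2]        # labels of passages in retrieved order
print(hit_at_k(ranked, k=3))    # 1.0
print(round(mrr(ranked), 3))    # 0.333
```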
If retrieval and reranking were perfect, LLMs could achieve ~90% EM (oracle).
However, current pipelines reach only ~10–15% EM.
General-purpose LLMs (Gemma 3 4B) outperform reasoning-oriented ones (Qwen 3 8B), showing that scaling or reasoning orientation alone does not solve inference-based QA.
- Retrieval is the dominant bottleneck: current retrievers cannot locate answer-supporting passages.
- Reranking helps little; fine-tuning retrievers and rerankers gives inconsistent gains.
- General-purpose LLMs (e.g., Gemma 3 4B) handle inferential reasoning better than reasoning-specialized ones.
- The gap between Oracle (~90% EM) and real pipelines (~10%) exposes the core limitation of today's RAG systems in inference-based reasoning.
We release QUIT together with full reproducibility scripts and pre-computed results, so anyone can:
- reproduce all benchmark numbers
- evaluate new retrievers / rerankers / readers
- compare against strong baselines
Recommended: Python 3.10 (some dependencies are not fully compatible with newer versions)
git clone https://github.com/DataScienceUIBK/InferentialQA.git
cd InferentialQA
pip install -r requirements.txt

All experiments are organized inside experiments/.
To reproduce any experiment:
- go to its folder
- run the provided run.sh
Suggested order (end-to-end benchmark reproduction):
- experiments/dataset: Download QUIT from HuggingFace
- experiments/index: Build indexes and preprocess the corpus
- experiments/baseline: Wikipedia / MSMARCO baselines
- experiments/vanilla/oracle-rerankers: Oracle reranker experiments (upper-bound analysis)
- experiments/vanilla/retrievers: Retriever-only benchmark runs
- experiments/vanilla/rerankers: Retriever + reranker
- experiments/vanilla/rag: Full Retriever → Reranker → Reader pipeline
We also provide scripts to fine-tune components on QUIT:
- experiments/finetuning/colbert
- experiments/finetuning/dpr
- experiments/finetuning/monot5
And complete pipeline evaluations:
- experiments/finetuning_pipeline/ft-retriever/reranker
- experiments/finetuning_pipeline/ft-retriever/rag
- experiments/finetuning_pipeline/ft-reranker/retriever
- experiments/finetuning_pipeline/ft-reranker/rag
- experiments/finetuning_pipeline/ft-reranker/retriever_reranker
Note: some fine-tuning experiments require serious compute (e.g., ≥ 1× NVIDIA A100 GPU) and can take multiple days.
Some fine-tuning pipelines rely on external toolkits. Please set up their environments separately:
- ColBERT (usage & fine-tuning): follow the official repository: https://github.com/stanford-futuredata/ColBERT
- DPR fine-tuning: use Tevatron and follow their instructions: https://github.com/texttron/tevatron
- MonoT5 fine-tuning: use pygaggle and follow their instructions: https://github.com/castorini/pygaggle
No powerful resources? No problem.
We provide precomputed outputs for all benchmark experiments. To reproduce tables and analysis from the paper:
- go to the results/ directory
- run the Python scripts
They will automatically download the needed files from HuggingFace and display the final results.
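If you want to inspect which precomputed files exist before running the scripts, one option is to list the repository contents. The sketch below uses huggingface_hub; the "result" filename filter is only a guess about the naming.

```python
# Sketch: list files in the HuggingFace dataset repo to find precomputed outputs.
# The "result" substring filter is an assumption about file naming.
from huggingface_hub import list_repo_files

files = list_repo_files("JamshidJDMY/InferentialQA", repo_type="dataset")
print([f for f in files if "result" in f.lower()])
```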
This option makes QUIT easy to use for:
- quick benchmarking
- ablation studies
- comparing new models
- classroom/educational usage
| Rank | Model | Retriever | Reranker | Reader | EM |
|---|---|---|---|---|---|
| – | Optimal | – | – | Gemma-3-4B | 90.16% |
| 1 | Baseline | BGE | MonoT5 | Gemma-3-4B | 15.34% |
| 2 | Baseline | BGE | FT-MonoT5 | Gemma-3-4B | 13.89% |
| 3 | Baseline | BGE | – | Gemma-3-4B | 13.18% |
Stay tuned for the official leaderboard and evaluation scripts.
- Inferential QA requires reasoning from clues, not explicit spans
- Current retrievers and rerankers fail to identify sufficient evidence
- Fine-tuning is insufficient; new paradigms for retrieval-augmented reasoning are needed
- QUIT exposes a fundamental limitation in today's QA pipelines and opens a new research direction
If you use InferentialQA / QUIT in your research, please cite our paper:
We will update this BibTeX entry once the paper is officially published.
This project is released under the MIT License. See the LICENSE file for details.
- Introduce Inferential QA, a new reasoning-based QA task
- Construct QUIT, the first large-scale dataset for inferential question answering
- Evaluate retrievers, rerankers, and LLM readers extensively
- Show that current QA pipelines fail under inference-based reasoning