research · ml · retrieval · evaluation

Retrieval Adaptation Pipeline

Config-driven ML evaluation infrastructure for procedural adaptation under disruption.

Status: Research · Purdue CS490
Role: Research project
Stack: Python · PyTorch · FAISS · Transformers · Hugging Face · vLLM
Fig. 01: Architecture diagram

01 Problem

Standard vision-language model (VLM) evaluations test what a model knows, not how it adapts when inputs are disrupted or incomplete. Procedural adaptation (substituting for a missing ingredient, working around an unavailable tool or step) requires finding and applying analogous procedural knowledge rather than recalling a memorised answer.

There was no benchmark specifically designed to test procedural adaptation under disruption across multiple disruption types, and no systematic comparison of retrieval augmentation versus vision-grounded strategies on this task.

02 Disruption taxonomy

Disruptions fall into three categories: ingredient removal (a required ingredient is absent and must be substituted), ingredient unavailability (an alternative must be identified from the retrieval library), and procedural unavailability (a step or tool cannot be used and the procedure must be replanned).

Each category requires a different adaptation strategy. The taxonomy shapes what the benchmark can measure — the three-way split enables separate analysis of substitution, identification, and replanning as distinct capabilities.

03 Benchmark construction

YouCook2 and WikiHow were used as source corpora. YouCook2 provides video-grounded procedural recipes with step-level annotations; WikiHow provides text-grounded procedural instructions across a broader domain.

Total benchmark size: 3,292 examples — 2,342 training and retrieval-library rows, 475 development rows, and 475 test rows. Each example includes a disrupted procedural context, the disruption type label, and a ground-truth adaptation annotation.

A gold-subset review was applied to the development and test splits: examples were manually inspected for annotation quality, and ambiguous cases were filtered out before final evaluation.

04 Retrieval pipeline

Three retrieval strategies were implemented: BM25 sparse retrieval over the text corpus; dense vector retrieval using FAISS with procedure-level embeddings indexed from the training and retrieval-library rows; and cross-modal reranking combining text retrieval scores with visual similarity signals.
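To make the sparse baseline concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the project's implementation: the `BM25Index` class, the whitespace tokenizer, the `k1` and `b` defaults, and the three-sentence library are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumed; a real pipeline would likely use something richer)."""
    return text.lower().split()


class BM25Index:
    """Minimal Okapi BM25 over an in-memory, non-empty list of documents (illustrative only)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_lens = [len(toks) for toks in self.doc_tokens]
        self.avg_len = sum(self.doc_lens) / len(self.doc_lens)
        self.term_freqs = [Counter(toks) for toks in self.doc_tokens]
        # Document frequency: number of documents containing each term.
        df = Counter()
        for toks in self.doc_tokens:
            df.update(set(toks))
        n = len(docs)
        # Smoothed IDF; the "+ 1.0" inside the log keeps every weight positive.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0) for t, f in df.items()}

    def score(self, query, doc_id):
        """BM25 score of one document for the query; 0.0 when no query term matches."""
        tf = self.term_freqs[doc_id]
        # Length normalisation: longer-than-average documents are penalised via b.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[doc_id] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                total += self.idf[term] * tf[term] * (self.k1 + 1) / (tf[term] + norm)
        return total

    def search(self, query, top_k=3):
        """Return the top_k (doc_id, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Hypothetical retrieval library, not taken from the benchmark.
library = [
    "whisk the eggs and fold into the batter",
    "replace butter with olive oil when sauteing onions",
    "chop onions and saute in butter until golden",
]
index = BM25Index(library)
results = index.search("substitute for butter when sauteing", top_k=2)
```

Here the second library entry ranks first because it shares three query terms, while the third shares only one. The dense and cross-modal strategies would replace `score` with embedding similarity and a fused score respectively; those are not sketched here.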

Retrieved contexts are injected into the VLM prompt as augmentation. The retrieval pipeline logs retrieved candidates and their scores to a persisted trace file, enabling post-hoc analysis of retrieval quality independently of generation quality.
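One plausible shape for such a trace is a JSON Lines file with one record per example. The sketch below assumes that format; the field names (`example_id`, `method`, `candidates`) and the helper names are hypothetical, not the project's actual schema.

```python
import json
import os
import tempfile


def append_trace(path, example_id, method, candidates):
    """Append one retrieval record as a JSON line.

    `candidates` is a list of (doc_id, score) pairs, best first.
    """
    record = {
        "example_id": example_id,
        "method": method,
        "candidates": [{"doc_id": d, "score": s} for d, s in candidates],
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_trace(path):
    """Load every record from a JSONL trace file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up ids and scores, written to a temporary directory for the demo.
trace_path = os.path.join(tempfile.mkdtemp(), "retrieval_trace.jsonl")
append_trace(trace_path, "ex-001", "bm25", [("doc-17", 4.2), ("doc-03", 1.1)])
append_trace(trace_path, "ex-002", "bm25", [("doc-08", 2.7)])
records = read_trace(trace_path)
```

An append-only, line-oriented file keeps each record independent, so retrieval quality can be inspected without loading generation outputs.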

05 Experiment orchestration

Experiments are defined as YAML manifests specifying dataset split, retrieval method, model, and evaluation metric. The pipeline supports resuming: an interrupted experiment picks up from the last completed batch using the persisted trace.
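Resume-from-trace can be sketched as follows. This is an assumption-laden illustration: it skips examples by id rather than by batch index, and `completed_ids`, `run_with_resume`, the `example_id` field, and the stand-in `process` callable are all hypothetical names.

```python
import json
import os
import tempfile


def completed_ids(trace_path):
    """Return the example ids already recorded in the trace (empty set if no trace yet)."""
    if not os.path.exists(trace_path):
        return set()
    done = set()
    with open(trace_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                done.add(json.loads(line)["example_id"])
    return done


def run_with_resume(examples, trace_path, process, batch_size=2):
    """Process only the examples missing from the trace; return how many were processed."""
    done = completed_ids(trace_path)
    pending = [ex for ex in examples if ex["id"] not in done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        outputs = [{"example_id": ex["id"], "output": process(ex)} for ex in batch]
        # Append the finished batch before starting the next one, so an
        # interruption between batches does not lose completed work.
        with open(trace_path, "a", encoding="utf-8") as f:
            for row in outputs:
                f.write(json.dumps(row) + "\n")
    return len(pending)


examples = [{"id": f"ex-{i}", "text": f"step {i}"} for i in range(5)]
trace_path = os.path.join(tempfile.mkdtemp(), "trace.jsonl")

# Simulate an interrupted run that only reached the first three examples,
# then a second run over the full set.
first_run = run_with_resume(examples[:3], trace_path, lambda ex: ex["text"].upper())
second_run = run_with_resume(examples, trace_path, lambda ex: ex["text"].upper())
```

The second call processes only the two examples that the first call never reached. A sturdier version would also need to tolerate a partially written final line.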

Config-driven design means adding a new model or retrieval method requires only a manifest change, not a code change. This was important for comparing methods without diverging codebases across runs.
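A common way to get this property is a registry that maps manifest strings to implementations. The sketch below assumes that pattern; the `Manifest` fields mirror the four keys named above, while `RETRIEVERS`, `register_retriever`, `top_k`, the toy retrievers, and the `placeholder-vlm` value are hypothetical. A plain dict stands in for the parsed YAML.

```python
from dataclasses import dataclass

# Registry mapping the manifest's `retrieval` string to a callable.
RETRIEVERS = {}


def register_retriever(name):
    """Decorator that records a retrieval function under a manifest-facing name."""
    def decorator(fn):
        RETRIEVERS[name] = fn
        return fn
    return decorator


@register_retriever("none")
def no_retrieval(query, library, top_k):
    """Baseline: no augmentation."""
    return []


@register_retriever("keyword")
def keyword_overlap(query, library, top_k):
    """Toy retriever: rank documents by number of shared lowercase tokens."""
    q = set(query.lower().split())
    ranked = sorted(library, key=lambda doc: len(q & set(doc.lower().split())), reverse=True)
    return ranked[:top_k]


@dataclass(frozen=True)
class Manifest:
    split: str
    retrieval: str
    model: str
    metric: str
    top_k: int = 2


def build_retriever(manifest):
    """Look up the retrieval function named in the manifest."""
    try:
        return RETRIEVERS[manifest.retrieval]
    except KeyError:
        raise ValueError(f"unknown retrieval method: {manifest.retrieval!r}") from None


# Stand-in for a manifest after YAML parsing (values are illustrative).
config = {"split": "dev", "retrieval": "keyword", "model": "placeholder-vlm", "metric": "token_f1"}
manifest = Manifest(**config)
retrieve = build_retriever(manifest)
hits = retrieve(
    "no butter available",
    ["use olive oil instead of butter", "preheat the oven"],
    manifest.top_k,
)
```

With this layout, switching a run from `keyword` to `none` is a one-line manifest edit. A component still has to be registered once before a manifest can name it.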

06 Evaluation pipeline

Two evaluation modes are supported: rule-based (exact-match and token-overlap metrics against ground-truth annotations) and LLM judge (a prompted judge model scores adaptation quality along three axes: correctness, completeness, and plausibility).
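The rule-based mode can be illustrated with exact match and a token-level F1. The source does not specify the exact normalisation or overlap formula, so lowercasing, whitespace tokenisation, and F1 as the overlap measure are assumptions here, as are the function names.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (assumed normalisation)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalised token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings, not benchmark data.
reference = "use olive oil instead of butter"
prediction = "use olive oil"
em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
```

For this pair, exact match is 0.0 while token F1 is 2/3 (precision 1.0, recall 0.5), which shows how the overlap metric gives partial credit that exact match does not.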

Both modes operate on persisted generation outputs — the evaluation step is decoupled from inference. Generation can be run once and evaluated multiple ways without re-running the model.

07 Reproducibility

Retrieval traces, generation outputs, and evaluation scores are persisted to disk at each pipeline stage. Experiments can be reproduced by re-running the same manifest with the same random seed. The benchmark splits are fixed and versioned.
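How the seed is threaded through the pipeline is not described above. One simple approach, sketched here as an assumption, is to pass the manifest's seed into a private random number generator rather than relying on global state; `sample_ids` is a hypothetical helper.

```python
import random


def sample_ids(ids, k, seed):
    """Draw k ids with a private RNG so the draw depends only on (ids, k, seed)."""
    rng = random.Random(seed)
    return rng.sample(ids, k)


ids = [f"ex-{i}" for i in range(20)]
run_a = sample_ids(ids, 5, seed=13)
run_b = sample_ids(ids, 5, seed=13)
```

Both calls return the same five ids in the same order, because the generator is rebuilt from the same seed each time and shares no state with other code.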

The retrieval index is built deterministically from the training and retrieval-library rows and can be rebuilt from the corpus without reprocessing the benchmark examples.

08 Tradeoffs

The benchmark assumes procedural knowledge transfer is the right frame — that adaptation improves by finding relevant procedural analogies. This may not hold for tasks where the adaptation requires reasoning rather than retrieval.

YouCook2 and WikiHow are constrained domains. Results on this benchmark do not necessarily predict procedural adaptation performance in other domains (scientific protocols, code, manufacturing).

09 Limitations

Final model-scale benchmarking was not completed within the project scope. The pipeline infrastructure for evaluation is in place, but the full comparison across model scales was not run to completion, so results across VLM configurations should be treated as preliminary.

The LLM judge evaluation introduces its own consistency concerns: judge model scores can vary across runs and are sensitive to prompt wording. Rule-based metrics are more reproducible but capture fewer aspects of adaptation quality.

10 What I learned

Key insight

Benchmark construction is design work: the framing determines which capabilities are visible and which are not. The disruption taxonomy is a design choice, not a neutral observation about the task.

Persisting retrieval traces as a first-class output separates retrieval quality from generation quality. Without traces, a generation failure is ambiguous — it could be the model or the retrieval. With traces, both are independently inspectable.