Thresholding Logic

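AlignAIR uses a dynamic thresholding strategy to convert model likelihood outputs into final allele calls. Because the threshold scales with the highest likelihood in each segment, this post-processing step returns a single allele when the prediction is confident and several when the candidates cannot be clearly separated.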

Overview

The AlignAIR model outputs a likelihood vector for each of the V, D, and J gene segments. Each vector contains probabilities corresponding to each possible allele in the reference set. To determine the final predicted alleles, AlignAIR applies a Maximum Likelihood Thresholding method followed by a cap enforcement procedure.

Algorithm

  1. For each segment (V, D, J), let the output vector be p = [p_1, ..., p_n].
  2. Compute the maximum likelihood: p_max = max(p).
  3. Define a threshold: threshold = Φ × p_max, where Φ is a segment-specific parameter (e.g. 0.75 for V, 0.3 for D, 0.8 for J).
  4. Filter alleles: keep all p_i such that p_i ≥ threshold.
  5. Apply cap: if the number of alleles passing the threshold exceeds a predefined cap (e.g. 3), keep only the highest-scoring alleles, up to the cap.
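
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not AlignAIR's own code: the function name `threshold_alleles`, its signature, and the example likelihoods are assumptions made here, and ties are broken by original index order, which the description above does not specify.

```python
def threshold_alleles(p, phi, cap):
    """Return indices of the called alleles, highest likelihood first.

    Hypothetical helper: p is the likelihood vector for one segment,
    phi is the segment-specific fraction, cap is the maximum number of calls.
    """
    p_max = max(p)                       # step 2: maximum likelihood
    threshold = phi * p_max              # step 3: dynamic threshold
    # Step 4: keep every allele whose likelihood reaches the threshold.
    passing = [i for i, p_i in enumerate(p) if p_i >= threshold]
    # Step 5: order by likelihood (stable sort keeps index order on ties)
    # and truncate to the cap.
    passing.sort(key=lambda i: p[i], reverse=True)
    return passing[:cap]


# Made-up V-segment likelihoods; 0.75 is the example V value of phi given above.
v_likelihoods = [0.05, 0.60, 0.50, 0.02, 0.47]
v_calls = threshold_alleles(v_likelihoods, phi=0.75, cap=3)
v_calls_capped = threshold_alleles(v_likelihoods, phi=0.75, cap=2)
```

Here the threshold is 0.75 × 0.60 = 0.45, so alleles 1, 2 and 4 pass. A cap of 3 keeps all of them, while a cap of 2 drops allele 4.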

Intuition

This method preserves the probabilistic nature of the model's predictions while applying a clear cutoff to reduce noise. If several alleles are nearly as likely as the best one, all of those above the dynamic threshold are retained, instead of a fixed top-k being selected regardless of how the likelihoods are distributed. The cap bounds the size of the returned set, so that a nearly flat likelihood vector cannot produce an overly permissive call.
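
To make the contrast with a fixed top-k concrete, the sketch below runs both rules on two made-up likelihood vectors, one confident and one ambiguous. The function names and numbers are illustrative assumptions, not part of AlignAIR.

```python
def dynamic_calls(p, phi, cap):
    """Dynamic threshold followed by a cap (hypothetical helper)."""
    threshold = phi * max(p)
    passing = [i for i, p_i in enumerate(p) if p_i >= threshold]
    passing.sort(key=lambda i: p[i], reverse=True)
    return passing[:cap]


def top_k_calls(p, k):
    """Fixed top-k selection, shown only for comparison."""
    return sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]


confident = [0.90, 0.05, 0.03, 0.02]   # one clear winner
ambiguous = [0.34, 0.33, 0.31, 0.02]   # three near-equal candidates

dyn_confident = dynamic_calls(confident, phi=0.75, cap=3)
dyn_ambiguous = dynamic_calls(ambiguous, phi=0.75, cap=3)
top1_ambiguous = top_k_calls(ambiguous, k=1)
top3_confident = top_k_calls(confident, k=3)
```

The dynamic rule returns one allele for the confident vector and three for the ambiguous one. A fixed k has to choose: k = 1 discards two plausible candidates in the ambiguous case, and k = 3 pads the confident case with two low-likelihood alleles.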

Optimization Strategy

The values of Φ and the cap were selected via grid search to maximize agreement with ground truth labels while minimizing the number of alleles returned. This balances sensitivity (returning all plausible candidates) against specificity (not returning noise).
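
A grid search of this kind might look like the sketch below. The scoring function is an assumption: it rewards sequences whose true allele appears in the called set and subtracts a small penalty per returned allele. The actual objective, the candidate grids, and the `size_penalty` weight used for AlignAIR are not stated here, so every name and number in the block is hypothetical.

```python
from itertools import product


def call_alleles(p, phi, cap):
    """Dynamic threshold followed by a cap (hypothetical helper)."""
    threshold = phi * max(p)
    passing = [i for i, p_i in enumerate(p) if p_i >= threshold]
    passing.sort(key=lambda i: p[i], reverse=True)
    return passing[:cap]


def grid_search(likelihoods, true_indices, phis, caps, size_penalty=0.05):
    """Return (score, phi, cap) for the best-scoring combination.

    Assumed objective: agreement minus size_penalty * mean call-set size.
    """
    best = None
    for phi, cap in product(phis, caps):
        calls = [call_alleles(p, phi, cap) for p in likelihoods]
        # Agreement: fraction of sequences whose true allele was called.
        agreement = sum(t in c for t, c in zip(true_indices, calls)) / len(calls)
        mean_size = sum(len(c) for c in calls) / len(calls)
        score = agreement - size_penalty * mean_size
        # Strict comparison keeps the earliest (phi, cap) pair on ties.
        if best is None or score > best[0]:
            best = (score, phi, cap)
    return best


# Toy data: three sequences, three candidate alleles each.
toy_likelihoods = [[0.6, 0.5, 0.1], [0.8, 0.1, 0.1], [0.5, 0.3, 0.2]]
toy_truth = [1, 0, 0]
best = grid_search(toy_likelihoods, toy_truth, phis=[0.5, 0.75, 0.9], caps=[1, 2, 3])
```

On this toy data a cap of 1 or Φ = 0.9 misses the true allele of the first sequence, while Φ = 0.5 recovers it only by returning an extra allele for the third sequence, so the search settles on Φ = 0.75 with a cap of 2.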

Special Case: D Region

Because D segments are short and often heavily mutated, an additional label called Short-D is added to the D likelihood vector. If this label receives a high probability (> 0.5), the model suppresses the other D allele predictions using a penalty term. This keeps segmentation and classification consistent and avoids spurious allele calls when the D region is unreliable.
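
The effect of the Short-D label on thresholding can be illustrated as follows. The form of the penalty is not described in this section, so the sketch assumes a simple multiplicative factor applied to every non-Short-D entry. The function names, the factor of 0.1, and the position of Short-D as the last entry of the vector are all assumptions made for illustration.

```python
def suppress_when_short_d(p, short_d_index, penalty=0.1, cutoff=0.5):
    """Down-weight the other D alleles when Short-D exceeds the cutoff.

    Hypothetical helper: the multiplicative penalty is an assumed stand-in
    for the penalty term mentioned above.
    """
    if p[short_d_index] <= cutoff:
        return list(p)                   # Short-D not dominant: leave as is
    return [p_i if i == short_d_index else p_i * penalty
            for i, p_i in enumerate(p)]


def call_alleles(p, phi, cap):
    """Dynamic threshold followed by a cap (hypothetical helper)."""
    threshold = phi * max(p)
    passing = [i for i, p_i in enumerate(p) if p_i >= threshold]
    passing.sort(key=lambda i: p[i], reverse=True)
    return passing[:cap]


# Made-up D likelihoods; the last entry (index 2) is assumed to be Short-D.
d_likelihoods = [0.30, 0.15, 0.55]
raw_calls = call_alleles(d_likelihoods, phi=0.3, cap=3)
penalized = suppress_when_short_d(d_likelihoods, short_d_index=2)
short_d_calls = call_alleles(penalized, phi=0.3, cap=3)
```

With the permissive D value of Φ = 0.3, the raw vector would return both Short-D and allele 0. After the assumed penalty only Short-D passes the threshold, which matches the stated goal of not calling a specific D allele when the region is unreliable.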

References

See supplementary section 1.5.2 in the AlignAIR manuscript for full implementation details.