Mutation Models
AlignAIR is designed to be robust to somatic hypermutation (SHM). It achieves this by training on synthetic repertoires generated under realistic SHM models, most notably the S5F model.
What is the S5F Model?
The S5F model is an empirically derived, 5-mer context-dependent model of somatic hypermutation. For each 5-mer it assigns a probability that the central nucleotide mutates, given the two bases on either side, along with the probabilities of each possible replacement base. The model was learned from a large corpus of human B cell sequences and captures the context-dependent targeting preferences of SHM, a process initiated by the activation-induced cytidine deaminase (AID) enzyme.
By sampling mutations according to these context-dependent probabilities, the S5F model introduces biologically plausible SHM patterns into simulated sequences.
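To make the idea of context-dependent sampling concrete, here is a minimal Python sketch. It is not the S5F implementation or the GenAIRR API: the `TOY_MUTABILITY` table, its values, and the function names are all hypothetical, and the replacement base is drawn uniformly, whereas S5F also uses context-dependent substitution probabilities.

```python
import random

BASES = "ACGT"

# Hypothetical toy table: relative mutability of the central base of a 5-mer.
# The values are illustrative only and are NOT taken from the real S5F table,
# which covers all 1024 possible 5-mers.
TOY_MUTABILITY = {
    "AGCTA": 5.0,  # central C in a WRCY-like context
    "TAGCT": 4.0,
}
DEFAULT_MUTABILITY = 1.0  # assumed fallback for 5-mers missing from the toy table


def site_weights(seq):
    """Return one relative weight per position; positions without a full 5-mer get 0."""
    weights = [0.0] * len(seq)
    for i in range(2, len(seq) - 2):
        kmer = seq[i - 2 : i + 3]
        weights[i] = TOY_MUTABILITY.get(kmer, DEFAULT_MUTABILITY)
    return weights


def mutate(seq, n_mutations, rng):
    """Apply n_mutations substitutions, choosing each site by its 5-mer weight.

    Requires len(seq) >= 5 so that at least one site has a non-zero weight.
    """
    seq = list(seq)
    for _ in range(n_mutations):
        # Recompute weights each round: an earlier mutation changes nearby contexts.
        weights = site_weights("".join(seq))
        pos = rng.choices(range(len(seq)), weights=weights, k=1)[0]
        # Simplification: uniform choice among the three other bases.
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    print(mutate("GGAGCTAGGAGCTAGG", 3, rng))
```

Because weights are recomputed after every substitution, a mutation that creates or destroys a high-mutability context changes where the next mutation is likely to fall. The same site can also be hit more than once, so the number of differences from the input is at most `n_mutations`.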
Role in AlignAIR
AlignAIR is trained on simulated sequences generated by GenAIRR using the S5F model. These mutated sequences mimic real B cell receptor sequences, including elevated mutation rates around known hotspots and conserved motifs. Training on them is critical for AlignAIR to:
- Handle insertions and deletions realistically
- Remain accurate even at high mutation loads
- Discriminate between alleles despite SHM distortion
- Learn context-aware segmentation and classification
Future Flexibility
One of AlignAIR's strengths lies in its adaptability: as more accurate or lineage-specific SHM models become available, new training datasets can be generated using those models and used to retrain AlignAIR. This ensures the framework remains extensible and biologically relevant.
For example, future SHM models that incorporate epigenetic context, chromatin accessibility, or clonal lineage effects could be integrated into GenAIRR to produce even more realistic synthetic repertoires.
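The retraining workflow described above amounts to treating the mutation model as a swappable component of the data generator. The sketch below illustrates that design in Python under stated assumptions: `MutationModel`, `UniformModel`, and `build_training_set` are hypothetical names invented for illustration and do not reflect the actual GenAIRR or AlignAIR interfaces.

```python
import random
from typing import Protocol


class MutationModel(Protocol):
    """Hypothetical interface: anything that can mutate a nucleotide sequence."""

    def mutate(self, seq: str, rate: float, rng: random.Random) -> str: ...


class UniformModel:
    """Baseline stand-in: each position mutates independently with probability `rate`."""

    def mutate(self, seq, rate, rng):
        out = []
        for base in seq:
            if rng.random() < rate:
                # Replace with one of the three other bases, chosen uniformly.
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            else:
                out.append(base)
        return "".join(out)


def build_training_set(germlines, model, rate, seed=0):
    """Return (mutated_sequence, germline_name) pairs for a given mutation model."""
    rng = random.Random(seed)
    return [(model.mutate(seq, rate, rng), name) for name, seq in germlines.items()]


if __name__ == "__main__":
    # Placeholder names and sequences, not real alleles.
    germlines = {"V_example_1": "ACGTACGTAC", "V_example_2": "TTGACCGGTA"}
    for mutated, label in build_training_set(germlines, UniformModel(), rate=0.2):
        print(label, mutated)
```

Under this design, adopting a new SHM model means supplying another class with the same `mutate` method and regenerating the dataset; the code that consumes the labeled pairs does not change.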
Reference in the Manuscript
The use and evaluation of mutation models, including S5F and its variants (such as S5F Opposite and S5F 60), are discussed in Section 1.5.1 of the AlignAIR manuscript. These variants test the robustness of AlignAIR to distributional shifts in SHM.