Model Architecture
AlignAIR uses a multi-task deep residual convolutional architecture to simultaneously predict V, D, and J segment boundaries, allele assignments, mutation rate, and productivity.
Input Representation
DNA sequences are integer-encoded and tokenized into fixed-length windows (default: 576 nt). Each nucleotide is embedded into a learned continuous representation. The embedded input is passed through multiple 1D convolutional layers.
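As a rough illustration of the encoding step, the sketch below maps a nucleotide string to integer tokens and pads it to the 576 nt default. The token ids, the padding id, the right-padding strategy, and the name encode_sequence are assumptions made for this example and may not match the AlignAIR implementation.

```python
# Hypothetical vocabulary: 0 is reserved for padding; the real mapping may differ.
TOKEN_IDS = {"A": 1, "T": 2, "G": 3, "C": 4, "N": 5}
PAD_ID = 0
MAX_LEN = 576  # default window length stated in the text


def encode_sequence(seq, max_len=MAX_LEN):
    """Integer-encode a nucleotide string and right-pad it to max_len."""
    seq = seq.upper()
    if len(seq) > max_len:
        raise ValueError(f"sequence length {len(seq)} exceeds window of {max_len}")
    # Unknown characters fall back to the ambiguous-base token "N".
    ids = [TOKEN_IDS.get(base, TOKEN_IDS["N"]) for base in seq]
    return ids + [PAD_ID] * (max_len - len(ids))


encoded = encode_sequence("ACGTN")
print(len(encoded), encoded[:6])  # 576 [1, 4, 3, 2, 5, 0]
```

The resulting integer vector is what an embedding layer would consume, looking up one learned vector per token id.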
Residual Convolutional Stack
The backbone of the network is a series of residual blocks built from dilated 1D convolutions. Dilation widens the receptive field of each layer, so the stack captures both local and long-range context without excessive depth. Batch normalization and dropout are used for regularization.
The network is symmetric and preserves sequence length, allowing predictions at each nucleotide position.
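To make the length-preserving behaviour concrete, here is a minimal single-channel sketch of a residual block around a dilated convolution, written in plain Python. The kernel values, the dilation schedule (1, 2, 4, 8), and the function names are illustrative assumptions. Batch normalization, dropout, and multiple channels are omitted, so this is not the AlignAIR layer itself.

```python
def dilated_conv1d_same(x, kernel, dilation):
    """Single-channel dilated 1D convolution with zero 'same' padding."""
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel keeps the output aligned with the input"
    half = (k // 2) * dilation
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, kernel, dilation):
    """y = x + relu(conv(x)); the skip connection requires equal lengths."""
    conv_out = relu(dilated_conv1d_same(x, kernel, dilation))
    return [a + b for a, b in zip(x, conv_out)]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [0.0] * 576
signal[288] = 1.0  # single impulse in the middle of the window
out = signal
for d in (1, 2, 4, 8):  # hypothetical dilation schedule
    out = residual_block(out, kernel=[0.25, 0.5, 0.25], dilation=d)

print(len(out))                          # 576: sequence length is preserved
print(receptive_field(3, (1, 2, 4, 8)))  # 31 positions from only four layers
```

Because every block returns one value per input position, a per-nucleotide prediction can be read directly from the final feature map.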
Multi-Task Output Heads
The model branches into multiple heads:
- V/D/J segmentation: Start and end coordinates for each gene segment.
- Allele classification: Likelihood distribution over known V, D, and J alleles.
- Mutation rate: A regression head to estimate per-sequence mutation level.
- Productivity prediction: Binary classification to determine if the sequence is productive.
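The heads listed above can be pictured as one record per sequence that is then decoded into a readable call. The sketch below is only an assumption about how such outputs might be post-processed: the HeadOutputs container, the softmax over allele scores, the 0.5 productivity threshold, and the example allele names are illustrative. Only the V segment is shown; D and J would follow the same pattern.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class HeadOutputs:
    """Hypothetical raw outputs for one sequence (V segment only, for brevity)."""
    v_start: float                # segmentation head: predicted start coordinate
    v_end: float                  # segmentation head: predicted end coordinate
    v_allele_logits: List[float]  # classification head: one score per known allele
    mutation_rate: float          # regression head
    productive_logit: float       # productivity head, before the sigmoid


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode(outputs, allele_names, seq_len=576):
    """Turn raw head outputs into a readable prediction."""
    probs = softmax(outputs.v_allele_logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Clamp coordinates to the window and keep end >= start.
    start = min(max(int(round(outputs.v_start)), 0), seq_len)
    end = min(max(int(round(outputs.v_end)), start), seq_len)
    return {
        "v_start": start,
        "v_end": end,
        "v_call": allele_names[best],
        "v_likelihood": probs[best],
        "mutation_rate": outputs.mutation_rate,
        "productive": sigmoid(outputs.productive_logit) >= 0.5,  # assumed threshold
    }


example = HeadOutputs(
    v_start=0.4, v_end=293.7,
    v_allele_logits=[0.1, 2.3, -1.0],
    mutation_rate=0.05,
    productive_logit=1.2,
)
prediction = decode(example, ["IGHV1-2*01", "IGHV1-2*02", "IGHV1-3*01"])
print(prediction)
```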
Loss Function Design
AlignAIR optimizes a composite loss function combining:
- Cross-entropy for allele classification
- IoU-style regression loss for segmentation
- MSE loss for mutation prediction
- Binary cross-entropy for productivity
All losses are normalized and weighted to prevent dominance of any single task.
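A composite loss of this kind can be sketched for a single sequence as follows. The exact form of the IoU-style term, the normalization scheme, and the task weights are not given here, so the interval IoU, the division by the sum of weights, and the equal weights below are assumptions chosen for illustration.

```python
import math


def cross_entropy(probs, true_index, eps=1e-7):
    """Negative log-likelihood of the true allele."""
    return -math.log(max(probs[true_index], eps))


def interval_iou_loss(pred, true):
    """1 - IoU of two (start, end) intervals on the sequence axis."""
    (ps, pe), (ts, te) = pred, true
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return 1.0 - inter / union if union > 0 else 1.0


def mse(pred, true):
    return (pred - true) ** 2


def binary_cross_entropy(p, y, eps=1e-7):
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def composite_loss(losses, weights):
    """Weighted sum of task losses, divided by the total weight (assumed scheme)."""
    total_weight = sum(weights[name] for name in losses)
    return sum(weights[name] * value for name, value in losses.items()) / total_weight


losses = {
    "allele": cross_entropy([0.1, 0.8, 0.1], true_index=1),
    "segmentation": interval_iou_loss((10, 300), (0, 290)),
    "mutation": mse(0.05, 0.08),
    "productivity": binary_cross_entropy(0.9, 1),
}
# Placeholder weights; in practice these would be tuned or learned.
weights = {"allele": 1.0, "segmentation": 1.0, "mutation": 1.0, "productivity": 1.0}
total = composite_loss(losses, weights)
print(losses, total)
```

With equal weights the total reduces to the mean of the four task losses. Raising one weight shifts the optimization toward that task.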
Efficiency and Parallelization
The convolutional architecture enables efficient GPU utilization and allows processing of thousands of sequences in parallel using large batch sizes. Model inference is fully parallelized over batch and sequence dimensions.
References
For a schematic and the exact implementation details, see Supplementary Figure 2 and Section 1.4 of the AlignAIR manuscript.