Model Architecture

AlignAIR uses a multi-task deep residual convolutional architecture that jointly predicts V, D, and J segment boundaries, allele assignments, mutation rate, and productivity.

Input Representation

DNA sequences are integer-encoded and tokenized into fixed-length windows (default: 576 nt). Each nucleotide is embedded into a learned continuous representation. The embedded input is passed through multiple 1D convolutional layers.
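As a rough illustration of the encoding step, the sketch below maps a nucleotide string to a fixed-length list of integer tokens. The vocabulary (TOKEN_IDS), the padding token, right-padding, and rejecting over-long sequences are all assumptions made for this example; only the 576 nt default comes from the text above. The learned embedding is not shown.

```python
from typing import List

# Hypothetical vocabulary (an assumption): 0 is reserved for padding,
# and N stands in for any ambiguous or unknown base.
TOKEN_IDS = {"A": 1, "T": 2, "G": 3, "C": 4, "N": 5}
PAD_ID = 0
MAX_LEN = 576  # default window length stated in the text


def encode_sequence(seq: str, max_len: int = MAX_LEN) -> List[int]:
    """Map a nucleotide string to a fixed-length list of integer tokens."""
    tokens = [TOKEN_IDS.get(base, TOKEN_IDS["N"]) for base in seq.upper()]
    if len(tokens) > max_len:
        raise ValueError(
            f"sequence length {len(tokens)} exceeds window of {max_len}"
        )
    # Right-pad with PAD_ID so every encoded sequence has the same length.
    return tokens + [PAD_ID] * (max_len - len(tokens))


encoded = encode_sequence("ACGTN")
print(encoded[:6], len(encoded))  # [1, 4, 3, 2, 5, 0] 576
```

Each integer would then index a row of a learned embedding table before the convolutional layers.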

Residual Convolutional Stack

The backbone of the network is a series of residual blocks with dilated 1D convolutions. Dilation widens the receptive field of each layer, so the stack captures both local and long-range context without excessive depth. Batch normalization and dropout are used for regularization.

The network is symmetric and preserves sequence length, allowing predictions at each nucleotide position.
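To make the length-preserving property concrete, here is a minimal single-channel sketch of a dilated convolution with zero "same" padding wrapped in a residual connection. The kernel size of 3, the dilation schedule (1, 2, 4, 8), the ReLU, and the block layout are assumptions chosen for illustration; the actual blocks also include channels, batch normalization, and dropout, none of which are modeled here.

```python
from typing import List, Sequence


def dilated_conv1d_same(
    x: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """Single-channel dilated 1D convolution with zero 'same' padding."""
    k = len(kernel)
    assert k % 2 == 1, "an odd kernel size keeps the padding symmetric"
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    # One output per input position, so the sequence length is unchanged.
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ]


def residual_block(
    x: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """y = x + relu(conv(x)); the skip connection requires equal lengths."""
    conv = dilated_conv1d_same(x, kernel, dilation)
    return [xi + max(0.0, ci) for xi, ci in zip(x, conv)]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Push a single impulse through four blocks and see how far it spreads.
signal = [0.0] * 576
signal[288] = 1.0
out = signal
for d in (1, 2, 4, 8):
    out = residual_block(out, [1.0, 1.0, 1.0], d)

touched = sum(1 for v in out if v != 0.0)
print(len(out), touched, receptive_field(3, [1, 2, 4, 8]))  # 576 31 31
```

Under these assumptions, four layers already see 31 positions, whereas four undilated layers with the same kernel would see only 9.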

Multi-Task Output Heads

The model branches into multiple heads:

  • V/D/J segmentation: Start and end coordinates for each gene segment.
  • Allele classification: Likelihood distribution over known V, D, and J alleles.
  • Mutation rate: A regression head to estimate per-sequence mutation level.
  • Productivity prediction: Binary classification to determine if the sequence is productive.
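One way to picture the combined output of the four heads is as a single record per sequence. The container below is purely hypothetical: the class name PredictionSketch, its field names, the placeholder allele names, and the 0.5 productivity threshold are assumptions for illustration and do not reflect AlignAIR's actual output format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PredictionSketch:
    """Hypothetical per-sequence bundle of the four head outputs."""

    segments: Dict[str, Tuple[int, int]]   # gene -> (start, end) coordinates
    allele_scores: Dict[str, List[float]]  # gene -> one score per known allele
    mutation_rate: float                   # per-sequence regression output
    productive_prob: float                 # output of the binary head

    def top_allele(self, gene: str, allele_names: List[str]) -> str:
        """Return the name of the highest-scoring allele for one gene."""
        scores = self.allele_scores[gene]
        best = max(range(len(scores)), key=scores.__getitem__)
        return allele_names[best]

    def is_productive(self, threshold: float = 0.5) -> bool:
        """Threshold the productivity score (threshold is an assumption)."""
        return self.productive_prob >= threshold


# Made-up values and placeholder allele names, for illustration only.
pred = PredictionSketch(
    segments={"V": (0, 294), "D": (301, 318), "J": (325, 370)},
    allele_scores={"V": [0.1, 0.8, 0.1]},
    mutation_rate=0.04,
    productive_prob=0.93,
)
print(pred.top_allele("V", ["v_allele_a", "v_allele_b", "v_allele_c"]))
print(pred.is_productive())
```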

Loss Function Design

AlignAIR optimizes a composite loss function combining:

  • Cross-entropy for allele classification
  • IoU-style regression loss for segmentation
  • MSE loss for mutation prediction
  • Binary cross-entropy for productivity

All losses are normalized and weighted so that no single task dominates training.
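The four terms can be sketched for a single sequence as follows. This is a simplified illustration, not the manuscript's formulation: the interval form of the IoU loss (1 minus IoU of predicted and true [start, end] spans), the weighted-average normalization, and all function names are assumptions.

```python
import math
from typing import Dict, Sequence, Tuple


def cross_entropy(probs: Sequence[float], true_index: int, eps: float = 1e-9) -> float:
    """Categorical cross-entropy for one example with a single true class."""
    return -math.log(max(probs[true_index], eps))


def binary_cross_entropy(p: float, label: int, eps: float = 1e-9) -> float:
    """Binary cross-entropy for one predicted probability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def interval_iou_loss(pred: Tuple[float, float], true: Tuple[float, float]) -> float:
    """1 - IoU of two [start, end] intervals (assumes start <= end)."""
    inter = max(0.0, min(pred[1], true[1]) - max(pred[0], true[0]))
    union = (pred[1] - pred[0]) + (true[1] - true[0]) - inter
    return 1.0 - inter / union if union > 0 else 1.0


def mse(pred: float, true: float) -> float:
    """Squared error for one scalar prediction."""
    return (pred - true) ** 2


def composite_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of task losses; weights are rescaled to sum to 1."""
    total_weight = sum(weights[name] for name in losses)
    return sum(weights[name] * value for name, value in losses.items()) / total_weight


# Made-up predictions and targets for one sequence.
task_losses = {
    "allele": cross_entropy([0.7, 0.2, 0.1], true_index=0),
    "segmentation": interval_iou_loss((10, 300), (0, 290)),
    "mutation": mse(0.05, 0.08),
    "productivity": binary_cross_entropy(0.9, 1),
}
task_weights = {name: 1.0 for name in task_losses}  # equal weights (assumption)
print(composite_loss(task_losses, task_weights))
```

Raising one task's weight shifts the average toward that task; how the weights are actually chosen is not specified here.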

Efficiency and Parallelization

The convolutional architecture enables efficient GPU utilization and allows processing of thousands of sequences in parallel using large batch sizes. Model inference is fully parallelized over batch and sequence dimensions.

References

For a schematic and exact implementation details, see Supplementary Figure 2 and Section 1.4 of the AlignAIR manuscript.