AlignAIR Model Prediction Guide

This guide explains how to align Adaptive Immune Receptor (AIR) sequences with the AlignAIR Python tool. AlignAIR is a deep learning-based aligner designed to handle V(D)J recombination, somatic hypermutation (SHM), and other forms of sequence corruption. Because it is trained on unbiased data generated with GenAIRR, the AlignAIR model outperforms traditional heuristic-based aligners.

After you install AlignAIR, the alignair_predict command should be available on your command line. The sections below cover the execution modes, the parameters, and example invocations.

1. Running AlignAIR

AlignAIR can be run in one of three available modes: cli (default), yaml, or interactive. These modes allow users to choose between command-line execution, providing a YAML configuration file, or using an interactive prompt to input the required parameters.

For example, you can run the following command to use AlignAIR in cli mode:

alignair_predict --mode cli --model_checkpoint /path/to/model \
  --sequences /path/to/sequences.csv --save_path /save/here --chain_type heavy

2. Modes of Operation

cli: Command-line interface where all arguments are passed as flags. This is the most direct way to run AlignAIR.
yaml: YAML configuration file mode, where you provide all inputs in a structured YAML file. Use the --config_file argument to specify the YAML file path.
interactive: A user-friendly, question-and-answer interface that prompts you for input values.

3. Parameters

The script takes several input parameters that control the input sequences, the model configuration, and the output. You can provide them either as command-line flags or in a YAML configuration file.

--mode

Specifies the mode of input. Choose from cli, yaml, or interactive. Default is cli.

--config_file

Path to the YAML configuration file (only required in yaml mode).

--model_checkpoint

Path to the pre-trained AlignAIR model's saved weights. Required for running predictions.

--save_path

Path where the aligned sequences and results will be saved (usually in CSV format).

--chain_type

Specifies the type of chain to align. Choose between heavy and light.

--sequences

Path to the sequences file (CSV, TSV, or FASTA) containing sequences to be aligned.

--lambda_data_config

Path to the lambda chain DataConfig file. Default is D.

--kappa_data_config

Path to the kappa chain DataConfig file. Default is D.

--max_input_size

Maximum model input size in nucleotides. Default is 576.

--batch_size

Number of sequences to process per batch. Default is 2048.

--v_allele_threshold

Threshold for V allele assignment (percentage-based). Default is 0.1.

--d_allele_threshold

Threshold for D allele assignment. Default is 0.1.

--j_allele_threshold

Threshold for J allele assignment. Default is 0.1.

--v_cap

Maximum number of V alleles that can be called per sequence. Default is 3.

--d_cap

Maximum number of D alleles that can be called per sequence. Default is 3.

--j_cap

Maximum number of J alleles that can be called per sequence. Default is 3.

--translate_to_asc

Flag to translate allele names back to ASC names from IMGT format.

--fix_orientation

Preprocessing step to fix DNA orientation if reversed or complemented. Default is true.

--custom_orientation_pipeline_path

Path to a custom orientation model (optional).
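
To make concrete what --fix_orientation has to undo, the sketch below enumerates the orientations a read can arrive in: as-is, reversed, complemented, and reverse-complemented. It is plain illustrative Python, not AlignAIR's orientation model, and every function name in it is hypothetical.

```python
# Illustrative only: hypothetical helpers, not part of the AlignAIR API.
# They show the sequence transformations an orientation fix must reverse.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement(seq: str) -> str:
    """Swap each base for its complement, keeping the order."""
    return seq.translate(COMPLEMENT)


def reverse(seq: str) -> str:
    """Reverse the order of the bases without complementing them."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Complement every base and reverse the order."""
    return complement(seq)[::-1]


def candidate_orientations(seq: str) -> dict:
    """Return the four orientations a sequence may be observed in."""
    return {
        "as_is": seq,
        "reversed": reverse(seq),
        "complemented": complement(seq),
        "reverse_complemented": reverse_complement(seq),
    }


if __name__ == "__main__":
    for name, oriented in candidate_orientations("ATGCCG").items():
        print(f"{name}: {oriented}")
```

Each transformation is its own inverse, so applying the same function again restores the original sequence once the orientation has been identified.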

4. Example Usage

Here are a few examples of how you can run the AlignAIR tool:

CLI Mode

alignair_predict --mode cli --model_checkpoint /path/to/model \
  --save_path /output/alignment.csv --chain_type heavy \
  --sequences /data/sequences.csv --batch_size 1024 --v_allele_threshold 0.1

This command runs AlignAIR in CLI mode, using a pre-trained model, a heavy chain type, and a CSV file with sequences. The output is saved to /output/alignment.csv.
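
When AlignAIR is called from a script rather than typed by hand, it can help to assemble the same command programmatically. The following is a minimal sketch that uses only flags listed in this guide; the helper name build_alignair_command is hypothetical, and the paths are the placeholders from the example above.

```python
import shlex


def build_alignair_command(model_checkpoint, sequences, save_path,
                           chain_type="heavy", batch_size=2048,
                           v_allele_threshold=0.1):
    """Build the argument list for an alignair_predict call in cli mode.

    Hypothetical helper: it only arranges flags documented in this guide
    and does not import or call AlignAIR itself.
    """
    if chain_type not in ("heavy", "light"):
        raise ValueError("chain_type must be 'heavy' or 'light'")
    return [
        "alignair_predict",
        "--mode", "cli",
        "--model_checkpoint", str(model_checkpoint),
        "--save_path", str(save_path),
        "--chain_type", chain_type,
        "--sequences", str(sequences),
        "--batch_size", str(batch_size),
        "--v_allele_threshold", str(v_allele_threshold),
    ]


if __name__ == "__main__":
    cmd = build_alignair_command(
        model_checkpoint="/path/to/model",
        sequences="/data/sequences.csv",
        save_path="/output/alignment.csv",
        batch_size=1024,
    )
    # Print the command for inspection. To execute it, pass the list to
    # subprocess.run(cmd, check=True) on a machine with AlignAIR installed.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.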

YAML Mode

alignair_predict --mode yaml --config_file /path/to/config.yaml

This command runs AlignAIR in yaml mode, reading all parameters from the configuration file at /path/to/config.yaml.
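
This guide does not show the layout of the configuration file. The sketch below writes one under the assumption that the YAML keys mirror the command-line flag names without the leading dashes; that mapping is a guess and should be checked against the AlignAIR documentation. The helper to_simple_yaml is hypothetical and handles only flat key/value pairs.

```python
def to_simple_yaml(config: dict) -> str:
    """Render a flat dict as YAML 'key: value' lines.

    Hypothetical helper covering only strings, numbers, and booleans.
    """
    lines = []
    for key, value in config.items():
        if isinstance(value, bool):
            # Check bool before int, because bool is a subclass of int.
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            # Double-quote strings so that paths are read literally.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


# ASSUMPTION: the key names mirror the CLI flags described in this guide.
example_config = {
    "model_checkpoint": "/path/to/model",
    "save_path": "/output/alignment.csv",
    "chain_type": "heavy",
    "sequences": "/data/sequences.csv",
    "batch_size": 1024,
    "v_allele_threshold": 0.1,
    "fix_orientation": True,
}

if __name__ == "__main__":
    # Print the file contents; redirect to config.yaml to save them.
    print(to_simple_yaml(example_config), end="")
```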

Interactive Mode

alignair_predict --mode interactive

This command runs AlignAIR in interactive mode, prompting the user for each required parameter.

5. Pipeline Overview

Once the input parameters are provided, AlignAIR runs through several processing steps:

ConfigLoadStep: Loads the necessary configurations and prepares the system for alignment.
FileNameExtractionStep: Extracts the relevant file names from the path of the input sequences file.
ModelLoadingStep: Loads the pre-trained AlignAIR model.
BatchProcessingStep: Processes sequences in batches according to the specified batch size.
CleanAndArrangeStep: Cleans and arranges the raw predictions into a structured format.
SegmentCorrectionStep: Corrects segmentations based on the model’s output.
MaxLikelihoodPercentageThresholdApplicationStep: Applies likelihood thresholds to select the best V, D, and J allele assignments.
FinalizationStep: Finalizes the alignment process and saves the results to the output file.
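
The thresholding step and the --v_cap, --d_cap, and --j_cap parameters can be pictured with a small sketch. It assumes one plausible reading of "max likelihood percentage threshold": keep every allele whose predicted likelihood is at least threshold times the highest likelihood, then keep at most cap alleles. This is an illustration under that assumption, not AlignAIR's implementation; the function name and the likelihood values are made up.

```python
def select_alleles(likelihoods: dict, threshold: float = 0.1, cap: int = 3) -> list:
    """Pick allele calls from per-allele likelihoods.

    ASSUMED rule (not confirmed by this guide): an allele is kept when its
    likelihood is >= threshold * max likelihood, and at most `cap` alleles
    are returned, ordered from most to least likely.
    """
    if not likelihoods:
        return []
    cutoff = threshold * max(likelihoods.values())
    kept = [(name, p) for name, p in likelihoods.items() if p >= cutoff]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept[:cap]]


if __name__ == "__main__":
    # Made-up likelihoods for illustration only.
    v_likelihoods = {
        "IGHV1-2*02": 0.80,
        "IGHV1-2*04": 0.12,
        "IGHV1-3*01": 0.05,
        "IGHV3-23*01": 0.01,
    }
    print(select_alleles(v_likelihoods, threshold=0.1, cap=3))
```

Under this reading, a lower threshold admits more candidate alleles per sequence, while the cap bounds how many are reported regardless of the threshold.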

6. Conclusion

This guide covers the essential steps for running the AlignAIR model prediction script in its three modes: cli, yaml, and interactive. AlignAIR aligns immunoglobulin sequences and predicts their V(D)J alleles. For more advanced configurations and usage, please refer to the AlignAIR documentation or contact the developers.