AlignAIR Model Prediction Guide
This guide explains how to use the AlignAIR Python tool to align Adaptive Immune Receptor (AIR) sequences. AlignAIR is a deep learning-based aligner designed to handle V(D)J recombination, somatic hypermutation (SHM), and other forms of sequence corruption. By leveraging GenAIRR for unbiased training data, the AlignAIR model outperforms traditional heuristic-based aligners.
After installing AlignAIR, the alignair_predict command should be available on your command line. Below is a step-by-step guide covering the execution modes, parameters, and usage examples.
1. Running AlignAIR
AlignAIR can be run in one of three modes: cli (default), yaml, or interactive. These modes let you pass all arguments on the command line, supply a YAML configuration file, or answer an interactive prompt for the required parameters.
For example, the following command runs AlignAIR in cli mode:
alignair_predict --mode cli --model_checkpoint /path/to/model \
--sequences /path/to/sequences.csv --save_path /save/here --chain_type heavy
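When AlignAIR is driven from a script rather than typed by hand, the same invocation can be assembled programmatically. The following is a minimal sketch, not part of AlignAIR itself: the helper name build_predict_command is hypothetical, and it only uses the flags shown in the command above. It builds the argument list without running anything.

```python
import shlex


def build_predict_command(model_checkpoint, sequences, save_path,
                          chain_type="heavy", **options):
    """Return the argv list for an alignair_predict run in cli mode.

    Hypothetical helper for illustration; it is not provided by AlignAIR.
    """
    if chain_type not in ("heavy", "light"):
        raise ValueError("chain_type must be 'heavy' or 'light'")
    argv = [
        "alignair_predict",
        "--mode", "cli",
        "--model_checkpoint", str(model_checkpoint),
        "--sequences", str(sequences),
        "--save_path", str(save_path),
        "--chain_type", chain_type,
    ]
    # Extra options (for example batch_size=1024) become "--name value" pairs.
    # Flags that take no value are not handled by this sketch.
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


argv = build_predict_command("/path/to/model", "/path/to/sequences.csv", "/save/here")
print(shlex.join(argv))

# To actually launch the run (requires AlignAIR to be installed):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.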
2. Modes of Operation
- cli: Command-line interface where all arguments are passed as flags. This is the most direct way to run AlignAIR.
- yaml: YAML configuration file mode, where you provide all inputs in a structured YAML file. Use the --config_file argument to specify the YAML file path.
- interactive: A user-friendly, question-and-answer interface that prompts you for input values.
3. Parameters
The script takes several input parameters that control the alignment process: the input sequences, the model configuration, and the output. You can provide these parameters either on the command line or in a YAML configuration file.
--mode
Specifies the mode of input. Choose from cli, yaml, or interactive.
--config_file
Path to the YAML configuration file (only required in yaml mode).
--model_checkpoint
Path to the pre-trained AlignAIR model's saved weights. Required for running predictions.
--save_path
Path to where the aligned sequences and results will be saved (usually in CSV format).
--chain_type
Specifies the type of chain to align. Choose between heavy and light.
--sequences
Path to the sequences file (CSV, TSV, or FASTA) containing sequences to be aligned.
--lambda_data_config
Path to the lambda chain DataConfig file. Default is D.
--kappa_data_config
Path to the kappa chain DataConfig file. Default is D.
--max_input_size
Maximum model input size in nucleotides. Default is 576.
--batch_size
Number of sequences to process per batch. Default is 2048.
--v_allele_threshold
Threshold for V allele assignment (percentage-based). Default is 0.1.
--d_allele_threshold
Threshold for D allele assignment. Default is 0.1.
--j_allele_threshold
Threshold for J allele assignment. Default is 0.1.
--v_cap
Maximum number of V alleles that can be called per sequence. Default is 3.
--d_cap
Maximum number of D alleles that can be called per sequence. Default is 3.
--j_cap
Maximum number of J alleles that can be called per sequence. Default is 3.
--translate_to_asc
Flag to translate allele names back to ASC names from IMGT format.
--fix_orientation
Preprocessing step that corrects the orientation of DNA sequences that are reversed or complemented. Default is true.
--custom_orientation_pipeline_path
Path to a custom orientation model (optional).
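To make the --fix_orientation option more concrete, the sketch below enumerates the orientations in which a sequence might arrive: as-is, reversed, complemented, or reverse-complemented. It illustrates only the string transformations involved. How AlignAIR detects which orientation a sequence is in is not described in this guide, and the function name candidate_orientations is hypothetical.

```python
# Complement table for DNA; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def candidate_orientations(seq):
    """Return the four orientations a sequence could have been supplied in."""
    complemented = seq.translate(COMPLEMENT)
    return {
        "as_is": seq,
        "reversed": seq[::-1],
        "complement": complemented,
        "reverse_complement": complemented[::-1],
    }


for name, oriented in candidate_orientations("ATGGCC").items():
    print(f"{name:>18}: {oriented}")
```

Applying the matching transformation a second time restores the original sequence, which is why an orientation fix can be a pure preprocessing step before alignment.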
4. Example Usage
Here are a few examples of how you can run the AlignAIR tool:
CLI Mode
alignair_predict --mode cli --model_checkpoint /path/to/model \
--save_path /output/alignment.csv --chain_type heavy \
--sequences /data/sequences.csv --batch_size 1024 --v_allele_threshold 0.1
This command runs AlignAIR in CLI mode with a pre-trained model, the heavy chain type, and a CSV file of sequences. The output is saved to /output/alignment.csv.
YAML Mode
alignair_predict --mode yaml --config_file /path/to/config.yaml
This example shows how to use a YAML configuration file to input parameters.
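The guide does not show the contents of the configuration file. As a rough sketch, the snippet below writes one under the assumption that the YAML keys mirror the command-line flag names without the leading dashes. That key naming is an assumption, so check it against the AlignAIR documentation before relying on it. The to_yaml helper is hypothetical and handles only flat key-value pairs.

```python
from pathlib import Path
import tempfile

# ASSUMPTION: keys mirror the command-line flag names. The schema actually
# expected by alignair_predict may differ.
config = {
    "model_checkpoint": "/path/to/model",
    "sequences": "/data/sequences.csv",
    "save_path": "/output/alignment.csv",
    "chain_type": "heavy",
    "batch_size": 1024,
    "v_allele_threshold": 0.1,
}


def to_yaml(mapping):
    """Serialize a flat mapping of simple scalars as 'key: value' YAML lines."""
    return "".join(f"{key}: {value}\n" for key, value in mapping.items())


# Written to a temporary directory here; use any path you like in practice.
config_path = Path(tempfile.mkdtemp()) / "config.yaml"
config_path.write_text(to_yaml(config))
print(config_path.read_text())
```

The resulting file path is what would be passed to --config_file.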
Interactive Mode
alignair_predict --mode interactive
This command runs AlignAIR in interactive mode, prompting the user for each required parameter.
5. Pipeline Overview
Once the input parameters are provided, AlignAIR runs through several processing steps:
- ConfigLoadStep: Loads the necessary configurations and prepares the system for alignment.
- FileNameExtractionStep: Extracts relevant file names from the input sequences.
- ModelLoadingStep: Loads the pre-trained AlignAIR model.
- BatchProcessingStep: Processes sequences in batches according to the specified batch size.
- CleanAndArrangeStep: Cleans and arranges the raw predictions into a structured format.
- SegmentCorrectionStep: Corrects segmentations based on the model’s output.
- MaxLikelihoodPercentageThresholdApplicationStep: Applies likelihood thresholds to select the best V, D, and J allele assignments.
- FinalizationStep: Finalizes the alignment process and saves the results to the output file.
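The thresholding step above interacts with the --v_allele_threshold and --v_cap style parameters. The sketch below shows one plausible reading of that step, assuming the threshold is a fraction of the highest allele likelihood and the cap limits how many alleles are returned. This interpretation is inferred from the step name and is not confirmed by this guide. The function select_alleles and the likelihood values are made up for illustration.

```python
def select_alleles(likelihoods, threshold=0.1, cap=3):
    """Keep alleles scoring at least `threshold` times the top likelihood.

    Alleles are returned best first, and at most `cap` of them are kept.
    ASSUMPTION: this is an illustrative reading of the thresholding step,
    not AlignAIR's actual implementation.
    """
    if not likelihoods:
        return []
    ranked = sorted(likelihoods.items(), key=lambda item: item[1], reverse=True)
    cutoff = threshold * ranked[0][1]
    kept = [name for name, score in ranked if score >= cutoff]
    return kept[:cap]


# Made-up likelihoods for four V alleles of a single sequence.
scores = {
    "IGHV1-2*02": 0.80,
    "IGHV1-2*04": 0.12,
    "IGHV1-3*01": 0.05,
    "IGHV3-23*01": 0.03,
}
print(select_alleles(scores))                  # default threshold and cap
print(select_alleles(scores, threshold=0.01))  # looser threshold, cap applies
```

Under this reading, raising the threshold yields fewer, more confident calls per sequence, while the cap bounds the number of calls regardless of how many alleles pass the threshold.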
6. Conclusion
This guide covers the essential steps for running the AlignAIR model prediction script in its different modes. AlignAIR offers flexibility and power, ensuring high accuracy in aligning immunoglobulin sequences and predicting V(D)J alleles. For more advanced configurations and usage, please refer to the AlignAIR documentation or contact the developers.