Tutorials

Running simAIRR

simAIRR can be run through a command-line interface with a single user-supplied parameter - the path to the configuration specification file. Example below:

$ sim_airr -i <path_to_specification_file>

Configuration file format

The configuration for simAIRR’s simulations are to be supplied through a flat config file in YAML format. The snippets below show examples of the configuration for different usage modes of simAIRR. See simAIRR arguments for description and default settings of parameters.

Signal implantation mode (without much customisation)

In the example specification below, since no values were supplied for the non-required arguments, default settings will be used for some parameters (that are not shown in the snippet below). See simAIRR arguments for default settings of other parameters.

mode: signal_feasibility_assessment
olga_model: humanTRB
n_repertoires: 10
n_sequences: 10
n_threads: 2
signal_sequences_file: /path/to/ground_truth_signal_sequences.tsv
phenotype_burden: 3

Signal implantation mode (more customisation)

See acceptable file formats of signal sequences file File format of ground truth signal sequences

mode: signal_implantation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 10
seed: 1234
store_intermediate_files: True
n_sequences: 10
n_threads: 2
public_seq_proportion: 0.08 # 8% of the total unique sequences will be shared across repertoires
public_seq_pgen_count_mapping_file: /path/to/public_seq_pgen_count_mapping.tsv
signal_pgen_count_mapping_file: /path/to/signal_pgen_count_mapping.tsv
signal_sequences_file: /path/to/ground_truth_signal_sequences.tsv
positive_label_rate: 0.3 # 30% of the total repertoires will receive signal implantation
phenotype_burden: 3
phenotype_pool_size: 100 # use only any 100 of the total supplied sequences as a signal superset
allow_closer_phenotype_burden: False # whether to allow signal implantation if the feasible phenotype burden is closer to the user-desired phenotype burden (although it is not exact match)
export_nt: False # whether to export nucleotide sequences in the simulated repertoire files

Signal feasibility assessment mode

It is possible that user-supplied signal sequences cannot be sufficient to meet the user-desired specifications of implantation statistics including desired phenotype burden, number of positive class-labeled repertoires and so on based on realistic levels of population incidence given the generation probability.
In such a case, signal_implantation mode will not succeed in generating simulated repertoires. Rather, guidance will be provided on the signal implantation statistics (what is realistically possible) that will help the user to either tune the simulation parameters or supply a different set of signal sequences or both.
To help with this process of fine-tuning a set of signal sequences file and parameters of simulation, signal_feasibility_assessment mode coule be used. All the parameters of signal_implantation mode and signal_feasibility_assessment mode are identical. The only difference is that in signal_implantation mode, if implantation was deemed feasible, it will proceed with expensive computation steps of baseline repertoire generation and public component correction, whereas the signal_feasibility_assessment mode will stop even if the implantation was deemed feasible.

Baseline repertoires generation

Since the signal_implantation mode involves a sequence of steps that also involves baseline repertoires generation, this functionality is also made available to be run in a separate mode. Although generation of baseline repertoires can be accomplished with a few lines of code around existing tools, the parallelised version of this functionality implemented in baseline_repertoire_generation mode may turn out to be useful.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 20
n_sequences: 10
n_threads: 2

Public component correction

When synthetic AIRR datasets are generated using sampling from know V(D)J recombination models using existing tools, the resulting repertoires represent naive repertoires that have not experienced any antigen events. Thus, the proportion of public (shared) sequences in such AIRR datasets will be lower than what is observed in experimental AIRR datasets of antigen-experienced repertoires. To match real-world experimental datasets in terms of public sequences, simAIRR’s workflows include a public component correction step, where a fraction of the total unique sequences in the synthetic AIRR dataset (public_seq_proportion) will be forced to be shared across repertoires. The sharing pattern will be determined based on empirically learnt relation between generation probability and population incidence of sequences. With sampling from known V(D)J recombination models, one cannot exclude the possibility of observing same sequence twice; public_component_correction mode filters out duplicate sequences before making the sequences public.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 20
n_sequences: 10
n_threads: 2
public_seq_proportion: 0.12 # 12% of the total unique sequences will be shared across repertoires. Default is 10% if this argument is not supplied.
public_seq_pgen_count_mapping_file: /path/to/public_seq_pgen_count_mapping.tsv # default is a real-world experimental dataset calibrated mapping that is included with simAIRR

Querying sequences enriched for k-mer like patterns

Using simAIRR and the new bionumpy library, one could query and retrieve sequences enriched for k-mer like patterns very easily with just a few lines of Python code. Below, we provide a simple Python recipe.

Generation of reference sequences to query against

First, use the simAIRR’s Baseline repertoires generation mode to generate a large number of reference sequences to query against. For instance, in the code chunk below, we generate a single file with ten million sequences.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 1
n_sequences: 10000000
n_threads: 1

Note that instead of generating 10 million sequences in a single file, if multiple processes are used using n_threads argument (if more CPUs are available), and the n_sequences are split across multiple repertoires using n_threads argument, the reference sequence generation will be much quicker. For instance, the following configuration will generate 10 million sequences across 40 files in less than 2 minutes wall time.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 40
n_sequences: 250000
n_threads: 40

To combine them into a single file:

$ cat /path/to/output/directory/* > reference_sequences.txt

Alternatively, one could use the reference sequences of their choice to query the sequences for k-mer like patterns. The code chunks above is just one example to get a large number of sequences to make queries.
Let us assume that the ten million sequences that were generated in previous step are stored in a file named reference_sequences.txt.

Pattern matching recipe using bionumpy

Next, we import relevant functionalities from the bionumpy library to read in the reference_sequences.txt, create a lookup index and querying the sequences.

bionumpy works with popular file formats of biological sequence data. Since bionumpy does not know the file format of our reference_sequences.txt, we first need to tell bionumpy that the first two fields of this custom file format are a dna sequence and an amino acid sequence. The code chunk below shows how to do this. This is routine code and can just be copied as shown below.

>>> from bionumpy.io.delimited_buffers import DelimitedBuffer, get_bufferclass_for_datatype
>>> from bionumpy.bnpdataclass import bnpdataclass
>>> import bionumpy as bnp
>>> from bionumpy.sequence.indexing.wildcard_index import WildCardLookup

>>> @bnpdataclass
... class Olga:
...   dna: bnp.DNAEncoding
...   amino_acid: bnp.AminoAcidEncoding

>>> class OlgaBuffer(DelimitedBuffer):
...    dataclass = Olga

Using the newly created buffer type, we read-in the files with that file format.

>>> olga_sequence_data = bnp.open(filename="reference_sequences.txt", buffer_type=OlgaBuffer).read()
>>> olga_sequence_data
Olga with 10000000 entries
                      dna               amino_acid
  TGCGCCACCTGGGGGGACGA...               CATWGDEQYF
  TGTGCCAGCTCACCTACGAA...         CASSPTNSGSNYGYTF
  TGCGGGCCCGTAATGAACAC...              CGPVMNTEAFF
  TGTGCCAGCAGTGAAGCGCG...         CASSEARPARMYGYTF
  TGTGCCAGCAGTAGTGGGAC...          CASSSGTGPDQPQHF
  TGTGCCAGCAACCTAGCGGG...          CASNLAGKNTGELFF
  TGTGCCAGCAGCCAACCGGG...         CASSQPGGSGNYGYTF
  TGCGCCAGCAGCCGCGGCCT...           CASSRGLREETQYF
  TGTGCCAGCAGCCAAGTCTC...        CASSQVSRQDSSYEQYF
  TGTGCCAGCAGGCCGGGACA...     CASRPGQGAPGWEDNYGYTF

We create a wildcard lookup table on the amino acid sequences of the file that we read-in.

>>> aminoacid_wildcard_lookup = WildCardLookup.create_lookup(olga_sequence_data.amino_acid)
>>> aminoacid_wildcard_lookup
Lookup on WildcardIndex of 10000000 sequences

We then retrieve all the sequences that contain a pattern like “RG.”, where the “.” indicates a wildcard character.

>>> rg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("RG.")
>>> rg_wildcard_sequences
encoded_ragged_array(['CASSRGLREETQYF',
                      'CASGCRGTSGGASLDEQFF',
                      'CATSDLGVRRGALIATNEKL...',
                      'CATRRGYGYTF',
                      'CATRGAKRIDEQYF',
                      'CASSLRGQGLLRGNQPQHF',
                      'CASSFSCLRGESSYNEQFF',
                      'CASRGLFPQPQHF',
                      'CASSGCRGGNTEAFF',
                      'CASSEADRGRKAFF',
                      'CASRVASRGRDKQPQHF'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY'))

Presence of multiple motifs in sequences

One may be interested in multiple patterns in sequences. For instance, let us assume that one is interested in sequences that contain either of the following pattern: GG**GG or GG**YG. One can consider this as two cases:

One, where the user is interested in mining those sequences that contain patterns GG and either of another GG or YG that are always two residues apart, where the two residues can be any residues.
Alternatively, mining for the above patterns will also result in sequences that always contain a GG in addition to another GG or YG.

In either case, the following code block will mine the sequences of interest:

>>> gg_gg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("GG..GG")
>>> gg_gg_wildcard_sequences
encoded_ragged_array(['CASSVKPSGGVAGGETQYF',
                      'CSARPGGGEGGENQPQHF',
                      'CASGGRAGGVWAYEQYF',
                      'CASSTEGGLAGGLTYNEQFF',
                        ...
                      'CASSGGPEGGFSSTDTQYF'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY'))
>>> len(gg_gg_wildcard_sequences)
14968

>>> gg_yg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("GG..YG")
>>> len(gg_yg_wildcard_sequences)
28070

If the user is interested in filtering the search results above based on further criteria like gene usage or for containing additional motifs, the resulting sequence set can be considered a proxy for complex signal based on multiple criteria.

Recipe using grep

Alternatively, one could use the unix grep function instead. If the user wants to query multiple patterns from the reference sequences, all the patterns could be listed in one file as shown below. In the example below, three patterns are listed in the file named query_patterns.txt.

$ cat query_patterns.txt
WKDY
ERFY
YREV

We then query the reference sequences using the patterns from the file query_patterns.txt and write the output to search_results.txt.

$ grep -f query_patterns.txt reference_sequences_file.txt > search_results.txt

We retrieved a total of 705 sequences from the 10 million sequences that contained any of the three patterns listed in query_patterns.txt.

$ wc -l search_results.txt
705

Conditioning on gene usage

If one is interested in using/testing sequences that match only a particular V or J gene, the search_results.txt could further be filtered to match the specific genes of interest. As an example, if the user is interested only in those sequences where TRBV5 or TRBV27 genes are used, one could further filter the search_results.txt as shown below:

$ grep "TRBV27\|TRBV5" search_results.txt > search_results_trbv5_trbv27.txt
$ wc -l search_results_trbv5_trbv27.txt
114

Wall time estimate for pattern matching

In the examples that we have shown above, generation of 10 million sequences on 40 CPUs took less than 1 minute wall time and the query of the reference sequences using either bionumpy or unix grep took less than 1 minute wall time when we queried three patterns. The query time can increase with increase in the number of patterns, but for AIRR benchmarking datasets-relevant number of patterns, the query time could still be reasonably efficient.

File format of ground truth signal sequences

The ground truth signal sequences file is expected to be a tab-delimited file with at least 3 columns and without a header line. See below for file formats that are accepted.

Supplying nucleotide sequences is optional. If nucleotide sequences are not supplied, nucleotide sequences are not exported in the simulated repertoires by default. This behavior can be changed with export_nt parameter.

When including nucleotide sequences
TGTGCCAGCAGTTTATCGCCGGGACTGGCCTACGAGCAGTACTTC	CASSLSPGLAYEQYF	TRBV27	TRBJ2-7
TGTGCCAGCAAAGTCAGAATTGCTGCAACTAATGAAAAACTGTTTTTT	CASKVRIAATNEKLFF	TRBV5-6	TRBJ1-4
TGCAGTGCCGACTCCAAGAACAGAGGAGCGGGGGGGGAGGCAAGCTCCTACGAGCAGTACTTC	CSADSKNRGAGGEASSYEQYF	TRBV20-1	TRBJ2-7
TGTGCCAGCATCGGTGGCGGGACTAGTCTCTCCTACAATGAGCAGTTCTTC	CASIGGGTSLSYNEQFF	TRBV7-9	TRBJ2-1

When not including nucleotide sequences
.	CASSLSPGLAYEQYF	TRBV27	TRBJ2-7
.	CASKVRIAATNEKLFF	TRBV5-6	TRBJ1-4
.	CSADSKNRGAGGEASSYEQYF	TRBV20-1	TRBJ2-7
.	CASIGGGTSLSYNEQFF	TRBV7-9	TRBJ2-1

When not including nucleotide sequences
NA	CASSLSPGLAYEQYF	TRBV27	TRBJ2-7
NA	CASKVRIAATNEKLFF	TRBV5-6	TRBJ1-4
NA	CSADSKNRGAGGEASSYEQYF	TRBV20-1	TRBJ2-7
NA	CASIGGGTSLSYNEQFF	TRBV7-9	TRBJ2-1

When not supplying nucleotide sequences field
CASSLSPGLAYEQYF	TRBV27	TRBJ2-7
CASKVRIAATNEKLFF	TRBV5-6	TRBJ1-4
CSADSKNRGAGGEASSYEQYF	TRBV20-1	TRBJ2-7
CASIGGGTSLSYNEQFF	TRBV7-9	TRBJ2-1

pgen_count_mapping file format

The user is not required to supply this file unless the user desires to generate simulated datasets based on a custom model derived from a specific experimental dataset.

For both the signal sequences and remaining public sequences, user could supply custom empirical relation between generation probability and population incidence. The file format for both of those files is shown below. The file should be tab-delimited with required fields: “pgen_left”, “pgen_right”, “sample_size_prop_left”, “sample_size_prop_right”, “prob”.
To prepare such files based on a dataset of interest, one should compute empirical probabilities (prob) of observing sequences within a range of population incidence levels (sample_size_prop_left and sample_size_prop_right) for a given range of generation probability (pgen_left and pgen_right).
An example of such file is shown below:

pgen_count_mapping file format and example
pgen_left	pgen_right	sample_size_prop_left	sample_size_prop_right	prob
-0.001	1e-20	0.0019299999999999999	0.00439	0.8867159551633675
-0.001	1e-20	0.00439	0.00586	0.04817553064631529
-0.001	1e-20	0.00732	0.0117	0.025280228953016935
-0.001	1e-20	0.00586	0.00732	0.022776055330312427
-0.001	1e-20	0.0117	0.019	0.010255187216789887
-0.001	1e-20	0.019	0.0512	0.004889100882423086
-0.001	1e-20	0.0512	0.943	0.0019079418077748629
1e-20	1e-15	0.0019299999999999999	0.00439	0.9316377432726761
1e-20	1e-15	0.00439	0.00586	0.03456755427527577
1e-20	1e-15	0.00586	0.00732	0.01352490690648493
1e-20	1e-15	0.00732	0.0117	0.012225110658329236
1e-20	1e-15	0.0117	0.019	0.0036183517178388254
1e-20	1e-15	0.019	0.0512	0.002845499894611115
1e-20	1e-15	0.0512	0.943	0.0015808332747839528
1e-15	1e-14	0.0019299999999999999	0.00439	0.8809934274323887
1e-15	1e-14	0.00439	0.00586	0.0603652623639694
1e-15	1e-14	0.00586	0.00732	0.026640448227561685
1e-15	1e-14	0.00732	0.0117	0.023273354164421937
1e-15	1e-14	0.0117	0.019	0.006599504363753905
1e-15	1e-14	0.019	0.0512	0.0019663829328736126
1e-15	1e-14	0.0512	0.943	0.0001616205150307079
1e-14	1e-13	0.0019299999999999999	0.00439	0.768952812809976
1e-14	1e-13	0.00439	0.00586	0.0895800859666525
1e-14	1e-13	0.00732	0.0117	0.057413442917198056
1e-14	1e-13	0.00586	0.00732	0.048438902271975816
1e-14	1e-13	0.0117	0.019	0.024892541684379575
1e-14	1e-13	0.019	0.0512	0.010249870105332766
1e-14	1e-13	0.0512	0.943	0.00047234424448538097
1e-13	1e-12	0.0019299999999999999	0.00439	0.7716265865063461
1e-13	1e-12	0.00439	0.00586	0.06808060565575595
1e-13	1e-12	0.00732	0.0117	0.05934090403028279
1e-13	1e-12	0.00586	0.00732	0.0412769984413271
1e-13	1e-12	0.0117	0.019	0.03406813627254509
1e-13	1e-12	0.019	0.0512	0.02265642395902917
1e-13	1e-12	0.0512	0.943	0.002950345134713872
1e-12	1e-11	0.0019299999999999999	0.00439	0.9637811113220949
1e-12	1e-11	0.00439	0.00586	0.013495931528718414
1e-12	1e-11	0.00732	0.0117	0.007609483019319085
1e-12	1e-11	0.00586	0.00732	0.00584425174589109
1e-12	1e-11	0.0117	0.019	0.004486922519709405
1e-12	1e-11	0.019	0.0512	0.003966495769774458
1e-12	1e-11	0.0512	0.943	0.0008158040944926191
1e-11	1e-10	0.0019299999999999999	0.00439	0.9675325841713831
1e-11	1e-10	0.00439	0.00586	0.02028032266705611
1e-11	1e-10	0.00586	0.00732	0.006322762334892417
1e-11	1e-10	0.00732	0.0117	0.004386728759275038
1e-11	1e-10	0.0117	0.019	0.0010441617146431771
1e-11	1e-10	0.019	0.0512	0.0003826770681938345
1e-11	1e-10	0.0512	0.943	5.076328455632499e-05
1e-10	1e-09	0.0019299999999999999	0.00439	0.8350908568888563
1e-10	1e-09	0.00439	0.00586	0.07976968571914192
1e-10	1e-09	0.00586	0.00732	0.036914283085982824
1e-10	1e-09	0.00732	0.0117	0.035715530564775576
1e-10	1e-09	0.0117	0.019	0.009978644525678853
1e-10	1e-09	0.019	0.0512	0.0024663183241324764
1e-10	1e-09	0.0512	0.943	6.468089143204572e-05
1e-09	1e-08	0.0019299999999999999	0.00439	0.4951827356202455
1e-09	1e-08	0.00732	0.0117	0.1390661087283864
1e-09	1e-08	0.00439	0.00586	0.12290976272212115
1e-09	1e-08	0.0117	0.019	0.08791235702984151
1e-09	1e-08	0.00586	0.00732	0.08482276197758802
1e-09	1e-08	0.019	0.0512	0.06275295072463237
1e-09	1e-08	0.0512	0.943	0.00735332319718501
1e-08	0.01	0.019	0.0512	0.26781019279511525
1e-08	0.01	0.0019299999999999999	0.00439	0.17710858570927396
1e-08	0.01	0.0512	0.943	0.16489284123474018
1e-08	0.01	0.0117	0.019	0.14483284651679812
1e-08	0.01	0.00732	0.0117	0.12801895233928157
1e-08	0.01	0.00439	0.00586	0.06279342029019784
1e-08	0.01	0.00586	0.00732	0.054543161114593064

Custom models of realistic receptor sharing for different AIRR loci

Note that in simAIRR package, we have provided empirical models for realistic receptor sequence sharing just for TCRB loci, based on the public availability of a large experimental dataset. For other AIRR loci, custom models will be added into future versions of simAIRR when large-scale experimental data for other loci becomes available. For now users are encouraged to generate their own custom models of realistic receptor sharing for other loci (if sufficiently large data available to them). Note that, based on our analyses, a sample size of 200 (100 each of cases and controls) approximates well the realistic receptor sequence sharing.
Through usecases_simairr python package, we provide console scripts that can be run through command line to learn custom models for realistic receptor sequence sharing. Each console script requires some input parameters from the users. To know which input arguments are required for each console script, we recommend using the --help argument of each console script. For instance, concat_airr --help.
Note that the following console scripts are expected to be run in a sequence.

Command	Description
`concat_airr`	Concatenate the repertoire files into two separate files that will be needed by compAIRR tool to count the frequencies of each unique clones
`run_compairr_seqcounts`	Run compAIRR tool to count the frequencies of each unique clones.
`compute_pgen_public`	Compute the generation probabilities of each sequence using OLGA tool.
`compute_pval`	Computes p-values as the probability of observing a public sequence in the same or higher number of repertoires as it was observed in for a given dataset.
`concat_pdata`	Concatenate the files generated through previous step to produce one large file that will be used in the next step
`gen_pgen_count_map`	Generate pgen_count_map files in the required format desired by simAIRR