Tutorials
Running simAIRR
simAIRR can be run through a command-line interface with a single user-supplied parameter - the path to the configuration specification file. Example below:
$ sim_airr -i <path_to_specification_file>
Configuration file format
The configuration for simAIRR’s simulations are to be supplied through a flat config file in YAML format. The snippets below show examples of the configuration for different usage modes of simAIRR. See simAIRR arguments for description and default settings of parameters.
Signal implantation mode (without much customisation)
In the example specification below, since no values were supplied for the non-required arguments, default settings will be used for some parameters (that are not shown in the snippet below). See simAIRR arguments for default settings of other parameters.
mode: signal_feasibility_assessment
olga_model: humanTRB
n_repertoires: 10
n_sequences: 10
n_threads: 2
signal_sequences_file: /path/to/ground_truth_signal_sequences.tsv
phenotype_burden: 3
Signal implantation mode (more customisation)
See acceptable file formats of signal sequences file File format of ground truth signal sequences
mode: signal_implantation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 10
seed: 1234
store_intermediate_files: True
n_sequences: 10
n_threads: 2
public_seq_proportion: 0.08 # 8% of the total unique sequences will be shared across repertoires
public_seq_pgen_count_mapping_file: /path/to/public_seq_pgen_count_mapping.tsv
signal_pgen_count_mapping_file: /path/to/signal_pgen_count_mapping.tsv
signal_sequences_file: /path/to/ground_truth_signal_sequences.tsv
positive_label_rate: 0.3 # 30% of the total repertoires will receive signal implantation
phenotype_burden: 3
phenotype_pool_size: 100 # use only any 100 of the total supplied sequences as a signal superset
allow_closer_phenotype_burden: False # whether to allow signal implantation if the feasible phenotype burden is closer to the user-desired phenotype burden (although it is not exact match)
export_nt: False # whether to export nucleotide sequences in the simulated repertoire files
Signal feasibility assessment mode
It is possible that user-supplied signal sequences cannot be sufficient to meet the user-desired specifications of implantation statistics including desired phenotype burden, number of positive class-labeled repertoires and so on based on realistic levels of population incidence given the generation probability.
In such a case, signal_implantation mode will not succeed in generating simulated repertoires. Rather, guidance will be provided on the signal implantation statistics (what is realistically possible) that will help the user to either tune the simulation parameters or supply a different set of signal sequences or both.
To help with this process of fine-tuning a set of signal sequences file and parameters of simulation, signal_feasibility_assessment mode coule be used. All the parameters of signal_implantation mode and signal_feasibility_assessment mode are identical. The only difference is that in signal_implantation mode, if implantation was deemed feasible, it will proceed with expensive computation steps of baseline repertoire generation and public component correction, whereas the signal_feasibility_assessment mode will stop even if the implantation was deemed feasible.
Baseline repertoires generation
Since the signal_implantation mode involves a sequence of steps that also involves baseline repertoires generation, this functionality is also made available to be run in a separate mode. Although generation of baseline repertoires can be accomplished with a few lines of code around existing tools, the parallelised version of this functionality implemented in baseline_repertoire_generation mode may turn out to be useful.
mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 20
n_sequences: 10
n_threads: 2
Public component correction
When synthetic AIRR datasets are generated using sampling from know V(D)J recombination models using existing tools, the resulting repertoires represent naive repertoires that have not experienced any antigen events. Thus, the proportion of public (shared) sequences in such AIRR datasets will be lower than what is observed in experimental AIRR datasets of antigen-experienced repertoires. To match real-world experimental datasets in terms of public sequences, simAIRR’s workflows include a public component correction step, where a fraction of the total unique sequences in the synthetic AIRR dataset (public_seq_proportion) will be forced to be shared across repertoires. The sharing pattern will be determined based on empirically learnt relation between generation probability and population incidence of sequences. With sampling from known V(D)J recombination models, one cannot exclude the possibility of observing same sequence twice; public_component_correction mode filters out duplicate sequences before making the sequences public.
mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 20
n_sequences: 10
n_threads: 2
public_seq_proportion: 0.12 # 12% of the total unique sequences will be shared across repertoires. Default is 10% if this argument is not supplied.
public_seq_pgen_count_mapping_file: /path/to/public_seq_pgen_count_mapping.tsv # default is a real-world experimental dataset calibrated mapping that is included with simAIRR
Querying sequences enriched for k-mer like patterns
Using simAIRR and the new bionumpy library, one could query and retrieve sequences enriched for k-mer like patterns very easily with just a few lines of Python code. Below, we provide a simple Python recipe.
Generation of reference sequences to query against
First, use the simAIRR’s Baseline repertoires generation mode to generate a large number of reference sequences to query against. For instance, in the code chunk below, we generate a single file with ten million sequences.
mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 1
n_sequences: 10000000
n_threads: 1
Note that instead of generating 10 million sequences in a single file, if multiple processes are used using n_threads argument (if more CPUs are available), and the n_sequences are split across multiple repertoires using n_threads argument, the reference sequence generation will be much quicker. For instance, the following configuration will generate 10 million sequences across 40 files in less than 2 minutes wall time.
mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 40
n_sequences: 250000
n_threads: 40
To combine them into a single file:
$ cat /path/to/output/directory/* > reference_sequences.txt
Alternatively, one could use the reference sequences of their choice to query the sequences for k-mer like patterns. The code chunks above is just one example to get a large number of sequences to make queries.
Let us assume that the ten million sequences that were generated in previous step are stored in a file named
reference_sequences.txt
.
Pattern matching recipe using bionumpy
Next, we import relevant functionalities from the bionumpy library to read in the
reference_sequences.txt
, create a lookup index and querying the sequences.bionumpy works with popular file formats of biological sequence data. Since bionumpy does not know the file format of our
reference_sequences.txt
, we first need to tell bionumpy that the first two fields of this custom file format are a dna sequence and an amino acid sequence. The code chunk below shows how to do this. This is routine code and can just be copied as shown below.>>> from bionumpy.io.delimited_buffers import DelimitedBuffer, get_bufferclass_for_datatype >>> from bionumpy.bnpdataclass import bnpdataclass >>> import bionumpy as bnp >>> from bionumpy.sequence.indexing.wildcard_index import WildCardLookup
>>> @bnpdataclass ... class Olga: ... dna: bnp.DNAEncoding ... amino_acid: bnp.AminoAcidEncoding
>>> class OlgaBuffer(DelimitedBuffer): ... dataclass = Olga
Using the newly created buffer type, we read-in the files with that file format.
>>> olga_sequence_data = bnp.open(filename="reference_sequences.txt", buffer_type=OlgaBuffer).read() >>> olga_sequence_data Olga with 10000000 entries dna amino_acid TGCGCCACCTGGGGGGACGA... CATWGDEQYF TGTGCCAGCTCACCTACGAA... CASSPTNSGSNYGYTF TGCGGGCCCGTAATGAACAC... CGPVMNTEAFF TGTGCCAGCAGTGAAGCGCG... CASSEARPARMYGYTF TGTGCCAGCAGTAGTGGGAC... CASSSGTGPDQPQHF TGTGCCAGCAACCTAGCGGG... CASNLAGKNTGELFF TGTGCCAGCAGCCAACCGGG... CASSQPGGSGNYGYTF TGCGCCAGCAGCCGCGGCCT... CASSRGLREETQYF TGTGCCAGCAGCCAAGTCTC... CASSQVSRQDSSYEQYF TGTGCCAGCAGGCCGGGACA... CASRPGQGAPGWEDNYGYTF
We create a wildcard lookup table on the amino acid sequences of the file that we read-in.
>>> aminoacid_wildcard_lookup = WildCardLookup.create_lookup(olga_sequence_data.amino_acid) >>> aminoacid_wildcard_lookup Lookup on WildcardIndex of 10000000 sequences
We then retrieve all the sequences that contain a pattern like “RG.”, where the “.” indicates a wildcard character.
>>> rg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("RG.") >>> rg_wildcard_sequences encoded_ragged_array(['CASSRGLREETQYF', 'CASGCRGTSGGASLDEQFF', 'CATSDLGVRRGALIATNEKL...', 'CATRRGYGYTF', 'CATRGAKRIDEQYF', 'CASSLRGQGLLRGNQPQHF', 'CASSFSCLRGESSYNEQFF', 'CASRGLFPQPQHF', 'CASSGCRGGNTEAFF', 'CASSEADRGRKAFF', 'CASRVASRGRDKQPQHF'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY'))
Presence of multiple motifs in sequences
One may be interested in multiple patterns in sequences. For instance, let us assume that one is interested in sequences that contain either of the following pattern:
GG**GG
orGG**YG
. One can consider this as two cases:
One, where the user is interested in mining those sequences that contain patterns
GG
and either of anotherGG
orYG
that are always two residues apart, where the two residues can be any residues.Alternatively, mining for the above patterns will also result in sequences that always contain a
GG
in addition to anotherGG
orYG
.
In either case, the following code block will mine the sequences of interest:
>>> gg_gg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("GG..GG")
>>> gg_gg_wildcard_sequences
encoded_ragged_array(['CASSVKPSGGVAGGETQYF',
'CSARPGGGEGGENQPQHF',
'CASGGRAGGVWAYEQYF',
'CASSTEGGLAGGLTYNEQFF',
...
'CASSGGPEGGFSSTDTQYF'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY'))
>>> len(gg_gg_wildcard_sequences)
14968
>>> gg_yg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("GG..YG")
>>> len(gg_yg_wildcard_sequences)
28070
If the user is interested in filtering the search results above based on further criteria like gene usage or for containing additional motifs, the resulting sequence set can be considered a proxy for complex signal based on multiple criteria.
Recipe using grep
Alternatively, one could use the unix grep
function instead. If the user wants to query multiple patterns from the reference sequences, all the patterns could be listed in one file as shown below. In the example below, three patterns are listed in the file named query_patterns.txt
.
$ cat query_patterns.txt
WKDY
ERFY
YREV
We then query the reference sequences using the patterns from the file
query_patterns.txt
and write the output tosearch_results.txt
.
$ grep -f query_patterns.txt reference_sequences_file.txt > search_results.txt
We retrieved a total of 705 sequences from the 10 million sequences that contained any of the three patterns listed in
query_patterns.txt
.
$ wc -l search_results.txt
705
Conditioning on gene usage
If one is interested in using/testing sequences that match only a particular V or J gene, the search_results.txt
could further be filtered to match the specific genes of interest. As an example, if the user is interested only in those sequences where TRBV5 or TRBV27 genes are used, one could further filter the search_results.txt
as shown below:
$ grep "TRBV27\|TRBV5" search_results.txt > search_results_trbv5_trbv27.txt
$ wc -l search_results_trbv5_trbv27.txt
114
Wall time estimate for pattern matching
In the examples that we have shown above, generation of 10 million sequences on 40 CPUs took less than 1 minute wall time and the query of the reference sequences using either bionumpy or unix grep took less than 1 minute wall time when we queried three patterns. The query time can increase with increase in the number of patterns, but for AIRR benchmarking datasets-relevant number of patterns, the query time could still be reasonably efficient.
File format of ground truth signal sequences
The ground truth signal sequences file is expected to be a tab-delimited file with at least 3 columns and without a header line. See below for file formats that are accepted.
Supplying nucleotide sequences is optional. If nucleotide sequences are not supplied, nucleotide sequences are not exported in the simulated repertoires by default. This behavior can be changed with export_nt parameter.
TGTGCCAGCAGTTTATCGCCGGGACTGGCCTACGAGCAGTACTTC |
CASSLSPGLAYEQYF |
TRBV27 |
TRBJ2-7 |
TGTGCCAGCAAAGTCAGAATTGCTGCAACTAATGAAAAACTGTTTTTT |
CASKVRIAATNEKLFF |
TRBV5-6 |
TRBJ1-4 |
TGCAGTGCCGACTCCAAGAACAGAGGAGCGGGGGGGGAGGCAAGCTCCTACGAGCAGTACTTC |
CSADSKNRGAGGEASSYEQYF |
TRBV20-1 |
TRBJ2-7 |
TGTGCCAGCATCGGTGGCGGGACTAGTCTCTCCTACAATGAGCAGTTCTTC |
CASIGGGTSLSYNEQFF |
TRBV7-9 |
TRBJ2-1 |
. |
CASSLSPGLAYEQYF |
TRBV27 |
TRBJ2-7 |
. |
CASKVRIAATNEKLFF |
TRBV5-6 |
TRBJ1-4 |
. |
CSADSKNRGAGGEASSYEQYF |
TRBV20-1 |
TRBJ2-7 |
. |
CASIGGGTSLSYNEQFF |
TRBV7-9 |
TRBJ2-1 |
NA |
CASSLSPGLAYEQYF |
TRBV27 |
TRBJ2-7 |
NA |
CASKVRIAATNEKLFF |
TRBV5-6 |
TRBJ1-4 |
NA |
CSADSKNRGAGGEASSYEQYF |
TRBV20-1 |
TRBJ2-7 |
NA |
CASIGGGTSLSYNEQFF |
TRBV7-9 |
TRBJ2-1 |
CASSLSPGLAYEQYF |
TRBV27 |
TRBJ2-7 |
CASKVRIAATNEKLFF |
TRBV5-6 |
TRBJ1-4 |
CSADSKNRGAGGEASSYEQYF |
TRBV20-1 |
TRBJ2-7 |
CASIGGGTSLSYNEQFF |
TRBV7-9 |
TRBJ2-1 |
pgen_count_mapping file format
The user is not required to supply this file unless the user desires to generate simulated datasets based on a custom model derived from a specific experimental dataset.
For both the signal sequences and remaining public sequences, user could supply custom empirical relation between generation probability and population incidence. The file format for both of those files is shown below. The file should be tab-delimited with required fields: “pgen_left”, “pgen_right”, “sample_size_prop_left”, “sample_size_prop_right”, “prob”.
To prepare such files based on a dataset of interest, one should compute empirical probabilities (prob) of observing sequences within a range of population incidence levels (sample_size_prop_left and sample_size_prop_right) for a given range of generation probability (pgen_left and pgen_right).
An example of such file is shown below:
pgen_left |
pgen_right |
sample_size_prop_left |
sample_size_prop_right |
prob |
---|---|---|---|---|
-0.001 |
1e-20 |
0.0019299999999999999 |
0.00439 |
0.8867159551633675 |
-0.001 |
1e-20 |
0.00439 |
0.00586 |
0.04817553064631529 |
-0.001 |
1e-20 |
0.00732 |
0.0117 |
0.025280228953016935 |
-0.001 |
1e-20 |
0.00586 |
0.00732 |
0.022776055330312427 |
-0.001 |
1e-20 |
0.0117 |
0.019 |
0.010255187216789887 |
-0.001 |
1e-20 |
0.019 |
0.0512 |
0.004889100882423086 |
-0.001 |
1e-20 |
0.0512 |
0.943 |
0.0019079418077748629 |
1e-20 |
1e-15 |
0.0019299999999999999 |
0.00439 |
0.9316377432726761 |
1e-20 |
1e-15 |
0.00439 |
0.00586 |
0.03456755427527577 |
1e-20 |
1e-15 |
0.00586 |
0.00732 |
0.01352490690648493 |
1e-20 |
1e-15 |
0.00732 |
0.0117 |
0.012225110658329236 |
1e-20 |
1e-15 |
0.0117 |
0.019 |
0.0036183517178388254 |
1e-20 |
1e-15 |
0.019 |
0.0512 |
0.002845499894611115 |
1e-20 |
1e-15 |
0.0512 |
0.943 |
0.0015808332747839528 |
1e-15 |
1e-14 |
0.0019299999999999999 |
0.00439 |
0.8809934274323887 |
1e-15 |
1e-14 |
0.00439 |
0.00586 |
0.0603652623639694 |
1e-15 |
1e-14 |
0.00586 |
0.00732 |
0.026640448227561685 |
1e-15 |
1e-14 |
0.00732 |
0.0117 |
0.023273354164421937 |
1e-15 |
1e-14 |
0.0117 |
0.019 |
0.006599504363753905 |
1e-15 |
1e-14 |
0.019 |
0.0512 |
0.0019663829328736126 |
1e-15 |
1e-14 |
0.0512 |
0.943 |
0.0001616205150307079 |
1e-14 |
1e-13 |
0.0019299999999999999 |
0.00439 |
0.768952812809976 |
1e-14 |
1e-13 |
0.00439 |
0.00586 |
0.0895800859666525 |
1e-14 |
1e-13 |
0.00732 |
0.0117 |
0.057413442917198056 |
1e-14 |
1e-13 |
0.00586 |
0.00732 |
0.048438902271975816 |
1e-14 |
1e-13 |
0.0117 |
0.019 |
0.024892541684379575 |
1e-14 |
1e-13 |
0.019 |
0.0512 |
0.010249870105332766 |
1e-14 |
1e-13 |
0.0512 |
0.943 |
0.00047234424448538097 |
1e-13 |
1e-12 |
0.0019299999999999999 |
0.00439 |
0.7716265865063461 |
1e-13 |
1e-12 |
0.00439 |
0.00586 |
0.06808060565575595 |
1e-13 |
1e-12 |
0.00732 |
0.0117 |
0.05934090403028279 |
1e-13 |
1e-12 |
0.00586 |
0.00732 |
0.0412769984413271 |
1e-13 |
1e-12 |
0.0117 |
0.019 |
0.03406813627254509 |
1e-13 |
1e-12 |
0.019 |
0.0512 |
0.02265642395902917 |
1e-13 |
1e-12 |
0.0512 |
0.943 |
0.002950345134713872 |
1e-12 |
1e-11 |
0.0019299999999999999 |
0.00439 |
0.9637811113220949 |
1e-12 |
1e-11 |
0.00439 |
0.00586 |
0.013495931528718414 |
1e-12 |
1e-11 |
0.00732 |
0.0117 |
0.007609483019319085 |
1e-12 |
1e-11 |
0.00586 |
0.00732 |
0.00584425174589109 |
1e-12 |
1e-11 |
0.0117 |
0.019 |
0.004486922519709405 |
1e-12 |
1e-11 |
0.019 |
0.0512 |
0.003966495769774458 |
1e-12 |
1e-11 |
0.0512 |
0.943 |
0.0008158040944926191 |
1e-11 |
1e-10 |
0.0019299999999999999 |
0.00439 |
0.9675325841713831 |
1e-11 |
1e-10 |
0.00439 |
0.00586 |
0.02028032266705611 |
1e-11 |
1e-10 |
0.00586 |
0.00732 |
0.006322762334892417 |
1e-11 |
1e-10 |
0.00732 |
0.0117 |
0.004386728759275038 |
1e-11 |
1e-10 |
0.0117 |
0.019 |
0.0010441617146431771 |
1e-11 |
1e-10 |
0.019 |
0.0512 |
0.0003826770681938345 |
1e-11 |
1e-10 |
0.0512 |
0.943 |
5.076328455632499e-05 |
1e-10 |
1e-09 |
0.0019299999999999999 |
0.00439 |
0.8350908568888563 |
1e-10 |
1e-09 |
0.00439 |
0.00586 |
0.07976968571914192 |
1e-10 |
1e-09 |
0.00586 |
0.00732 |
0.036914283085982824 |
1e-10 |
1e-09 |
0.00732 |
0.0117 |
0.035715530564775576 |
1e-10 |
1e-09 |
0.0117 |
0.019 |
0.009978644525678853 |
1e-10 |
1e-09 |
0.019 |
0.0512 |
0.0024663183241324764 |
1e-10 |
1e-09 |
0.0512 |
0.943 |
6.468089143204572e-05 |
1e-09 |
1e-08 |
0.0019299999999999999 |
0.00439 |
0.4951827356202455 |
1e-09 |
1e-08 |
0.00732 |
0.0117 |
0.1390661087283864 |
1e-09 |
1e-08 |
0.00439 |
0.00586 |
0.12290976272212115 |
1e-09 |
1e-08 |
0.0117 |
0.019 |
0.08791235702984151 |
1e-09 |
1e-08 |
0.00586 |
0.00732 |
0.08482276197758802 |
1e-09 |
1e-08 |
0.019 |
0.0512 |
0.06275295072463237 |
1e-09 |
1e-08 |
0.0512 |
0.943 |
0.00735332319718501 |
1e-08 |
0.01 |
0.019 |
0.0512 |
0.26781019279511525 |
1e-08 |
0.01 |
0.0019299999999999999 |
0.00439 |
0.17710858570927396 |
1e-08 |
0.01 |
0.0512 |
0.943 |
0.16489284123474018 |
1e-08 |
0.01 |
0.0117 |
0.019 |
0.14483284651679812 |
1e-08 |
0.01 |
0.00732 |
0.0117 |
0.12801895233928157 |
1e-08 |
0.01 |
0.00439 |
0.00586 |
0.06279342029019784 |
1e-08 |
0.01 |
0.00586 |
0.00732 |
0.054543161114593064 |
Custom models of realistic receptor sharing for different AIRR loci
Note that in simAIRR package, we have provided empirical models for realistic receptor sequence sharing just for TCRB loci, based on the public availability of a large experimental dataset. For other AIRR loci, custom models will be added into future versions of simAIRR when large-scale experimental data for other loci becomes available. For now users are encouraged to generate their own custom models of realistic receptor sharing for other loci (if sufficiently large data available to them). Note that, based on our analyses, a sample size of 200 (100 each of cases and controls) approximates well the realistic receptor sequence sharing.
Through usecases_simairr python package, we provide console scripts that can be run through command line to learn custom models for realistic receptor sequence sharing. Each console script requires some input parameters from the users. To know which input arguments are required for each console script, we recommend using the
--help
argument of each console script. For instance,concat_airr --help
.Note that the following console scripts are expected to be run in a sequence.
Command |
Description |
---|---|
|
Concatenate the repertoire files into two separate files that will be needed by compAIRR tool to count the frequencies of each unique clones |
|
Run compAIRR tool to count the frequencies of each unique clones. |
|
Compute the generation probabilities of each sequence using OLGA tool. |
|
Computes p-values as the probability of observing a public sequence in the same or higher number of repertoires as it was observed in for a given dataset. |
|
Concatenate the files generated through previous step to produce one large file that will be used in the next step |
|
Generate pgen_count_map files in the required format desired by simAIRR |