Tutorials

Running simAIRR

simAIRR can be run through a command-line interface with a single user-supplied parameter - the path to the configuration specification file. Example below:

$ sim_airr -i <path_to_specification_file>

Configuration file format

The configuration for simAIRR’s simulations are to be supplied through a flat config file in YAML format. The snippets below show examples of the configuration for different usage modes of simAIRR. See simAIRR arguments for description and default settings of parameters.

Signal implantation mode (without much customisation)

In the example specification below, since no values were supplied for the non-required arguments, default settings will be used for some parameters (that are not shown in the snippet below). See simAIRR arguments for default settings of other parameters.

mode: signal_feasibility_assessment
olga_model: humanTRB
n_repertoires: 10
n_sequences: 10
n_threads: 2
signal_sequences_file: /path/to/ground_truth_signal_sequences.tsv
phenotype_burden: 3

Signal implantation mode (more customisation)

mode: signal_implantation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 10
seed: 1234
store_intermediate_files: True
n_sequences: 10
n_threads: 2
public_seq_proportion: 0.08 # 8% of the total unique sequences will be shared across repertoires
public_seq_pgen_count_mapping_file: /path/to/public_seq_pgen_count_mapping.tsv
signal_pgen_count_mapping_file: /path/to/signal_pgen_count_mapping.tsv
signal_sequences_file: /path/to/ground_truth_signal_sequences.tsv
positive_label_rate: 0.3 # 30% of the total repertoires will receive signal implantation
phenotype_burden: 3
phenotype_pool_size: 100 # use only any 100 of the total supplied sequences as a signal superset
allow_closer_phenotype_burden: False # whether to allow signal implantation if the feasible phenotype burden is closer to the user-desired phenotype burden (although it is not exact match)
export_nt: False # whether to export nucleotide sequences in the simulated repertoire files

Signal feasibility assessment mode

  • It is possible that user-supplied signal sequences cannot be sufficient to meet the user-desired specifications of implantation statistics including desired phenotype burden, number of positive class-labeled repertoires and so on based on realistic levels of population incidence given the generation probability.

  • In such a case, signal_implantation mode will not succeed in generating simulated repertoires. Rather, guidance will be provided on the signal implantation statistics (what is realistically possible) that will help the user to either tune the simulation parameters or supply a different set of signal sequences or both.

  • To help with this process of fine-tuning a set of signal sequences file and parameters of simulation, signal_feasibility_assessment mode coule be used. All the parameters of signal_implantation mode and signal_feasibility_assessment mode are identical. The only difference is that in signal_implantation mode, if implantation was deemed feasible, it will proceed with expensive computation steps of baseline repertoire generation and public component correction, whereas the signal_feasibility_assessment mode will stop even if the implantation was deemed feasible.

Baseline repertoires generation

Since the signal_implantation mode involves a sequence of steps that also involves baseline repertoires generation, this functionality is also made available to be run in a separate mode. Although generation of baseline repertoires can be accomplished with a few lines of code around existing tools, the parallelised version of this functionality implemented in baseline_repertoire_generation mode may turn out to be useful.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 20
n_sequences: 10
n_threads: 2

Public component correction

When synthetic AIRR datasets are generated using sampling from know V(D)J recombination models using existing tools, the resulting repertoires represent naive repertoires that have not experienced any antigen events. Thus, the proportion of public (shared) sequences in such AIRR datasets will be lower than what is observed in experimental AIRR datasets of antigen-experienced repertoires. To match real-world experimental datasets in terms of public sequences, simAIRR’s workflows include a public component correction step, where a fraction of the total unique sequences in the synthetic AIRR dataset (public_seq_proportion) will be forced to be shared across repertoires. The sharing pattern will be determined based on empirically learnt relation between generation probability and population incidence of sequences. With sampling from known V(D)J recombination models, one cannot exclude the possibility of observing same sequence twice; public_component_correction mode filters out duplicate sequences before making the sequences public.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 20
n_sequences: 10
n_threads: 2
public_seq_proportion: 0.12 # 12% of the total unique sequences will be shared across repertoires. Default is 10% if this argument is not supplied.
public_seq_pgen_count_mapping_file: /path/to/public_seq_pgen_count_mapping.tsv # default is a real-world experimental dataset calibrated mapping that is included with simAIRR

Querying sequences enriched for k-mer like patterns

Using simAIRR and the new bionumpy library, one could query and retrieve sequences enriched for k-mer like patterns very easily with just a few lines of Python code. Below, we provide a simple Python recipe.

Generation of reference sequences to query against

  • First, use the simAIRR’s Baseline repertoires generation mode to generate a large number of reference sequences to query against. For instance, in the code chunk below, we generate a single file with ten million sequences.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 1
n_sequences: 10000000
n_threads: 1

Note that instead of generating 10 million sequences in a single file, if multiple processes are used using n_threads argument (if more CPUs are available), and the n_sequences are split across multiple repertoires using n_threads argument, the reference sequence generation will be much quicker. For instance, the following configuration will generate 10 million sequences across 40 files in less than 2 minutes wall time.

mode: baseline_repertoire_generation
olga_model: humanTRB
output_path: /path/to/output/directory
n_repertoires: 40
n_sequences: 250000
n_threads: 40

To combine them into a single file:

$ cat /path/to/output/directory/* > reference_sequences.txt
  • Alternatively, one could use the reference sequences of their choice to query the sequences for k-mer like patterns. The code chunks above is just one example to get a large number of sequences to make queries.

  • Let us assume that the ten million sequences that were generated in previous step are stored in a file named reference_sequences.txt.

Pattern matching recipe using bionumpy

  • Next, we import relevant functionalities from the bionumpy library to read in the reference_sequences.txt, create a lookup index and querying the sequences.

  • bionumpy works with popular file formats of biological sequence data. Since bionumpy does not know the file format of our reference_sequences.txt, we first need to tell bionumpy that the first two fields of this custom file format are a dna sequence and an amino acid sequence. The code chunk below shows how to do this. This is routine code and can just be copied as shown below.

    >>> from bionumpy.io.delimited_buffers import DelimitedBuffer, get_bufferclass_for_datatype
    >>> from bionumpy.bnpdataclass import bnpdataclass
    >>> import bionumpy as bnp
    >>> from bionumpy.sequence.indexing.wildcard_index import WildCardLookup
    
    >>> @bnpdataclass
    ... class Olga:
    ...   dna: bnp.DNAEncoding
    ...   amino_acid: bnp.AminoAcidEncoding
    
    >>> class OlgaBuffer(DelimitedBuffer):
    ...    dataclass = Olga
    
  • Using the newly created buffer type, we read-in the files with that file format.

    >>> olga_sequence_data = bnp.open(filename="reference_sequences.txt", buffer_type=OlgaBuffer).read()
    >>> olga_sequence_data
    Olga with 10000000 entries
                          dna               amino_acid
      TGCGCCACCTGGGGGGACGA...               CATWGDEQYF
      TGTGCCAGCTCACCTACGAA...         CASSPTNSGSNYGYTF
      TGCGGGCCCGTAATGAACAC...              CGPVMNTEAFF
      TGTGCCAGCAGTGAAGCGCG...         CASSEARPARMYGYTF
      TGTGCCAGCAGTAGTGGGAC...          CASSSGTGPDQPQHF
      TGTGCCAGCAACCTAGCGGG...          CASNLAGKNTGELFF
      TGTGCCAGCAGCCAACCGGG...         CASSQPGGSGNYGYTF
      TGCGCCAGCAGCCGCGGCCT...           CASSRGLREETQYF
      TGTGCCAGCAGCCAAGTCTC...        CASSQVSRQDSSYEQYF
      TGTGCCAGCAGGCCGGGACA...     CASRPGQGAPGWEDNYGYTF
    
  • We create a wildcard lookup table on the amino acid sequences of the file that we read-in.

    >>> aminoacid_wildcard_lookup = WildCardLookup.create_lookup(olga_sequence_data.amino_acid)
    >>> aminoacid_wildcard_lookup
    Lookup on WildcardIndex of 10000000 sequences
    
  • We then retrieve all the sequences that contain a pattern like “RG.”, where the “.” indicates a wildcard character.

    >>> rg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("RG.")
    >>> rg_wildcard_sequences
    encoded_ragged_array(['CASSRGLREETQYF',
                          'CASGCRGTSGGASLDEQFF',
                          'CATSDLGVRRGALIATNEKL...',
                          'CATRRGYGYTF',
                          'CATRGAKRIDEQYF',
                          'CASSLRGQGLLRGNQPQHF',
                          'CASSFSCLRGESSYNEQFF',
                          'CASRGLFPQPQHF',
                          'CASSGCRGGNTEAFF',
                          'CASSEADRGRKAFF',
                          'CASRVASRGRDKQPQHF'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY'))
    

Presence of multiple motifs in sequences

  • One may be interested in multiple patterns in sequences. For instance, let us assume that one is interested in sequences that contain either of the following pattern: GG**GG or GG**YG. One can consider this as two cases:

  1. One, where the user is interested in mining those sequences that contain patterns GG and either of another GG or YG that are always two residues apart, where the two residues can be any residues.

  2. Alternatively, mining for the above patterns will also result in sequences that always contain a GG in addition to another GG or YG.

In either case, the following code block will mine the sequences of interest:

>>> gg_gg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("GG..GG")
>>> gg_gg_wildcard_sequences
encoded_ragged_array(['CASSVKPSGGVAGGETQYF',
                      'CSARPGGGEGGENQPQHF',
                      'CASGGRAGGVWAYEQYF',
                      'CASSTEGGLAGGLTYNEQFF',
                        ...
                      'CASSGGPEGGFSSTDTQYF'], AlphabetEncoding('ACDEFGHIKLMNPQRSTVWY'))
>>> len(gg_gg_wildcard_sequences)
14968
>>> gg_yg_wildcard_sequences = aminoacid_wildcard_lookup.get_sequences("GG..YG")
>>> len(gg_yg_wildcard_sequences)
28070

If the user is interested in filtering the search results above based on further criteria like gene usage or for containing additional motifs, the resulting sequence set can be considered a proxy for complex signal based on multiple criteria.

Recipe using grep

Alternatively, one could use the unix grep function instead. If the user wants to query multiple patterns from the reference sequences, all the patterns could be listed in one file as shown below. In the example below, three patterns are listed in the file named query_patterns.txt.

$ cat query_patterns.txt
WKDY
ERFY
YREV
  • We then query the reference sequences using the patterns from the file query_patterns.txt and write the output to search_results.txt.

$ grep -f query_patterns.txt reference_sequences_file.txt > search_results.txt
  • We retrieved a total of 705 sequences from the 10 million sequences that contained any of the three patterns listed in query_patterns.txt.

$ wc -l search_results.txt
705

Conditioning on gene usage

If one is interested in using/testing sequences that match only a particular V or J gene, the search_results.txt could further be filtered to match the specific genes of interest. As an example, if the user is interested only in those sequences where TRBV5 or TRBV27 genes are used, one could further filter the search_results.txt as shown below:

$ grep "TRBV27\|TRBV5" search_results.txt > search_results_trbv5_trbv27.txt
$ wc -l search_results_trbv5_trbv27.txt
114

Wall time estimate for pattern matching

  • In the examples that we have shown above, generation of 10 million sequences on 40 CPUs took less than 1 minute wall time and the query of the reference sequences using either bionumpy or unix grep took less than 1 minute wall time when we queried three patterns. The query time can increase with increase in the number of patterns, but for AIRR benchmarking datasets-relevant number of patterns, the query time could still be reasonably efficient.

File format of ground truth signal sequences

The ground truth signal sequences file is expected to be a tab-delimited file with at least 3 columns and without a header line. See below for file formats that are accepted.

Supplying nucleotide sequences is optional. If nucleotide sequences are not supplied, nucleotide sequences are not exported in the simulated repertoires by default. This behavior can be changed with export_nt parameter.

When including nucleotide sequences

TGTGCCAGCAGTTTATCGCCGGGACTGGCCTACGAGCAGTACTTC

CASSLSPGLAYEQYF

TRBV27

TRBJ2-7

TGTGCCAGCAAAGTCAGAATTGCTGCAACTAATGAAAAACTGTTTTTT

CASKVRIAATNEKLFF

TRBV5-6

TRBJ1-4

TGCAGTGCCGACTCCAAGAACAGAGGAGCGGGGGGGGAGGCAAGCTCCTACGAGCAGTACTTC

CSADSKNRGAGGEASSYEQYF

TRBV20-1

TRBJ2-7

TGTGCCAGCATCGGTGGCGGGACTAGTCTCTCCTACAATGAGCAGTTCTTC

CASIGGGTSLSYNEQFF

TRBV7-9

TRBJ2-1

When not including nucleotide sequences

.

CASSLSPGLAYEQYF

TRBV27

TRBJ2-7

.

CASKVRIAATNEKLFF

TRBV5-6

TRBJ1-4

.

CSADSKNRGAGGEASSYEQYF

TRBV20-1

TRBJ2-7

.

CASIGGGTSLSYNEQFF

TRBV7-9

TRBJ2-1

When not including nucleotide sequences

NA

CASSLSPGLAYEQYF

TRBV27

TRBJ2-7

NA

CASKVRIAATNEKLFF

TRBV5-6

TRBJ1-4

NA

CSADSKNRGAGGEASSYEQYF

TRBV20-1

TRBJ2-7

NA

CASIGGGTSLSYNEQFF

TRBV7-9

TRBJ2-1

When not supplying nucleotide sequences field

CASSLSPGLAYEQYF

TRBV27

TRBJ2-7

CASKVRIAATNEKLFF

TRBV5-6

TRBJ1-4

CSADSKNRGAGGEASSYEQYF

TRBV20-1

TRBJ2-7

CASIGGGTSLSYNEQFF

TRBV7-9

TRBJ2-1

pgen_count_mapping file format

The user is not required to supply this file unless the user desires to generate simulated datasets based on a custom model derived from a specific experimental dataset.

  • For both the signal sequences and remaining public sequences, user could supply custom empirical relation between generation probability and population incidence. The file format for both of those files is shown below. The file should be tab-delimited with required fields: “pgen_left”, “pgen_right”, “sample_size_prop_left”, “sample_size_prop_right”, “prob”.

  • To prepare such files based on a dataset of interest, one should compute empirical probabilities (prob) of observing sequences within a range of population incidence levels (sample_size_prop_left and sample_size_prop_right) for a given range of generation probability (pgen_left and pgen_right).

  • An example of such file is shown below:

pgen_count_mapping file format and example

pgen_left

pgen_right

sample_size_prop_left

sample_size_prop_right

prob

-0.001

1e-20

0.0019299999999999999

0.00439

0.8867159551633675

-0.001

1e-20

0.00439

0.00586

0.04817553064631529

-0.001

1e-20

0.00732

0.0117

0.025280228953016935

-0.001

1e-20

0.00586

0.00732

0.022776055330312427

-0.001

1e-20

0.0117

0.019

0.010255187216789887

-0.001

1e-20

0.019

0.0512

0.004889100882423086

-0.001

1e-20

0.0512

0.943

0.0019079418077748629

1e-20

1e-15

0.0019299999999999999

0.00439

0.9316377432726761

1e-20

1e-15

0.00439

0.00586

0.03456755427527577

1e-20

1e-15

0.00586

0.00732

0.01352490690648493

1e-20

1e-15

0.00732

0.0117

0.012225110658329236

1e-20

1e-15

0.0117

0.019

0.0036183517178388254

1e-20

1e-15

0.019

0.0512

0.002845499894611115

1e-20

1e-15

0.0512

0.943

0.0015808332747839528

1e-15

1e-14

0.0019299999999999999

0.00439

0.8809934274323887

1e-15

1e-14

0.00439

0.00586

0.0603652623639694

1e-15

1e-14

0.00586

0.00732

0.026640448227561685

1e-15

1e-14

0.00732

0.0117

0.023273354164421937

1e-15

1e-14

0.0117

0.019

0.006599504363753905

1e-15

1e-14

0.019

0.0512

0.0019663829328736126

1e-15

1e-14

0.0512

0.943

0.0001616205150307079

1e-14

1e-13

0.0019299999999999999

0.00439

0.768952812809976

1e-14

1e-13

0.00439

0.00586

0.0895800859666525

1e-14

1e-13

0.00732

0.0117

0.057413442917198056

1e-14

1e-13

0.00586

0.00732

0.048438902271975816

1e-14

1e-13

0.0117

0.019

0.024892541684379575

1e-14

1e-13

0.019

0.0512

0.010249870105332766

1e-14

1e-13

0.0512

0.943

0.00047234424448538097

1e-13

1e-12

0.0019299999999999999

0.00439

0.7716265865063461

1e-13

1e-12

0.00439

0.00586

0.06808060565575595

1e-13

1e-12

0.00732

0.0117

0.05934090403028279

1e-13

1e-12

0.00586

0.00732

0.0412769984413271

1e-13

1e-12

0.0117

0.019

0.03406813627254509

1e-13

1e-12

0.019

0.0512

0.02265642395902917

1e-13

1e-12

0.0512

0.943

0.002950345134713872

1e-12

1e-11

0.0019299999999999999

0.00439

0.9637811113220949

1e-12

1e-11

0.00439

0.00586

0.013495931528718414

1e-12

1e-11

0.00732

0.0117

0.007609483019319085

1e-12

1e-11

0.00586

0.00732

0.00584425174589109

1e-12

1e-11

0.0117

0.019

0.004486922519709405

1e-12

1e-11

0.019

0.0512

0.003966495769774458

1e-12

1e-11

0.0512

0.943

0.0008158040944926191

1e-11

1e-10

0.0019299999999999999

0.00439

0.9675325841713831

1e-11

1e-10

0.00439

0.00586

0.02028032266705611

1e-11

1e-10

0.00586

0.00732

0.006322762334892417

1e-11

1e-10

0.00732

0.0117

0.004386728759275038

1e-11

1e-10

0.0117

0.019

0.0010441617146431771

1e-11

1e-10

0.019

0.0512

0.0003826770681938345

1e-11

1e-10

0.0512

0.943

5.076328455632499e-05

1e-10

1e-09

0.0019299999999999999

0.00439

0.8350908568888563

1e-10

1e-09

0.00439

0.00586

0.07976968571914192

1e-10

1e-09

0.00586

0.00732

0.036914283085982824

1e-10

1e-09

0.00732

0.0117

0.035715530564775576

1e-10

1e-09

0.0117

0.019

0.009978644525678853

1e-10

1e-09

0.019

0.0512

0.0024663183241324764

1e-10

1e-09

0.0512

0.943

6.468089143204572e-05

1e-09

1e-08

0.0019299999999999999

0.00439

0.4951827356202455

1e-09

1e-08

0.00732

0.0117

0.1390661087283864

1e-09

1e-08

0.00439

0.00586

0.12290976272212115

1e-09

1e-08

0.0117

0.019

0.08791235702984151

1e-09

1e-08

0.00586

0.00732

0.08482276197758802

1e-09

1e-08

0.019

0.0512

0.06275295072463237

1e-09

1e-08

0.0512

0.943

0.00735332319718501

1e-08

0.01

0.019

0.0512

0.26781019279511525

1e-08

0.01

0.0019299999999999999

0.00439

0.17710858570927396

1e-08

0.01

0.0512

0.943

0.16489284123474018

1e-08

0.01

0.0117

0.019

0.14483284651679812

1e-08

0.01

0.00732

0.0117

0.12801895233928157

1e-08

0.01

0.00439

0.00586

0.06279342029019784

1e-08

0.01

0.00586

0.00732

0.054543161114593064

Custom models of realistic receptor sharing for different AIRR loci

  • Note that in simAIRR package, we have provided empirical models for realistic receptor sequence sharing just for TCRB loci, based on the public availability of a large experimental dataset. For other AIRR loci, custom models will be added into future versions of simAIRR when large-scale experimental data for other loci becomes available. For now users are encouraged to generate their own custom models of realistic receptor sharing for other loci (if sufficiently large data available to them). Note that, based on our analyses, a sample size of 200 (100 each of cases and controls) approximates well the realistic receptor sequence sharing.

  • Through usecases_simairr python package, we provide console scripts that can be run through command line to learn custom models for realistic receptor sequence sharing. Each console script requires some input parameters from the users. To know which input arguments are required for each console script, we recommend using the --help argument of each console script. For instance, concat_airr --help.

  • Note that the following console scripts are expected to be run in a sequence.

Command

Description

concat_airr

Concatenate the repertoire files into two separate files that will be needed by compAIRR tool to count the frequencies of each unique clones

run_compairr_seqcounts

Run compAIRR tool to count the frequencies of each unique clones.

compute_pgen_public

Compute the generation probabilities of each sequence using OLGA tool.

compute_pval

Computes p-values as the probability of observing a public sequence in the same or higher number of repertoires as it was observed in for a given dataset.

concat_pdata

Concatenate the files generated through previous step to produce one large file that will be used in the next step

gen_pgen_count_map

Generate pgen_count_map files in the required format desired by simAIRR