AIMS Basics

In this section, we focus on the absolute basics you need to know to run AIMS: the Input Formatting of the files used for the analysis, the Core Functionalities of the software, and the Biophysical Properties that are the central pillars of the analysis. This information is key whether you’re using the GUI, the CLI, or the Jupyter notebook. If you have fundamental questions about the meaning of a given output, or need to troubleshoot inputs, the answers should be here. Users can also check out the Testing section to learn how to download example data and compare its formatting against their own files.

Input Formatting

At present, the AIMS software requires specific input formatting. Future versions of the software will hopefully have more relaxed formatting requirements, assuming users request such functionality. Here we go into specifics for each molecular species, but the general format maintained for all species is a comma separated value (CSV) file with a header in the first row and no associated metadata in the file.

Immunoglobulin (TCR/Ab) Formatting

The immunoglobulin (Ig) input file formatting is the most flexible of the three, with options meant to satisfy the needs of any given AIMS input data. Specifically these data must contain only the complementarity determining region (CDR) loops, as AIMS excludes the framework regions of antibodies and TCRs in the analysis. An example of the proper input formatting for the first few lines of an input file can be seen below:

[Example file from AIMS/app/ab_testData/flu_poly.csv]
l1,l2,l3,h1,h2,h3
QSISSY,DAS,QHRSTWPPN,GGTFSSRA,IIPIFNTP,AREMATIFGRMDV
ESLLHSDGKTY,EVS,MQTIQLPGT,GGIMRRNG,IIAIFGTP,VASSGYHLHRETWGY
QDIKNY,HVS,HQCYNLPYT,GFIFGHFA,ISGGGLNT,ARFDSSGYNYVRGMVV
.
.
.

The key features are:

1. The general format must follow that of a comma separated value (csv) file. Each row represents a unique sequence, each column represents a given CDR loop, and columns are separated by commas.
2. Each column must have a header with no sequence information in it. The contents of this header are not critical, as the standard AIMS Ig file loader disregards the header and replaces it. Descriptive headers are still preferred, if only for downstream transparency of the data.
3. Single letter amino acid codes should be used, with only capitalized letters. The function “aimsLoad.convert_3let” can convert three letter codes to single letter codes (see the API section for more details, once that section is available).
4. The sequences must not contain any extraneous characters or spaces. Currently AIMS can identify an “X” in a sequence and by default removes such sequences. Any other non-amino acid single letter character will result in an error, and spaces will be encoded into the AIMS matrix, which could confound analysis.

At present, AIMS does NOT identify other issues with sequences. Missing CDR loops in a sequence, spaces included in a sequence, or other miscellaneous mistakes will not result in an error, and could lead to inaccuracies in the downstream analysis. Either visual inspection or other quality control steps should be taken to ensure proper analysis. If analyzing multiple files at once, all input files must have the same number of loops.
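Since AIMS does not catch these issues itself, a small pre-check before loading can help. The following is a minimal sketch in plain Python and is not part of AIMS: the function name check_ig_rows and the problem labels are hypothetical, and the second data row below has been deliberately corrupted (an empty loop and a stray space) for illustration.

```python
import csv
import io

# The 20 standard amino acids as capitalized single letter codes.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def check_ig_rows(handle):
    """Return (file_line, column_name, problem) tuples for suspicious entries."""
    reader = csv.reader(handle)
    header = next(reader)  # first row is the header, not a sequence
    problems = []
    # start=2 so the reported number matches the line number in the file
    for line_num, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append((line_num, None, "wrong number of loops"))
            continue
        for name, loop in zip(header, row):
            if loop == "":
                problems.append((line_num, name, "missing loop"))
            elif " " in loop:
                problems.append((line_num, name, "contains a space"))
            elif "X" in loop:
                problems.append((line_num, name, "contains X"))
            elif not set(loop) <= VALID_AA:
                problems.append((line_num, name, "non-amino-acid or lowercase character"))
    return problems

example = io.StringIO(
    "l1,l2,l3,h1,h2,h3\n"
    "QSISSY,DAS,QHRSTWPPN,GGTFSSRA,IIPIFNTP,AREMATIFGRMDV\n"
    "QDIKNY,,HQCYNLPYT,GFIFGHFA,ISGG GLNT,ARFDSSGYNYVRGMVV\n"
)
problems = check_ig_rows(example)
for problem in problems:
    print(problem)
```

To check a real file, pass an open file handle instead of the in-memory example, for instance `open("my_input.csv", newline="")` (the file name is a placeholder).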

Note

Currently, the AIMS GUI cannot analyze TCR/Ab inputs of 4 or 5 loops. Most repertoire analyses are not expected to need such a dataset. Users who do need this analysis should submit an issue on GitHub.

MHC/MHC-Like Formatting

The current MHC/MHC-like analysis has the most rigid and restrictive input formatting of all, but this will hopefully change in future updates. Users can input either entire MHC sequences or just the platform domain sequences. Additionally, users can leverage these restrictive requirements to analyze other molecular species if needed. The initial input is a simple aligned FASTA, as seen below:

[Example file from AIMS/app/mhc_testData/hlaA_seqs.fasta]
>3VJ6_A Chain A, H-2 Class I Histocompatibility Antigen, D-37 Alpha Chain [Mus musculus]
------------------------------------------------------------
------------------------------------------------------------
-------------------------------------------------SPHSLRYFTTA
VSRPGLGEPRFIIVGYVDDTQFVRFDSDAENPRMEPRARWIEQEGPEYWERETWKAR
DMGRNFRVNLRTLLGYYNQSNDESHTLQWMYGCDVGPDGRLLRGYCQEAYDGQDYISLNE
DLRSWTANDIASQISKHKSEAVDEAH-QQRAYLQGPCVEWLHRYLRLGNETLQRSDPPKA
HVTHHPRSEDEVTLRCWALGFYPADITLTWQLNGEELTQDMELVETRPAGDGTFQKWAAV
VVPLGKEQYYTCHVYHEGLPEPLTLRWEPP------------------------------
-------------------------------------------------
>5VCL_A Chain A, H2-t23 Protein [Mus musculus]
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------MSSHSLRYFHTA
.
.
.
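Because the FASTA is expected to be aligned already, one quick sanity check is that every entry has the same length once its wrapped lines are joined. The sketch below is plain Python, not AIMS code: the helper names read_fasta and is_aligned are hypothetical, and the three toy entries are invented for illustration. Equal lengths are necessary for a valid alignment but do not prove the alignment is a good one.

```python
import io

def read_fasta(handle):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def is_aligned(records):
    """True when all sequences (gaps included) share one length."""
    return len({len(seq) for seq in records.values()}) <= 1

# Toy alignment: seqC is two characters shorter than the other entries.
toy = io.StringIO(
    ">seqA\n--MSSHSL\nRYF-\n"
    ">seqB\n---SPHSL\nRYFT\n"
    ">seqC\n--MSSHSLRY\n"
)
records = read_fasta(toy)
lengths = {name: len(seq) for name, seq in records.items()}
print(lengths)
print(is_aligned(records))
```

If the check fails, the lengths dictionary shows which entries differ, which is usually enough to spot a truncated or unaligned sequence.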

Importantly, each FASTA entry must be pre-aligned using BLAST or similar alignment software. AIMS does not align the sequences internally, and expects the inputs to be structurally very similar. For MHC and MHC-like molecules, this requirement is satisfied. If a subset of sequences aligns poorly for some reason, those sequences can be included as a separate file. Each individual file will have its own user-specified region of the alignment that will ultimately be input into the analysis. This region can be specified on-the-fly, or input as a separate file formatted as follows:

[Example file from aims_immune/app_data/test_data/mhcs/ex_cd1_hla_uda_uaa.csv]
Name,S1s,S1e/H1s,H1e/S2s,S2e/H2s,H2e
cd1,124,167,209,262,303
hla,170,210,260,306,348
uda,2,49,93,152,193
uaa,2,49,93,152,193

The above file is again formatted as a comma separated value (csv) file, with the first column giving the name of the dataset and the remaining columns identifying the start and end points of four distinct structural features in the provided FASTA alignment. Specifically for the analysis of MHC and MHC-like molecules, these four structural features are the beta-strand of the alpha 1 domain, the alpha helix of the alpha 1 domain, the beta-strand of the alpha 2 domain, and the alpha helix of the alpha 2 domain. Each number represents the start of one structural feature, the end of another, or both. In the example file, for the hla alignment (corresponding to the FASTA above), the first beta strand starts at alignment position 170 and ends at position 210. Likewise, the first alpha helix starts at position 210 and ends at position 260, and so on.
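To make the shared boundaries concrete, here is a minimal sketch of reading such a file and cutting one aligned sequence into its four features. It is not the AIMS implementation: the names read_boundaries, split_features, and FEATURES are hypothetical, and the sketch assumes the numbers can be used directly as Python slice indices (zero-based, end-exclusive), which may differ from the convention AIMS uses. The toy sequence is invented.

```python
import csv
import io

# Hypothetical labels for the four features, in file order.
FEATURES = ("strand1", "helix1", "strand2", "helix2")

def read_boundaries(handle):
    """Map each dataset name to its list of five boundary positions."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return {row[0]: [int(value) for value in row[1:]] for row in reader if row}

def split_features(aligned_seq, bounds):
    """Cut one aligned sequence at consecutive boundaries.

    Each interior boundary ends one feature and starts the next.
    """
    return {
        name: aligned_seq[start:end]
        for name, start, end in zip(FEATURES, bounds[:-1], bounds[1:])
    }

example_csv = io.StringIO(
    "Name,S1s,S1e/H1s,H1e/S2s,S2e/H2s,H2e\n"
    "cd1,124,167,209,262,303\n"
    "hla,170,210,260,306,348\n"
)
bounds = read_boundaries(example_csv)
print(bounds["hla"])

# Toy aligned sequence with four 4-residue features flanked by gaps.
toy_seq = "--AAAABBBBCCCCDDDD--"
pieces = split_features(toy_seq, [2, 6, 10, 14, 18])
print(pieces)
```

Five numbers are enough for four features precisely because adjacent features share a boundary, which is why the header uses combined labels such as S1e/H1s.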

Currently, the Phyre server (http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index) is recommended to identify these structural features. Other software may be used to identify the key structural features for analysis, but the numbering provided in standard Phyre outputs makes translation to the above csv file easy. Generally, only one sequence should need to be used as input, as structural similarity is a requirement for comparable analysis using AIMS. Users can take advantage of this ambiguity in the software to analyze any four connected structural features in evolutionarily and structurally related molecules in the AIMS GUI. Users comfortable with the Jupyter Notebooks can instead follow the Multi-Sequence Alignment formatting instructions.

Immunopeptidomics Formatting

This is the first of two sections that are not yet implemented in the AIMS GUI, but can be analyzed using the AIMS notebook or CLI. Specifically, AIMS can be used to analyze immunopeptidomics data. Again the input is simply a comma separated value (csv) formatted file. However, since the input should only have one column, the precise format is a little less important. An example can be seen below:

[Example file from aims_immune/app_data/test_data/peptides/pancreas_hla_atlas.csv]
sequence
ALVSGNNTVPF
TYRGVDLDQLL
NYIDIVKYV
SYIPIFPQ
NYFPGGVALI
.
.
.

The example data are provided by the HLA Ligand Atlas (https://hla-ligand-atlas.org/welcome). In future releases, data related to the mass spectrometry approaches used to identify these peptides will be included in the analysis. Metadata can be included in additional columns of a separate csv.

Multi-Sequence Alignment Formatting

Again, this multi-sequence alignment input is not yet available in the AIMS GUI, but is available in the notebook and CLI. As it turns out, the same file formatting that is used for loading MHC molecules works for the more general MSA input. The difference is that the old MHC module (and by extension, the GUI) required a subset of the MSA to be selected. This step is now optional, so if you’d like to pre-select certain regions of an MSA or input the entire MSA, you can do so! Careful though, very large sequences will likely process quite slowly in AIMS.

Core Functionalities

Functionalities coming soon!

Biophysical Properties

In generating the core biophysical property matrix of the AIMS analysis, the same 61 biophysical properties are used in all analyses, with an option to use fewer if the user decides to. The properties are listed in the table below:

Table of AIMS Biophysical Properties

======  =======================  ===================================================
Number  Property [Shorthand]     Description
======  =======================  ===================================================
0       Hydrophobicity1 [Phob1]  Hydrophobicity Scale [-1,1]
1       Charge [Charge]          Charge [ec]
2       Hydrophobicity2 [Phob2]  Octanol-Interface Hydrophobicity Scale
3       Bulkiness [Bulk]         Side-Chain Bulkiness
4       Flexibility [Flex]       Side-Chain Flexibility
5       Kidera 1 [KD1]           Helix/Bend Preference
6       Kidera 2 [KD2]           Side-Chain Size
7       Kidera 3 [KD3]           Extended Structure Preference
8       Kidera 4 [KD4]           Hydrophobicity
9       Kidera 5 [KD5]           Double-bend Preference
10      Kidera 6 [KD6]           Flat Extended Preference
11      Kidera 7 [KD7]           Partial Specific Volume
12      Kidera 8 [KD8]           Occurrence in alpha-region
13      Kidera 9 [KD9]           pK-C
14      Kidera 10 [KD10]         Surrounding Hydrophobicity
15      Hotspot 1 [HS1]          Normalized Positional Residue Freq at Helix C-term
16      Hotspot 2 [HS2]          Normalized Positional Residue Freq at Helix C4-term
17      Hotspot 3 [HS3]          Spin-spin coupling constants
18      Hotspot 4 [HS4]          Random Parameter
19      Hotspot 5 [HS5]          pK-N
20      Hotspot 6 [HS6]          Alpha-Helix Indices for Beta-Proteins
21      Hotspot 7 [HS7]          Linker Propensity from 2-Linker Dataset
22      Hotspot 8 [HS8]          Linker Propensity from Long Dataset
23      Hotspot 9 [HS9]          Normalized Relative Freq of Helix End
24      Hotspot 10 [HS10]        Normalized Relative Freq of Double Bend
25      Hotspot 11 [HS11]        pK-COOH
26      Hotspot 12 [HS12]        Relative Mutability
27      Hotspot 13 [HS13]        Kerr-Constant Increments
28      Hotspot 14 [HS14]        Net Charge
29      Hotspot 15 [HS15]        Norm Freq Zeta-R
30      Hotspot 16 [HS16]        Hydropathy Scale
31      Hotspot 17 [HS17]        Ratio of Average Computed Composition
32      Hotspot 18 [HS18]        Intercept in Regression Analysis
33      Hotspot 19 [HS19]        Correlation coefficient in Reg Anal
34      Hotspot 20 [HS20]        Weights for Alpha-Helix at window pos
35      Hotspot 21 [HS21]        Weights for Beta-sheet at window pos -3
36      Hotspot 22 [HS22]        Weights for Beta-sheet at window pos 3
37      Hotspot 23 [HS23]        Weights for coil at win pos -5
38      Hotspot 24 [HS24]        Weights coil win pos -4
39      Hotspot 25 [HS25]        Weights coil win pos 6
40      Hotspot 26 [HS26]        Avg Rel Frac occur in AL
41      Hotspot 27 [HS27]        Avg Rel Frac occur in EL
42      Hotspot 28 [HS28]        Avg Rel Frac occur in A0
43      Hotspot 29 [HS29]        Rel Pref at N
44      Hotspot 30 [HS30]        Rel Pref at N1
45      Hotspot 31 [HS31]        Rel Pref at N2
46      Hotspot 32 [HS32]        Rel Pref at C1
47      Hotspot 33 [HS33]        Rel Pref at C
48      Hotspot 34 [HS34]        Information measure for extended without H-bond
49      Hotspot 35 [HS35]        Information measure for C-term turn
50      Hotspot 36 [HS36]        Loss of SC hydropathy by helix formation
51      Hotspot 37 [HS37]        Principal Component 4 (Sneath 1966)
52      Hotspot 38 [HS38]        Zimm-Bragg Parameter
53      Hotspot 39 [HS39]        Normalized Freq of ZetaR
54      Hotspot 40 [HS40]        Rel Pop Conformational State A
55      Hotspot 41 [HS41]        Rel Pop Conformational State C
56      Hotspot 42 [HS42]        Electron-Ion Interaction Potential
57      Hotspot 43 [HS43]        Free energy change of epsI to epsEx
58      Hotspot 44 [HS44]        Free energy change of alphaRI to alphaRH
59      Hotspot 45 [HS45]        Hydrophobicity coeff
60      Hotspot 46 [HS46]        Principal Property Value z3 (Wold et al. 1987)
======  =======================  ===================================================

The so-called Kidera factors are from the published work:

Kidera et al. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. Journal of Protein Chemistry (1985)

While the hotspot variables mentioned above are from:

Liu et al. Hot spot prediction in protein-protein interactions by an ensemble system. BMC Systems Biology (2018)
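As a rough illustration of what a biophysical property matrix looks like, and of what using fewer properties means, the sketch below encodes a short loop with two stand-in properties and then keeps only one of them. Everything here is an assumption made for illustration: the values are the published Kyte-Doolittle hydropathy scale and a simplified formal charge, not the AIMS property values, and the function property_matrix, the `keep` argument, and the rows-as-properties layout are hypothetical rather than taken from AIMS.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified formal side-chain charge; unlisted residues are treated as neutral.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}

# Two toy properties standing in for the 61 listed in the table above.
PROPERTY_TABLES = [
    ("hydropathy", lambda aa: HYDROPATHY[aa]),
    ("charge", lambda aa: CHARGE.get(aa, 0.0)),
]

def property_matrix(sequence, keep=None):
    """Build a matrix with one row per property and one column per residue.

    keep: optional list of property indices, mimicking the option to use
    fewer properties; None keeps all of them.
    """
    indices = range(len(PROPERTY_TABLES)) if keep is None else keep
    return [[PROPERTY_TABLES[i][1](aa) for aa in sequence] for i in indices]

full = property_matrix("DAS")
charge_only = property_matrix("DAS", keep=[1])
print(full)
print(charge_only)
```

The index passed to `keep` plays the same role as the Number column of the table: it identifies which property rows are retained.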