Snekmer Demo

In this notebook, we will demonstrate how to apply Snekmer toward the analysis of protein sequences.

Getting Started

Workflow

Snekmer proceeds through a defined workflow executed as individual steps on Snakemake. Two operation modes are available: model (supervised machine learning) and cluster (unsupervised clustering). The user should select the mode that best suits their individual use case.

|6eda3ed6f3b5486d963d4e12aa1fe972|

Notes on Using Snekmer

Snekmer assumes that the user will primarily process input files using the command line. For more detailed instructions, refer to the documentation.

The basic process for running Snekmer is as follows:

  1. Verify that your file directory structure is correct and that the top-level directory contains a config.yaml file.

    • A template configuration file is included in the Snekmer code repository here.

  2. Modify config.yaml as needed.

  3. Use the command line to navigate to the directory containing both the config.yaml file and input directory.

  4. Run snekmer cluster, snekmer model, or snekmer search.

Depending on the selected operation mode, output files will vary.

The process detailed above is handled by the included tutorial. We will use this demo to break down the process followed by the tutorial and understand the output files created in the process.

Running Snekmer

First, install Snekmer using the instructions in the user installation guide.

To ensure that the tutorial runs correctly, activate the conda environment containing your Snekmer installation and run the notebook from the environment.

Next, run the Snekmer tutorial. This runs all three Snekmer modes on the demo example files and produces all output files. The tutorial uses the included default configuration parameters to guide the analysis, but the user can modify these parameters if a different configuration set is desired. The tutorial command line instructions are copied below:

conda activate snekmer
cd resources/tutorial/demo_example
./run_demo.sh

Finally, we will initialize some parameters and parse filenames for this demo notebook.

[1]:
# imports
import glob
import os
import yaml
from itertools import product

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
[2]:
# load config file
with open(os.path.join("..", "..", "resources", "config.yaml"), "r") as configfile:
    config = yaml.safe_load(configfile)

print(config)
{'k': 14, 'alphabet': 0, 'input_file_exts': ['fasta', 'fna', 'faa', 'fa'], 'input_file_regex': '.*', 'nested_output': False, 'score': {'scaler': True, 'scaler_kwargs': {'n': 0.25}, 'labels': 'None', 'lname': 'None'}, 'cluster': {'method': 'agglomerative-jaccard', 'params': {'n_clusters': 'None', 'linkage': 'average', 'distance_threshold': 0.92, 'compute_full_tree': True}, 'cluster_plots': False, 'min_rep': None, 'max_rep': None, 'save_matrix': True, 'dist_thresh': 100}, 'model': {'cv': 5, 'random_state': 'None'}, 'model_dir': 'output/example-model/', 'basis_dir': 'output/example-model/', 'score_dir': 'output/example-model/'}
[3]:
filenames = sorted(
    [
        fa.rstrip(".gz")
        for fa, ext in product(
            glob.glob(os.path.join("demo_example", "input", "*")),
            config["input_file_exts"],
        )
        if fa.rstrip(".gz").endswith(f".{ext}")
    ]
)

families = sorted([os.path.splitext(os.path.basename(f))[0] for f in filenames])

print(families)
['TIGR03149', 'nirS', 'nxrA']

Snekmer Cluster Mode

Results and output files from Snekmer’s clustering mode can be found in the cluster directory.

Overall results are contained within snekmer.csv. To see the results, load and parse the file:

[4]:
# read cluster results
results = pd.read_csv(os.path.join("demo_example", "output", "cluster", "snekmer.csv"))
results = results.sort_values(by="cluster").reset_index(drop=True)
results["cluster"] = results["cluster"].astype(str)
results
[4]:
filename sequence_id sequence_length background cluster
0 nxrA WP_012964344.1 1154 False 0
1 nxrA WP_013249767.1 1147 False 0
2 nxrA WP_080885705.1 1148 False 0
3 nxrA WP_013249749.1 1146 False 0
4 nxrA WP_053381689.1 1145 False 0
... ... ... ... ... ...
96 TIGR03149 NP_418496.1 223 False 2
97 TIGR03149 NP_313081.1 223 False 2
98 TIGR03149 WP_065430736.1 231 False 2
99 TIGR03149 WP_026210921.1 225 False 2
100 TIGR03149 WP_017422323.1 228 False 2

101 rows × 5 columns

As seen above, the results table summarizes the input sequences as well as the cluster assignment determined by Snekmer. In the above table, each row corresponds to an individual sequence, and the columns summarize various aspects per sequence:

Column Name

Description

filename

Name of file containing given sequence

sequence_id

Sequence ID, as taken from FASTA file

sequence_length

Length of sequence (i.e. number of characters long the sequence is)

background

True if sequence has been labeled as a background sequence; False otherwise

cluster

Numerical cluster assignment, given as an integer between 0 and n, where n is the number of clusters determined by Snekmer

Recall that the demo files consist of 3 sets of sequences previously annotated to 3 individual protein families. To view the cluster assignments given by the Snekmer clusters (see cluster column), we can generate a summary plot of how cluster assignments are distributed across sequences of known annotation.

[5]:
fig, ax = plt.subplots(dpi=150)
g = sns.histplot(x="cluster", hue="filename", multiple="stack", data=results, ax=ax)

ax.set_xlabel("Cluster")
ax.set_title("Snekmer Cluster Assignments vs. Family Label")
g.legend_.set_title("Family Label")
../_images/tutorial_snekmer_demo_12_0.png

In this case, the true family labels are well-resolved by Snekmer’s clustering algorithm. This result arises from the relative separation between sequences in each of these 3 families in terms of their amino acid recoding (AAR)-kmer profile. This will not always be the case for a given set of input sequences.

Snekmer Model Mode

We also train a kmer-based scoring model for the family of interest based on the prevalence of kmers in each family. The scoring model is then used to develop in-family vs. out-of-family classification models. Snekmer scoring output files can be found in the score directory, and model output files can be found in the model directory.

Kmer Probability Scores for Family Assignment

For simplicity, below we will highlight one particular family chosen at random and display the kmer probability scores for assignment into the example family. However, Snekmer generates individual scoring models for each family, so feel free to explore the other family probability scores in the same manner.

[6]:
example_family = np.random.choice(families)
print("example family:\t", example_family)

# show relevant subset of columns
example_score_output = pd.read_csv(
    os.path.join("demo_example", "output", "scoring", "sequences", f"{example_family}.csv.gz")
)[["filename", "sequence_id", "label", f"{example_family}_score"]]
example_score_output
example family:  TIGR03149
[6]:
filename sequence_id label TIGR03149_score
0 nxrA WP_013249767.1 nxrA 0.102714
1 nxrA WP_080885705.1 nxrA 0.027030
2 nxrA WP_013249749.1 nxrA 0.107812
3 nxrA WP_053381689.1 nxrA 0.048270
4 nxrA WP_080885591.1 nxrA 0.060155
... ... ... ... ...
96 nirS WP_011383805.1 nirS 0.044706
97 nirS WP_049724801.1 nirS 0.042535
98 nirS WP_041099757.1 nirS 0.051193
99 nirS WP_015258444.1 nirS 0.071264
100 nirS WP_014238329.1 nirS 0.064080

101 rows × 4 columns

[7]:
# first 10 probability scores for in-family kmers
example_score_output[example_score_output["label"] == example_family].head(10)
[7]:
filename sequence_id label TIGR03149_score
13 TIGR03149 WP_017422323.1 TIGR03149 0.625590
14 TIGR03149 YP_204935.1 TIGR03149 0.596738
15 TIGR03149 WP_005721361.1 TIGR03149 0.760056
16 TIGR03149 NP_798306.1 TIGR03149 0.465294
17 TIGR03149 NP_439225.1 TIGR03149 0.660117
18 TIGR03149 WP_045102325.1 TIGR03149 0.565506
19 TIGR03149 WP_017820467.1 TIGR03149 0.404533
20 TIGR03149 WP_010944374.1 TIGR03149 0.649658
21 TIGR03149 WP_014992036.1 TIGR03149 0.657690
22 TIGR03149 WP_011218001.1 TIGR03149 0.478748
[8]:
# first 10 probability scores for out-of-family kmers
example_score_output[example_score_output["label"] != example_family].head(10)
[8]:
filename sequence_id label TIGR03149_score
0 nxrA WP_013249767.1 nxrA 0.102714
1 nxrA WP_080885705.1 nxrA 0.027030
2 nxrA WP_013249749.1 nxrA 0.107812
3 nxrA WP_053381689.1 nxrA 0.048270
4 nxrA WP_080885591.1 nxrA 0.060155
5 nxrA WP_053381686.1 nxrA 0.016777
6 nxrA WP_053378142.1 nxrA 0.016095
7 nxrA WP_053381277.1 nxrA 0.009149
8 nxrA WP_080886776.1 nxrA 0.096445
9 nxrA WP_053381280.1 nxrA 0.018812

Column descriptions are as follows:

Column Name

Description

filename

Name of file containing given sequence

sequence_id

Sequence ID, as taken from FASTA file

label

Label (i.e., family assignment) of the given sequence

<FAMILY>_score

Score for the assignment of the sequence into the given family. M inimum possible score is -1; maximum possible score is 1.0.

Perhaps unsurprisingly, the kmer probability scores for assignment into the example family differ significantly for sequences that are in-family vs. out-of-family.

To better illustrate this point, we can generate boxplots of the probability score distributions per family.

[9]:
for fam in families:
    example_score_output = pd.read_csv(
        os.path.join("demo_example", "output", "scoring", "sequences", f"{fam}.csv.gz")
    )[["filename", "sequence_id", "label", f"{fam}_score"]]

    plt.figure(figsize=(6, 2), dpi=150)
    ax = sns.boxplot(x="label", y=f"{fam}_score", data=example_score_output)

    ax.set_title(f"Distribution of {fam} Scores for Sequences in Different Families")
    ax.set_xlabel("Family")
    ax.set_ylabel(f"{fam} Score")
../_images/tutorial_snekmer_demo_21_0.png
../_images/tutorial_snekmer_demo_21_1.png
../_images/tutorial_snekmer_demo_21_2.png

We can assess how well the scores perform using the weights determined for each family. Note that it is immediately obvious that each of the family scoring methods performs well in identifying in-family sequences versus out-of-family sequences.

Not all family scoring methods perform equally well in terms of differentation between sequences belonging to different families. Differing scorer performances can be attributed to a variety of factors, e.g. parameters such as the alphabet and k, existing levels of similarity between sequences in different families, etc. The specificity of a given family kmer probability scorer is also impacted by (a) the selected family sequence set, and (b) the sequence set of the out-of-family kmers. In other words, as in other machine learning problems, the curation of the training set will strongly impact the resulting model.

The probabilities and scores assigned to each feature in the kmer set is also computed and output into a dataframe. The contents of one of these dataframes is as follows:

[10]:
print("example family:\t", example_family)
pd.read_csv(os.path.join("demo_example", "output", "scoring", "weights", f"{example_family}.csv.gz"))
example family:  TIGR03149
[10]:
kmer sample background
0 VSSSSSSSVVVVVV 0.179487 NaN
1 SSSSSSSVVVVVVV 0.179487 NaN
2 SSSSSSVVVVVVVV 0.179487 NaN
3 SSSSSVVVVVVVVV 0.051282 NaN
4 SSSSVVVVVVVVVS 0.015437 NaN
... ... ... ...
1672 VVVSSSVSVSSSSS 0.025641 NaN
1673 VVSVSVSSSSSSVV 0.025641 NaN
1674 SVSSSSSSVVSSVV 0.025641 NaN
1675 VSSSSSSVVSSVVS -0.012821 NaN
1676 SSSSSSVVSSVVSS 0.025641 NaN

1677 rows × 3 columns

In this view, the relative weight of a given AAR-kmer to a given family assignment is summarized. Each column displays the following information:

Column Name

Description

kmer

Amino acid reduced (AAR) kmer

sample

Weight of the AAR-kmer toward family assignment based on prevalence in sample sequences

background

Weight of the AAR-kmer toward family assignment based on prevalence in sample sequences; np.nan if no background sequences are given

Note that this example does not include any sequences specified as background sequences, and thus the background column does not contribute additional weights in scoring.

The model objects can be loaded (via snekmer.io.load_pickle) and then applied elsewhere, e.g. to a new set of unknown sequences.

Snekmer Search Mode

Say a user trains the four models above, and would then like to score and evaluate sequences with unknown family assignments. The user can use snekmer search, which uses the kmer basis set for the desired family to create kmer vectors for unknown sequences, then apply the family scorer to the vectorized unknown sequences, and finally use the model to predict family assignments for the unknown sequences.

The results from snekmer search on the demo files are shown below:

[11]:
print("example family:\t", example_family)
pd.read_csv(
    os.path.join("demo_example", "output", "search", example_family, f"{example_family}.csv")
)
example family:  TIGR03149
[11]:
filename sequence_id sequence_length score in_family probability model
0 TIGR03149.faa WP_017422323.1 228 0.618018 True 0.805785 TIGR03149.model
1 TIGR03149.faa YP_204935.1 228 0.564928 True 0.762935 TIGR03149.model
2 TIGR03149.faa WP_005721361.1 226 0.727920 True 0.875304 TIGR03149.model
3 TIGR03149.faa NP_798306.1 228 0.467964 True 0.669274 TIGR03149.model
4 TIGR03149.faa NP_439225.1 225 0.591324 True 0.785015 TIGR03149.model
5 TIGR03149.faa WP_045102325.1 228 0.568916 True 0.766369 TIGR03149.model
6 TIGR03149.faa WP_017820467.1 228 0.368263 True 0.556724 TIGR03149.model
7 TIGR03149.faa WP_010944374.1 225 0.655749 True 0.832490 TIGR03149.model
8 TIGR03149.faa WP_014992036.1 225 0.639143 True 0.821116 TIGR03149.model
9 TIGR03149.faa WP_011218001.1 230 0.370323 True 0.559154 TIGR03149.model
10 TIGR03149.faa WP_004744440.1 228 0.589265 True 0.783347 TIGR03149.model
11 TIGR03149.faa WP_005542788.1 225 0.616877 True 0.804929 TIGR03149.model
12 TIGR03149.faa WP_012550556.1 228 0.552373 True 0.751899 TIGR03149.model
13 TIGR03149.faa WP_038513055.1 229 0.579075 True 0.774958 TIGR03149.model
14 TIGR03149.faa WP_012604317.1 228 0.707271 True 0.864114 TIGR03149.model
15 TIGR03149.faa WP_006249704.1 225 0.666339 True 0.839438 TIGR03149.model
16 TIGR03149.faa WP_044623667.1 230 0.383099 True 0.574164 TIGR03149.model
17 TIGR03149.faa WP_011200980.1 226 0.583368 True 0.778520 TIGR03149.model
18 TIGR03149.faa WP_012072459.1 226 0.583059 True 0.778265 TIGR03149.model
19 TIGR03149.faa WP_026210921.1 225 0.621285 True 0.808220 TIGR03149.model
20 TIGR03149.faa WP_017027946.1 228 0.320655 True 0.500022 TIGR03149.model
21 TIGR03149.faa WP_065430736.1 231 0.537383 True 0.738279 TIGR03149.model
22 TIGR03149.faa NP_313081.1 223 1.013046 True 0.964869 TIGR03149.model
23 TIGR03149.faa NP_418496.1 223 1.013046 True 0.964869 TIGR03149.model
24 TIGR03149.faa NP_709848.1 223 1.013046 True 0.964869 TIGR03149.model
25 TIGR03149.faa YP_006781347.1 223 1.013046 True 0.964869 TIGR03149.model
26 TIGR03149.faa YP_002410367.1 223 1.013046 True 0.964869 TIGR03149.model
27 TIGR03149.faa YP_006122421.1 223 1.013046 True 0.964869 TIGR03149.model
28 TIGR03149.faa WP_005712433.1 225 0.620002 True 0.807266 TIGR03149.model
29 TIGR03149.faa WP_038636309.1 223 0.986725 True 0.960341 TIGR03149.model
30 TIGR03149.faa NP_463144.1 223 0.947326 True 0.952504 TIGR03149.model
31 TIGR03149.faa WP_047782341.1 223 0.423274 True 0.620361 TIGR03149.model
32 TIGR03149.faa WP_004922233.1 223 0.653444 True 0.830947 TIGR03149.model
33 TIGR03149.faa NP_458577.1 223 0.884588 True 0.936925 TIGR03149.model
34 TIGR03149.faa WP_015840643.1 223 0.770016 True 0.895678 TIGR03149.model
35 TIGR03149.faa WP_024486115.1 223 0.525474 True 0.727121 TIGR03149.model
36 TIGR03149.faa WP_043081751.1 223 0.647998 True 0.827255 TIGR03149.model
37 TIGR03149.faa WP_015431895.1 228 0.377642 True 0.567767 TIGR03149.model
38 TIGR03149.faa YP_856980.1 230 0.526662 True 0.728247 TIGR03149.model

Each column summarizes the following information for a given sequence:

Column Name

Description

filename

Name of file containing given sequence

sequence_id

Sequence ID, as taken from FASTA file

sequence_length

Length of sequence (i.e. number of characters long the sequence is)

score

Score for the assignment of the sequence into the desired family based on the family model. Scores are summed from kmer contributions using the family scoring model

in_family

True if in-family assignment is predicted (probability > 0.5); False otherwise.

probability

Probability of in-family assignment. 1.0 = maximum probability for a given sequence’s assignment into a family; 0.0 = lowest probability; 0.5 = equal probability of in-family vs. out-of-family assignment

model

Name of model used to evaluate the sequence

In this example application, the in_family and probability columns are of particular importance to the user. The in_family column tells the user whether a particular input sequence is given an in-family assignment based on the machine learning model which has been previously trained, and the probability informs the user of the in-family assignment probability determined by the same model.

In the case of the demo files, since all of the sequences above belong to the given family, and by design of the demo were also the families used to train the family model, the sequences all have relatively high probabilities and consistently do get assigned to the protein family. However, in this demonstration, we hope it is apparent to the user how trained protein family models can be applied to a new sequence set and the resulting snekmer search results evaluated.

Conclusion

In conclusion, we have demonstrated Snekmer’s utility across 3 potential operating models (cluster, model, and search). We hope this tutorial provides useful information to users regarding applying Snekmer to their own sequence sets and interpreting results from Snekmer.