Snekmer Demo
In this notebook, we will demonstrate how to apply Snekmer toward the analysis of protein sequences.
Getting Started
Workflow
Snekmer proceeds through a defined workflow executed as individual steps on Snakemake. Two operation modes are available: model
(supervised machine learning) and cluster
(unsupervised clustering). The user should select the mode that best suits their individual use case.
Notes on Using Snekmer
Snekmer assumes that the user will primarily process input files using the command line. For more detailed instructions, refer to the documentation.
The basic process for running Snekmer is as follows:
Verify that your file directory structure is correct and that the top-level directory contains a config.yaml file.
A template configuration file is included in the Snekmer code repository here.
Modify config.yaml as needed.
Use the command line to navigate to the directory containing both the config.yaml file and input directory.
Run
snekmer cluster
,snekmer model
, orsnekmer search
.
Depending on the selected operation mode, output files will vary.
The process detailed above is handled by the included tutorial. We will use this demo to break down the process followed by the tutorial and understand the output files created in the process.
Running Snekmer
First, install Snekmer using the instructions in the user installation guide.
To ensure that the tutorial runs correctly, activate the conda environment containing your Snekmer installation and run the notebook from the environment.
Next, run the Snekmer tutorial. This runs all three Snekmer modes on the demo example files and produces all output files. The tutorial uses the included default configuration parameters to guide the analysis, but the user can modify these parameters if a different configuration set is desired. The tutorial command line instructions are copied below:
conda activate snekmer
cd resources/tutorial/demo_example
./run_demo.sh
Finally, we will initialize some parameters and parse filenames for this demo notebook.
[1]:
# imports
import glob
import os
import yaml
from itertools import product
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
[2]:
# load config file
with open(os.path.join("..", "..", "resources", "config.yaml"), "r") as configfile:
config = yaml.safe_load(configfile)
print(config)
{'k': 14, 'alphabet': 0, 'input_file_exts': ['fasta', 'fna', 'faa', 'fa'], 'input_file_regex': '.*', 'nested_output': False, 'score': {'scaler': True, 'scaler_kwargs': {'n': 0.25}, 'labels': 'None', 'lname': 'None'}, 'cluster': {'method': 'agglomerative-jaccard', 'params': {'n_clusters': 'None', 'linkage': 'average', 'distance_threshold': 0.92, 'compute_full_tree': True}, 'cluster_plots': False, 'min_rep': None, 'max_rep': None, 'save_matrix': True, 'dist_thresh': 100}, 'model': {'cv': 5, 'random_state': 'None'}, 'model_dir': 'output/example-model/', 'basis_dir': 'output/example-model/', 'score_dir': 'output/example-model/'}
[3]:
filenames = sorted(
[
fa.rstrip(".gz")
for fa, ext in product(
glob.glob(os.path.join("demo_example", "input", "*")),
config["input_file_exts"],
)
if fa.rstrip(".gz").endswith(f".{ext}")
]
)
families = sorted([os.path.splitext(os.path.basename(f))[0] for f in filenames])
print(families)
['TIGR03149', 'nirS', 'nxrA']
Snekmer Cluster Mode
Results and output files from Snekmer’s clustering mode can be found in the cluster directory.
Overall results are contained within snekmer.csv. To see the results, load and parse the file:
[4]:
# read cluster results
results = pd.read_csv(os.path.join("demo_example", "output", "cluster", "snekmer.csv"))
results = results.sort_values(by="cluster").reset_index(drop=True)
results["cluster"] = results["cluster"].astype(str)
results
[4]:
filename | sequence_id | sequence_length | background | cluster | |
---|---|---|---|---|---|
0 | nxrA | WP_012964344.1 | 1154 | False | 0 |
1 | nxrA | WP_013249767.1 | 1147 | False | 0 |
2 | nxrA | WP_080885705.1 | 1148 | False | 0 |
3 | nxrA | WP_013249749.1 | 1146 | False | 0 |
4 | nxrA | WP_053381689.1 | 1145 | False | 0 |
... | ... | ... | ... | ... | ... |
96 | TIGR03149 | NP_418496.1 | 223 | False | 2 |
97 | TIGR03149 | NP_313081.1 | 223 | False | 2 |
98 | TIGR03149 | WP_065430736.1 | 231 | False | 2 |
99 | TIGR03149 | WP_026210921.1 | 225 | False | 2 |
100 | TIGR03149 | WP_017422323.1 | 228 | False | 2 |
101 rows × 5 columns
As seen above, the results table summarizes the input sequences as well as the cluster assignment determined by Snekmer. In the above table, each row corresponds to an individual sequence, and the columns summarize various aspects per sequence:
Column Name |
Description |
---|---|
filename |
Name of file containing given sequence |
sequence_id |
Sequence ID, as taken from FASTA file |
sequence_length |
Length of sequence (i.e. number of characters long the sequence is) |
background |
True if sequence has been labeled as a background sequence; False otherwise |
cluster |
Numerical cluster assignment, given as an integer between 0 and n, where n is the number of clusters determined by Snekmer |
Recall that the demo files consist of 3 sets of sequences previously annotated to 3 individual protein families. To view the cluster assignments given by the Snekmer clusters (see cluster
column), we can generate a summary plot of how cluster assignments are distributed across sequences of known annotation.
[5]:
fig, ax = plt.subplots(dpi=150)
g = sns.histplot(x="cluster", hue="filename", multiple="stack", data=results, ax=ax)
ax.set_xlabel("Cluster")
ax.set_title("Snekmer Cluster Assignments vs. Family Label")
g.legend_.set_title("Family Label")
In this case, the true family labels are well-resolved by Snekmer’s clustering algorithm. This result arises from the relative separation between sequences in each of these 3 families in terms of their amino acid recoding (AAR)-kmer profile. This will not always be the case for a given set of input sequences.
Snekmer Model Mode
We also train a kmer-based scoring model for the family of interest based on the prevalence of kmers in each family. The scoring model is then used to develop in-family vs. out-of-family classification models. Snekmer scoring output files can be found in the score directory, and model output files can be found in the model directory.
Kmer Probability Scores for Family Assignment
For simplicity, below we will highlight one particular family chosen at random and display the kmer probability scores for assignment into the example family. However, Snekmer generates individual scoring models for each family, so feel free to explore the other family probability scores in the same manner.
[6]:
example_family = np.random.choice(families)
print("example family:\t", example_family)
# show relevant subset of columns
example_score_output = pd.read_csv(
os.path.join("demo_example", "output", "scoring", "sequences", f"{example_family}.csv.gz")
)[["filename", "sequence_id", "label", f"{example_family}_score"]]
example_score_output
example family: TIGR03149
[6]:
filename | sequence_id | label | TIGR03149_score | |
---|---|---|---|---|
0 | nxrA | WP_013249767.1 | nxrA | 0.102714 |
1 | nxrA | WP_080885705.1 | nxrA | 0.027030 |
2 | nxrA | WP_013249749.1 | nxrA | 0.107812 |
3 | nxrA | WP_053381689.1 | nxrA | 0.048270 |
4 | nxrA | WP_080885591.1 | nxrA | 0.060155 |
... | ... | ... | ... | ... |
96 | nirS | WP_011383805.1 | nirS | 0.044706 |
97 | nirS | WP_049724801.1 | nirS | 0.042535 |
98 | nirS | WP_041099757.1 | nirS | 0.051193 |
99 | nirS | WP_015258444.1 | nirS | 0.071264 |
100 | nirS | WP_014238329.1 | nirS | 0.064080 |
101 rows × 4 columns
[7]:
# first 10 probability scores for in-family kmers
example_score_output[example_score_output["label"] == example_family].head(10)
[7]:
filename | sequence_id | label | TIGR03149_score | |
---|---|---|---|---|
13 | TIGR03149 | WP_017422323.1 | TIGR03149 | 0.625590 |
14 | TIGR03149 | YP_204935.1 | TIGR03149 | 0.596738 |
15 | TIGR03149 | WP_005721361.1 | TIGR03149 | 0.760056 |
16 | TIGR03149 | NP_798306.1 | TIGR03149 | 0.465294 |
17 | TIGR03149 | NP_439225.1 | TIGR03149 | 0.660117 |
18 | TIGR03149 | WP_045102325.1 | TIGR03149 | 0.565506 |
19 | TIGR03149 | WP_017820467.1 | TIGR03149 | 0.404533 |
20 | TIGR03149 | WP_010944374.1 | TIGR03149 | 0.649658 |
21 | TIGR03149 | WP_014992036.1 | TIGR03149 | 0.657690 |
22 | TIGR03149 | WP_011218001.1 | TIGR03149 | 0.478748 |
[8]:
# first 10 probability scores for out-of-family kmers
example_score_output[example_score_output["label"] != example_family].head(10)
[8]:
filename | sequence_id | label | TIGR03149_score | |
---|---|---|---|---|
0 | nxrA | WP_013249767.1 | nxrA | 0.102714 |
1 | nxrA | WP_080885705.1 | nxrA | 0.027030 |
2 | nxrA | WP_013249749.1 | nxrA | 0.107812 |
3 | nxrA | WP_053381689.1 | nxrA | 0.048270 |
4 | nxrA | WP_080885591.1 | nxrA | 0.060155 |
5 | nxrA | WP_053381686.1 | nxrA | 0.016777 |
6 | nxrA | WP_053378142.1 | nxrA | 0.016095 |
7 | nxrA | WP_053381277.1 | nxrA | 0.009149 |
8 | nxrA | WP_080886776.1 | nxrA | 0.096445 |
9 | nxrA | WP_053381280.1 | nxrA | 0.018812 |
Column descriptions are as follows:
Column Name |
Description |
---|---|
filename |
Name of file containing given sequence |
sequence_id |
Sequence ID, as taken from FASTA file |
label |
Label (i.e., family assignment) of the given sequence |
<FAMILY>_score |
Score for the assignment of the sequence into the given family. M inimum possible score is -1; maximum possible score is 1.0. |
Perhaps unsurprisingly, the kmer probability scores for assignment into the example family differ significantly for sequences that are in-family vs. out-of-family.
To better illustrate this point, we can generate boxplots of the probability score distributions per family.
[9]:
for fam in families:
example_score_output = pd.read_csv(
os.path.join("demo_example", "output", "scoring", "sequences", f"{fam}.csv.gz")
)[["filename", "sequence_id", "label", f"{fam}_score"]]
plt.figure(figsize=(6, 2), dpi=150)
ax = sns.boxplot(x="label", y=f"{fam}_score", data=example_score_output)
ax.set_title(f"Distribution of {fam} Scores for Sequences in Different Families")
ax.set_xlabel("Family")
ax.set_ylabel(f"{fam} Score")
We can assess how well the scores perform using the weights determined for each family. Note that it is immediately obvious that each of the family scoring methods performs well in identifying in-family sequences versus out-of-family sequences.
Not all family scoring methods perform equally well in terms of differentation between sequences belonging to different families. Differing scorer performances can be attributed to a variety of factors, e.g. parameters such as the alphabet and k, existing levels of similarity between sequences in different families, etc. The specificity of a given family kmer probability scorer is also impacted by (a) the selected family sequence set, and (b) the sequence set of the out-of-family kmers. In other words, as in other machine learning problems, the curation of the training set will strongly impact the resulting model.
The probabilities and scores assigned to each feature in the kmer set is also computed and output into a dataframe. The contents of one of these dataframes is as follows:
[10]:
print("example family:\t", example_family)
pd.read_csv(os.path.join("demo_example", "output", "scoring", "weights", f"{example_family}.csv.gz"))
example family: TIGR03149
[10]:
kmer | sample | background | |
---|---|---|---|
0 | VSSSSSSSVVVVVV | 0.179487 | NaN |
1 | SSSSSSSVVVVVVV | 0.179487 | NaN |
2 | SSSSSSVVVVVVVV | 0.179487 | NaN |
3 | SSSSSVVVVVVVVV | 0.051282 | NaN |
4 | SSSSVVVVVVVVVS | 0.015437 | NaN |
... | ... | ... | ... |
1672 | VVVSSSVSVSSSSS | 0.025641 | NaN |
1673 | VVSVSVSSSSSSVV | 0.025641 | NaN |
1674 | SVSSSSSSVVSSVV | 0.025641 | NaN |
1675 | VSSSSSSVVSSVVS | -0.012821 | NaN |
1676 | SSSSSSVVSSVVSS | 0.025641 | NaN |
1677 rows × 3 columns
In this view, the relative weight of a given AAR-kmer to a given family assignment is summarized. Each column displays the following information:
Column Name |
Description |
---|---|
kmer |
Amino acid reduced (AAR) kmer |
sample |
Weight of the AAR-kmer toward family assignment based on prevalence in sample sequences |
background |
Weight of the AAR-kmer toward family assignment based on prevalence in sample sequences; |
Note that this example does not include any sequences specified as background sequences, and thus the background
column does not contribute additional weights in scoring.
The model objects can be loaded (via snekmer.io.load_pickle
) and then applied elsewhere, e.g. to a new set of unknown sequences.
Snekmer Search Mode
Say a user trains the four models above, and would then like to score and evaluate sequences with unknown family assignments. The user can use snekmer search
, which uses the kmer basis set for the desired family to create kmer vectors for unknown sequences, then apply the family scorer to the vectorized unknown sequences, and finally use the model to predict family assignments for the unknown sequences.
The results from snekmer search
on the demo files are shown below:
[11]:
print("example family:\t", example_family)
pd.read_csv(
os.path.join("demo_example", "output", "search", example_family, f"{example_family}.csv")
)
example family: TIGR03149
[11]:
filename | sequence_id | sequence_length | score | in_family | probability | model | |
---|---|---|---|---|---|---|---|
0 | TIGR03149.faa | WP_017422323.1 | 228 | 0.618018 | True | 0.805785 | TIGR03149.model |
1 | TIGR03149.faa | YP_204935.1 | 228 | 0.564928 | True | 0.762935 | TIGR03149.model |
2 | TIGR03149.faa | WP_005721361.1 | 226 | 0.727920 | True | 0.875304 | TIGR03149.model |
3 | TIGR03149.faa | NP_798306.1 | 228 | 0.467964 | True | 0.669274 | TIGR03149.model |
4 | TIGR03149.faa | NP_439225.1 | 225 | 0.591324 | True | 0.785015 | TIGR03149.model |
5 | TIGR03149.faa | WP_045102325.1 | 228 | 0.568916 | True | 0.766369 | TIGR03149.model |
6 | TIGR03149.faa | WP_017820467.1 | 228 | 0.368263 | True | 0.556724 | TIGR03149.model |
7 | TIGR03149.faa | WP_010944374.1 | 225 | 0.655749 | True | 0.832490 | TIGR03149.model |
8 | TIGR03149.faa | WP_014992036.1 | 225 | 0.639143 | True | 0.821116 | TIGR03149.model |
9 | TIGR03149.faa | WP_011218001.1 | 230 | 0.370323 | True | 0.559154 | TIGR03149.model |
10 | TIGR03149.faa | WP_004744440.1 | 228 | 0.589265 | True | 0.783347 | TIGR03149.model |
11 | TIGR03149.faa | WP_005542788.1 | 225 | 0.616877 | True | 0.804929 | TIGR03149.model |
12 | TIGR03149.faa | WP_012550556.1 | 228 | 0.552373 | True | 0.751899 | TIGR03149.model |
13 | TIGR03149.faa | WP_038513055.1 | 229 | 0.579075 | True | 0.774958 | TIGR03149.model |
14 | TIGR03149.faa | WP_012604317.1 | 228 | 0.707271 | True | 0.864114 | TIGR03149.model |
15 | TIGR03149.faa | WP_006249704.1 | 225 | 0.666339 | True | 0.839438 | TIGR03149.model |
16 | TIGR03149.faa | WP_044623667.1 | 230 | 0.383099 | True | 0.574164 | TIGR03149.model |
17 | TIGR03149.faa | WP_011200980.1 | 226 | 0.583368 | True | 0.778520 | TIGR03149.model |
18 | TIGR03149.faa | WP_012072459.1 | 226 | 0.583059 | True | 0.778265 | TIGR03149.model |
19 | TIGR03149.faa | WP_026210921.1 | 225 | 0.621285 | True | 0.808220 | TIGR03149.model |
20 | TIGR03149.faa | WP_017027946.1 | 228 | 0.320655 | True | 0.500022 | TIGR03149.model |
21 | TIGR03149.faa | WP_065430736.1 | 231 | 0.537383 | True | 0.738279 | TIGR03149.model |
22 | TIGR03149.faa | NP_313081.1 | 223 | 1.013046 | True | 0.964869 | TIGR03149.model |
23 | TIGR03149.faa | NP_418496.1 | 223 | 1.013046 | True | 0.964869 | TIGR03149.model |
24 | TIGR03149.faa | NP_709848.1 | 223 | 1.013046 | True | 0.964869 | TIGR03149.model |
25 | TIGR03149.faa | YP_006781347.1 | 223 | 1.013046 | True | 0.964869 | TIGR03149.model |
26 | TIGR03149.faa | YP_002410367.1 | 223 | 1.013046 | True | 0.964869 | TIGR03149.model |
27 | TIGR03149.faa | YP_006122421.1 | 223 | 1.013046 | True | 0.964869 | TIGR03149.model |
28 | TIGR03149.faa | WP_005712433.1 | 225 | 0.620002 | True | 0.807266 | TIGR03149.model |
29 | TIGR03149.faa | WP_038636309.1 | 223 | 0.986725 | True | 0.960341 | TIGR03149.model |
30 | TIGR03149.faa | NP_463144.1 | 223 | 0.947326 | True | 0.952504 | TIGR03149.model |
31 | TIGR03149.faa | WP_047782341.1 | 223 | 0.423274 | True | 0.620361 | TIGR03149.model |
32 | TIGR03149.faa | WP_004922233.1 | 223 | 0.653444 | True | 0.830947 | TIGR03149.model |
33 | TIGR03149.faa | NP_458577.1 | 223 | 0.884588 | True | 0.936925 | TIGR03149.model |
34 | TIGR03149.faa | WP_015840643.1 | 223 | 0.770016 | True | 0.895678 | TIGR03149.model |
35 | TIGR03149.faa | WP_024486115.1 | 223 | 0.525474 | True | 0.727121 | TIGR03149.model |
36 | TIGR03149.faa | WP_043081751.1 | 223 | 0.647998 | True | 0.827255 | TIGR03149.model |
37 | TIGR03149.faa | WP_015431895.1 | 228 | 0.377642 | True | 0.567767 | TIGR03149.model |
38 | TIGR03149.faa | YP_856980.1 | 230 | 0.526662 | True | 0.728247 | TIGR03149.model |
Each column summarizes the following information for a given sequence:
Column Name |
Description |
---|---|
filename |
Name of file containing given sequence |
sequence_id |
Sequence ID, as taken from FASTA file |
sequence_length |
Length of sequence (i.e. number of characters long the sequence is) |
score |
Score for the assignment of the sequence into the desired family based on the family model. Scores are summed from kmer contributions using the family scoring model |
in_family |
True if in-family assignment is predicted (probability > 0.5); False otherwise. |
probability |
Probability of in-family assignment. 1.0 = maximum probability for a given sequence’s assignment into a family; 0.0 = lowest probability; 0.5 = equal probability of in-family vs. out-of-family assignment |
model |
Name of model used to evaluate the sequence |
In this example application, the in_family and probability columns are of particular importance to the user. The in_family column tells the user whether a particular input sequence is given an in-family assignment based on the machine learning model which has been previously trained, and the probability informs the user of the in-family assignment probability determined by the same model.
In the case of the demo files, since all of the sequences above belong to the given family, and by design of the demo were also the families used to train the family model, the sequences all have relatively high probabilities and consistently do get assigned to the protein family. However, in this demonstration, we hope it is apparent to the user how trained protein family models can be applied to a new sequence set and the resulting snekmer search
results evaluated.
Conclusion
In conclusion, we have demonstrated Snekmer’s utility across 3 potential operating models (cluster, model, and search). We hope this tutorial provides useful information to users regarding applying Snekmer to their own sequence sets and interpreting results from Snekmer.