Model / Cluster / Search Tutorial
This tutorial walks through three Snekmer modes (Cluster, Model, and Search) using the demo data included in the repository.
| Mode | What it does |
|---|---|
| Cluster | Unsupervised clustering of sequences by k-mer profile. Produces a cluster assignment table and optional figures (t-SNE, UMAP, PCA). |
| Model | Trains one-vs-rest supervised models from annotated sequences. Produces per-family classification models and K-fold cross-validation figures (AUC ROC, PR AUC). |
| Search | Scores new sequences against the models produced by Model. |
Demo data
Input files are in resources/demo_sequences/model_cluster_search_inputs/:
model_cluster_search_inputs/
├── nirS.faa ← nitrite reductase family sequences
├── nxrA.faa ← nitrite oxidoreductase family sequences
└── TIGR03149.faa ← TIGRFAM 03149 family sequences
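Before running anything, it can help to confirm that the three input files are present and non-empty. The following is a minimal sketch, not part of Snekmer; the helper names `count_fasta_records` and `summarize_inputs` are made up for illustration, and the file stem is assumed to serve as the family name.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_inputs(input_dir):
    """Return {family_name: record_count} for every .faa file in input_dir."""
    return {
        fasta.stem: count_fasta_records(fasta)
        for fasta in sorted(Path(input_dir).glob("*.faa"))
    }


if __name__ == "__main__":
    # Path taken from the tutorial text; adjust it if your checkout differs.
    demo_dir = Path("resources/demo_sequences/model_cluster_search_inputs")
    if demo_dir.is_dir():
        for family, n_records in summarize_inputs(demo_dir).items():
            print(f"{family}: {n_records} sequences")
```

Run from the repository root, this should print one line per `.faa` file.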
All commands below assume you are running from the demo workspace directory:
cd resources/model_cluster_search_demo
Configuration
A config.yaml is required in the working directory. The demo includes one
pre-configured for these three input families:
cat config.yaml
Key parameters:
| Parameter | Default | Description |
|---|---|---|
| | | K-mer length |
| | | Amino acid reduction alphabet (0–5 or name) |
| | | Clustering algorithm |
| | | Jaccard distance threshold for cluster splitting |
| | | K-fold cross-validation folds |
See Setting up User Configuration (config.yaml) for the full parameter reference.
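To make the k-mer length and Jaccard distance parameters more concrete, here is a small illustrative sketch of how two sequences can be compared by their k-mer sets. It is not Snekmer's implementation: the toy sequences, the value of `k`, and the helper names are assumptions chosen for illustration, and the amino acid reduction alphabet is not modeled.

```python
def kmer_set(sequence, k):
    """Return the set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy sequences for illustration only (not taken from the demo data).
seq_a = "MKVLAAGIV"
seq_b = "MKVLSAGIV"
k = 3  # illustrative value, not necessarily the demo default

kmers_a = kmer_set(seq_a, k)
kmers_b = kmer_set(seq_b, k)

print(sorted(kmers_a))
print(f"Jaccard distance: {jaccard_distance(kmers_a, kmers_b):.3f}")
```

A larger `k` makes k-mers more specific, so even closely related sequences share fewer of them and their Jaccard distance grows.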
Running the demo
The run_demo.py script resets the workspace, copies inputs, and runs all three modes
in order:
python run_demo.py
Or run each mode individually after copying the input files and config manually:
Step 1: Copy inputs and config
cp ../demo_sequences/model_cluster_search_inputs/nirS.faa input/
cp ../demo_sequences/model_cluster_search_inputs/nxrA.faa input/
cp ../demo_sequences/model_cluster_search_inputs/TIGR03149.faa input/
cp ../config.yaml ./config.yaml
Step 2: Cluster
snekmer cluster --scheduler greedy --configfile=./config.yaml
Step 3: Model
snekmer model --scheduler greedy --configfile=./config.yaml
Step 4: Collect model artifacts (required before Search)
snekmer search needs the per-family .model, .kmers, and .scorer
files collected into a single directory (output/example-model/ by default):
mkdir -p output/example-model
cp output/model/*.model output/example-model/
cp output/kmerize/*.kmers output/example-model/
cp output/scoring/*.scorer output/example-model/
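The same collection step can be sketched in Python, with the added benefit of reporting any family that lacks one of the three artifact types. This is a hedged sketch rather than a Snekmer utility: `collect_artifacts` is a hypothetical helper, and it assumes each artifact is named after its family (for example `nirS.model`, `nirS.kmers`, `nirS.scorer`).

```python
import shutil
from pathlib import Path

# Extension -> source sub-directory, mirroring the cp commands in Step 4.
ARTIFACT_SOURCES = {
    ".model": "model",
    ".kmers": "kmerize",
    ".scorer": "scoring",
}


def collect_artifacts(output_dir="output", dest_name="example-model"):
    """Copy .model, .kmers and .scorer files into output_dir/dest_name.

    Returns {family: [missing extensions]} for families that do not have
    all three artifact types (an empty dict means every family is complete).
    """
    output_dir = Path(output_dir)
    dest = output_dir / dest_name
    dest.mkdir(parents=True, exist_ok=True)

    found = {}  # family name -> set of extensions copied for it
    for ext, subdir in ARTIFACT_SOURCES.items():
        for src in sorted((output_dir / subdir).glob(f"*{ext}")):
            shutil.copy2(src, dest / src.name)
            found.setdefault(src.stem, set()).add(ext)

    required = set(ARTIFACT_SOURCES)
    return {
        family: sorted(required - exts)
        for family, exts in found.items()
        if exts != required
    }


if __name__ == "__main__":
    if Path("output").is_dir():
        incomplete = collect_artifacts()
        for family, missing in incomplete.items():
            print(f"{family} is missing: {', '.join(missing)}")
```

An empty return value means every family found has a complete set of artifacts in the destination directory.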
Step 5: Search
snekmer search --scheduler greedy --configfile=./config.yaml
Output structure
After running all three modes, the output directory contains:
output/
├── cluster/
│ ├── snekmer.csv ← cluster assignment table (one row per sequence)
│ └── figures/ ← t-SNE / UMAP / PCA plots (if enabled)
├── kmerize/
│ ├── nirS.kmers
│ ├── nxrA.kmers
│ └── TIGR03149.kmers
├── model/
│ ├── nirS.model
│ ├── nxrA.model
│ └── TIGR03149.model
├── scoring/
│ ├── sequences/ ← per-family sequence probability scores (.csv.gz)
│ ├── weights/ ← per-family k-mer weights (.csv.gz)
│ └── *.scorer ← scorer objects
├── search/ ← search results (one CSV per family)
├── Snekmer_Cluster_Report.html
└── Snekmer_Model_Report.html
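A quick way to confirm that a run completed is to compare the output directory against the layout above. The sketch below is an assumption-laden check, not a Snekmer feature: `expected_outputs` and `missing_outputs` are hypothetical helpers, and the per-family `.scorer` file name is assumed from Step 4 (the tree only shows `*.scorer`).

```python
from pathlib import Path

FAMILIES = ["nirS", "nxrA", "TIGR03149"]


def expected_outputs(families):
    """List output paths (relative to the output root) shown in the tree above."""
    paths = [
        "cluster/snekmer.csv",
        "Snekmer_Cluster_Report.html",
        "Snekmer_Model_Report.html",
    ]
    for family in families:
        paths.append(f"kmerize/{family}.kmers")
        paths.append(f"model/{family}.model")
        # Assumed naming: one scorer file per family, named after the family.
        paths.append(f"scoring/{family}.scorer")
    return paths


def missing_outputs(output_dir, families):
    """Return the expected relative paths that do not exist under output_dir."""
    root = Path(output_dir)
    return [p for p in expected_outputs(families) if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_outputs("output", FAMILIES)
    if missing:
        print("Missing outputs:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected outputs are present.")
```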
Reading cluster results
output/cluster/snekmer.csv has one row per input sequence:
import pandas as pd
df = pd.read_csv("output/cluster/snekmer.csv")
print(df.head())
| Column | Description |
|---|---|
| | Sequence identifier from the FASTA header |
| | Source FASTA file (used as family label) |
| | Integer cluster assignment |
| | K-mer feature values (one column per k-mer) |
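One common follow-up is to check how well the clusters line up with the source families. The sketch below counts sequences per (cluster, family) pair on a toy table. The header names `sequence_id`, `label`, and `cluster` are placeholders, not confirmed Snekmer column names, so check the header of your own `snekmer.csv` and pass the matching names.

```python
import csv
import io
from collections import Counter


def cluster_composition(rows, cluster_col, label_col):
    """Count how many rows fall into each (cluster, label) pair."""
    return Counter((row[cluster_col], row[label_col]) for row in rows)


# Toy table standing in for output/cluster/snekmer.csv.
# The header names below are placeholders, not confirmed Snekmer column names.
toy_csv = io.StringIO(
    "sequence_id,label,cluster\n"
    "seq1,nirS,0\n"
    "seq2,nirS,0\n"
    "seq3,nxrA,1\n"
    "seq4,nxrA,0\n"
)
rows = list(csv.DictReader(toy_csv))
composition = cluster_composition(rows, cluster_col="cluster", label_col="label")

for (cluster, label), n in sorted(composition.items()):
    print(f"cluster {cluster}: {n} x {label}")
```

To apply this to real output, replace the `io.StringIO` object with an opened file handle for `output/cluster/snekmer.csv`. A cluster dominated by a single family suggests that the k-mer profile separates that family well.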
Reading model / search results
For each family, output/scoring/sequences/<family>.csv.gz contains
the probability score for every sequence:
import pandas as pd
family = "nirS"
df = pd.read_csv(f"output/scoring/sequences/{family}.csv.gz")
print(df[["sequence_id", "label", f"{family}_score"]].head())
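To shortlist confident predictions without loading the whole table into pandas, the score file can also be streamed with the standard library. This is a minimal sketch: it relies on the `sequence_id`, `label`, and `<family>_score` columns used in the snippet above, while the helper name `sequences_above` and the 0.5 cutoff are arbitrary choices for illustration, not Snekmer defaults.

```python
import csv
import gzip


def sequences_above(path, family, threshold=0.5):
    """Return (sequence_id, label, score) tuples with score >= threshold.

    Results are sorted from the highest score to the lowest.
    """
    score_col = f"{family}_score"
    hits = []
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            score = float(row[score_col])
            if score >= threshold:
                hits.append((row["sequence_id"], row["label"], score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    import os

    path = "output/scoring/sequences/nirS.csv.gz"
    if os.path.exists(path):
        for seq_id, label, score in sequences_above(path, "nirS"):
            print(f"{seq_id}\t{label}\t{score:.3f}")
```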
For search results, output/search/<family>/<family>.csv contains
the annotation probability for unknown sequences:
import pandas as pd
family = "nirS"
df = pd.read_csv(f"output/search/{family}/{family}.csv")
print(df.head())
HTML summary reports for Cluster and Model are written to the output root
(Snekmer_Cluster_Report.html, Snekmer_Model_Report.html).
Jupyter notebook
An interactive version of this tutorial (with result visualisations) is available at
docs/source/tutorial/snekmer_demo_notebook.ipynb.