Model / Cluster / Search Tutorial

This tutorial walks through three Snekmer modes (Cluster, Model, and Search) using the demo data included in the repository.

Mode     What it does
cluster  Unsupervised clustering of sequences by k-mer profile. Produces a cluster assignment table and optional figures (t-SNE, UMAP, PCA).
model    Trains one-vs-rest supervised models from annotated sequences. Produces per-family classification models and K-fold cross-validation figures (ROC AUC, PR AUC).
search   Scores new sequences against the models trained by snekmer model. Produces per-family annotation probability tables.

Demo data

Input files are in resources/demo_sequences/model_cluster_search_inputs/:

model_cluster_search_inputs/
├── nirS.faa        ← nitrite reductase family sequences
├── nxrA.faa        ← nitrite oxidoreductase family sequences
└── TIGR03149.faa   ← TIGRFAM 03149 family sequences

All commands below assume you are running from the demo workspace directory:

cd resources/model_cluster_search_demo

Configuration

A config.yaml is required in the working directory. The demo includes one pre-configured for these three input families:

cat config.yaml

Key parameters:

Parameter                          Default                Description
k                                  8                      K-mer length
alphabet                           2 (solvacc)            Amino acid reduction alphabet (0–5 or name)
cluster.method                     agglomerative-jaccard  Clustering algorithm
cluster.params.distance_threshold  0.92                   Jaccard distance threshold for cluster splitting
model.cv                           5                      K-fold cross-validation folds

See Setting up User Configuration (config.yaml) for the full parameter reference.
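
To make the k and distance_threshold parameters more concrete, the sketch below computes the Jaccard distance between the k-mer sets of two sequences. It is an illustration of the general idea only, not Snekmer's implementation: the function names and the toy sequences are invented for this example, and no alphabet reduction is applied.

```python
from __future__ import annotations


def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Jaccard distance: 1 - |A & B| / |A | B| (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Toy sequences invented for illustration (not taken from the demo data).
seq_a = "MKTAYIAKQRQISFVK"
seq_b = "MKTAYIAKQRQLSFVK"  # differs from seq_a at a single position
k = 8  # matches the default k in the table above

d = jaccard_distance(kmer_set(seq_a, k), kmer_set(seq_b, k))
print(f"Jaccard distance at k={k}: {d:.2f}")
```

A single substitution changes every k-mer that overlaps it, so even near-identical sequences can sit at a sizeable Jaccard distance when k is large. That is one plausible reason for a default threshold as high as 0.92.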

Running the demo

The run_demo.py script resets the workspace, copies inputs, and runs all three modes in order:

python run_demo.py

Alternatively, copy the input files and config manually, then run each mode individually:

Step 1: Copy inputs and config

# Create the input directory if it does not exist yet
mkdir -p input
cp ../demo_sequences/model_cluster_search_inputs/nirS.faa      input/
cp ../demo_sequences/model_cluster_search_inputs/nxrA.faa      input/
cp ../demo_sequences/model_cluster_search_inputs/TIGR03149.faa input/
cp ../config.yaml ./config.yaml

Step 2: Cluster

snekmer cluster --scheduler greedy --configfile=./config.yaml

Step 3: Model

snekmer model --scheduler greedy --configfile=./config.yaml

Step 4: Search

Assuming search accepts the same flags as the other two modes:

snekmer search --scheduler greedy --configfile=./config.yaml

Output structure

After running all three modes, the output directory contains:

output/
├── cluster/
│   ├── snekmer.csv               ← cluster assignment table (one row per sequence)
│   └── figures/                  ← t-SNE / UMAP / PCA plots (if enabled)
├── kmerize/
│   ├── nirS.kmers
│   ├── nxrA.kmers
│   └── TIGR03149.kmers
├── model/
│   ├── nirS.model
│   ├── nxrA.model
│   └── TIGR03149.model
├── scoring/
│   ├── sequences/                ← per-family sequence probability scores (.csv.gz)
│   ├── weights/                  ← per-family k-mer weights (.csv.gz)
│   └── *.scorer                  ← scorer objects
├── search/                       ← search results (one CSV per family)
├── Snekmer_Cluster_Report.html
└── Snekmer_Model_Report.html
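
A quick way to confirm that a run completed is to check for the files named in the tree above. The helper below is a minimal sketch: the function names are hypothetical, and it only checks the paths whose exact names appear in the tree (the scoring and search contents are skipped because their file names are not fully spelled out there).

```python
from pathlib import Path

# Family names taken from the demo input files.
FAMILIES = ["nirS", "nxrA", "TIGR03149"]


def expected_outputs(output_dir="output", families=FAMILIES):
    """List the output paths named explicitly in the directory tree."""
    root = Path(output_dir)
    paths = [
        root / "cluster" / "snekmer.csv",
        root / "Snekmer_Cluster_Report.html",
        root / "Snekmer_Model_Report.html",
    ]
    for family in families:
        paths.append(root / "kmerize" / f"{family}.kmers")
        paths.append(root / "model" / f"{family}.model")
    return paths


def missing_outputs(output_dir="output", families=FAMILIES):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, families) if not p.exists()]


# Report anything that is missing relative to the current directory.
for path in missing_outputs():
    print(f"missing: {path}")
```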

Reading cluster results

output/cluster/snekmer.csv has one row per input sequence:

import pandas as pd
df = pd.read_csv("output/cluster/snekmer.csv")
print(df.head())

Column       Description
sequence_id  Sequence identifier from the FASTA header
filename     Source FASTA file (used as family label)
cluster      Integer cluster assignment
kmer_*       K-mer feature values (one column per k-mer)
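
A common first question is how the families are distributed across clusters. The sketch below counts sequences per (filename, cluster) pair using the column names listed above. It streams the file with the standard library so the wide kmer_* columns never need to be held in a DataFrame. The function name is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def cluster_composition(csv_path):
    """Count sequences for each (filename, cluster) pair in the cluster table."""
    counts = Counter()
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Cluster values are kept as the strings read from the CSV.
            counts[(row["filename"], row["cluster"])] += 1
    return counts


# Only runs if the cluster step has already produced its table.
table = Path("output/cluster/snekmer.csv")
if table.exists():
    for (family, cluster), n in sorted(cluster_composition(table).items()):
        print(f"{family}\tcluster {cluster}\t{n} sequences")
```

If each family falls mostly into its own cluster, the k-mer profiles separate the families well at the chosen distance threshold.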

Reading model / search results

For each family, output/scoring/sequences/<family>.csv.gz contains the probability score for every sequence:

import pandas as pd

family = "nirS"
df = pd.read_csv(f"output/scoring/sequences/{family}.csv.gz")
print(df[["sequence_id", "label", f"{family}_score"]].head())
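
To summarise a score table, one option is to compute, for each label, the fraction of sequences that score at or above a cutoff. The sketch below assumes the label and <family>_score columns shown above. The 0.5 cutoff and the function name are assumptions for illustration, not values taken from Snekmer.

```python
import csv
import gzip
from collections import defaultdict
from pathlib import Path


def fraction_above(path, score_column, threshold=0.5):
    """Return {label: fraction of sequences with score >= threshold}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            totals[row["label"]] += 1
            if float(row[score_column]) >= threshold:
                hits[row["label"]] += 1
    return {label: hits[label] / totals[label] for label in totals}


family = "nirS"
scores = Path(f"output/scoring/sequences/{family}.csv.gz")
if scores.exists():  # only runs after the model step has finished
    for label, frac in fraction_above(scores, f"{family}_score").items():
        print(f"{label}: {frac:.1%} of sequences score >= 0.5 for {family}")
```

For a well-separated family, sequences labelled with that family should mostly fall above the cutoff and the other labels mostly below it.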

For search results, output/search/<family>/<family>.csv contains the annotation probability for unknown sequences:

import pandas as pd

family = "nirS"
df = pd.read_csv(f"output/search/{family}/{family}.csv")
print(df.head())
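
To look at all families at once, the per-family files can be gathered in a loop. The sketch below assumes the output/search/<family>/<family>.csv layout given above and skips families whose file is absent. The function name is hypothetical, and no column names are assumed because the search table's columns are not listed here.

```python
import csv
from pathlib import Path


def load_search_results(search_dir, families):
    """Return {family: list of row dicts} for each family whose CSV exists."""
    results = {}
    for family in families:
        path = Path(search_dir) / family / f"{family}.csv"
        if not path.exists():
            continue  # search may not have been run for this family
        with open(path, newline="") as handle:
            results[family] = list(csv.DictReader(handle))
    return results


families = ["nirS", "nxrA", "TIGR03149"]
for family, rows in load_search_results("output/search", families).items():
    print(f"{family}: {len(rows)} sequences scored")
```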

HTML summary reports for Cluster and Model are written to the output root (Snekmer_Cluster_Report.html, Snekmer_Model_Report.html).

Jupyter notebook

An interactive version of this tutorial (with result visualisations) is available at docs/source/tutorial/snekmer_demo_notebook.ipynb.