.. _model-cluster-search-tutorial:

Model / Cluster / Search Tutorial
==================================

This tutorial walks through three Snekmer modes (**Cluster**, **Model**, and **Search**) using the demo data included in the repository.

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Mode
     - What it does
   * - ``cluster``
     - Unsupervised clustering of sequences by k-mer profile. Produces a cluster assignment table and optional figures (t-SNE, UMAP, PCA).
   * - ``model``
     - Trains one-vs-rest supervised models from annotated sequences. Produces per-family classification models and K-fold cross-validation figures (AUC ROC, PR AUC).
   * - ``search``
     - Scores new sequences against models from ``snekmer model``. Produces per-family annotation probability tables.

Demo data
---------

Input files are in ``resources/demo_sequences/model_cluster_search_inputs/``:

.. code-block:: text

   model_cluster_search_inputs/
   ├── nirS.faa            ← nitrite reductase family sequences
   ├── nxrA.faa            ← nitrite oxidoreductase family sequences
   └── TIGR03149.faa       ← TIGRFAM 03149 family sequences

All commands below assume you are running from the demo workspace directory:

.. code-block:: bash

   cd resources/model_cluster_search_demo

Configuration
-------------

A ``config.yaml`` is required in the working directory. The demo includes one pre-configured for these three input families:

.. code-block:: bash

   cat config.yaml

Key parameters:

.. list-table::
   :header-rows: 1
   :widths: 25 20 55

   * - Parameter
     - Default
     - Description
   * - ``k``
     - ``8``
     - K-mer length
   * - ``alphabet``
     - ``2`` (solvacc)
     - Amino acid reduction alphabet (0–5 or name)
   * - ``cluster.method``
     - ``agglomerative-jaccard``
     - Clustering algorithm
   * - ``cluster.params.distance_threshold``
     - ``0.92``
     - Jaccard distance threshold for cluster splitting
   * - ``model.cv``
     - ``5``
     - K-fold cross-validation folds

See :doc:`../getting_started/config` for the full parameter reference.

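
To build intuition for the ``k`` and ``cluster.params.distance_threshold`` parameters, the sketch below computes the Jaccard distance between the k-mer sets of two sequences. It is an illustration only and not Snekmer's implementation: the helper names (``kmer_set``, ``jaccard_distance``) and the toy sequences are hypothetical, it uses ``k = 3`` rather than the default so the sets stay readable, and it skips the alphabet reduction step entirely.

.. code-block:: python

   def kmer_set(sequence, k):
       """Return the set of all overlapping k-mers in a sequence."""
       return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


   def jaccard_distance(a, b):
       """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
       union = a | b
       if not union:
           # Two empty sets are treated as identical.
           return 0.0
       return 1.0 - len(a & b) / len(union)


   # Hypothetical toy sequences (not taken from the demo data);
   # they differ only in the final residue.
   seq_a = "MKTAYIAKQR"
   seq_b = "MKTAYIAKQL"
   k = 3

   dist = jaccard_distance(kmer_set(seq_a, k), kmer_set(seq_b, k))
   print(f"Jaccard distance at k={k}: {dist:.3f}")

A distance of 0 means the two sequences share every k-mer, and 1 means they share none. Under this reading, a threshold as high as ``0.92`` would allow sequences with fairly small k-mer overlap to end up in the same cluster, though how the threshold is applied depends on the clustering method.
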
Running the demo
----------------

The ``run_demo.py`` script resets the workspace, copies inputs, and runs all three modes in order:

.. code-block:: bash

   python run_demo.py

Or run each mode individually after copying the input files and config manually:

Step 1: Copy inputs and config
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cp ../demo_sequences/model_cluster_search_inputs/nirS.faa      input/
   cp ../demo_sequences/model_cluster_search_inputs/nxrA.faa      input/
   cp ../demo_sequences/model_cluster_search_inputs/TIGR03149.faa input/
   cp ../config.yaml ./config.yaml

Step 2: Cluster
~~~~~~~~~~~~~~~

.. code-block:: bash

   snekmer cluster --scheduler greedy --configfile=./config.yaml

Step 3: Model
~~~~~~~~~~~~~

.. code-block:: bash

   snekmer model --scheduler greedy --configfile=./config.yaml

Step 4: Collect model artifacts (required before Search)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``snekmer search`` needs the per-family ``.model``, ``.kmers``, and ``.scorer`` files collected into a single directory (``output/example-model/`` by default):

.. code-block:: bash

   mkdir -p output/example-model
   cp output/model/*.model    output/example-model/
   cp output/kmerize/*.kmers  output/example-model/
   cp output/scoring/*.scorer output/example-model/

Step 5: Search
~~~~~~~~~~~~~~

.. code-block:: bash

   snekmer search --scheduler greedy --configfile=./config.yaml

Output structure
----------------

After running all three modes the output directory contains:

.. code-block:: text

   output/
   ├── cluster/
   │   ├── snekmer.csv          ← cluster assignment table (one row per sequence)
   │   └── figures/             ← t-SNE / UMAP / PCA plots (if enabled)
   ├── kmerize/
   │   ├── nirS.kmers
   │   ├── nxrA.kmers
   │   └── TIGR03149.kmers
   ├── model/
   │   ├── nirS.model
   │   ├── nxrA.model
   │   └── TIGR03149.model
   ├── scoring/
   │   ├── sequences/           ← per-family sequence probability scores (.csv.gz)
   │   ├── weights/             ← per-family k-mer weights (.csv.gz)
   │   └── *.scorer             ← scorer objects
   ├── search/                  ← search results (one CSV per family)
   ├── Snekmer_Cluster_Report.html
   └── Snekmer_Model_Report.html

Reading cluster results
-----------------------

``output/cluster/snekmer.csv`` has one row per input sequence:

.. code-block:: python

   import pandas as pd

   df = pd.read_csv("output/cluster/snekmer.csv")
   print(df.head())

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``sequence_id``
     - Sequence identifier from the FASTA header
   * - ``filename``
     - Source FASTA file (used as family label)
   * - ``cluster``
     - Integer cluster assignment
   * - ``kmer_*``
     - K-mer feature values (one column per k-mer)

Reading model / search results
------------------------------

For each family, ``output/scoring/sequences/<family>.csv.gz`` contains the probability score for every sequence:

.. code-block:: python

   import pandas as pd

   family = "nirS"
   df = pd.read_csv(f"output/scoring/sequences/{family}.csv.gz")
   print(df[["sequence_id", "label", f"{family}_score"]].head())

For search results, ``output/search/<family>/<family>.csv`` contains the annotation probability for unknown sequences:

.. code-block:: python

   family = "nirS"
   df = pd.read_csv(f"output/search/{family}/{family}.csv")
   print(df.head())

HTML summary reports for Cluster and Model are written to the output root (``Snekmer_Cluster_Report.html``, ``Snekmer_Model_Report.html``).

Jupyter notebook
----------------

An interactive version of this tutorial (with result visualisations) is available at ``docs/source/tutorial/snekmer_demo_notebook.ipynb``.
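
As a follow-up to the cluster table described under "Reading cluster results", a quick check of how the input families are distributed across clusters can be sketched with the standard library alone. The example below is a sketch under assumptions: it relies only on the documented ``filename`` and ``cluster`` columns, the in-memory sample rows are hypothetical stand-ins for ``output/cluster/snekmer.csv``, and ``family_counts_per_cluster`` is an illustrative helper, not part of Snekmer.

.. code-block:: python

   import csv
   import io
   from collections import Counter

   # Hypothetical stand-in for output/cluster/snekmer.csv, restricted to
   # the documented columns. Real files also carry kmer_* feature columns,
   # and the exact form of the filename values may differ.
   sample = io.StringIO(
       "sequence_id,filename,cluster\n"
       "seq1,nirS.faa,0\n"
       "seq2,nirS.faa,0\n"
       "seq3,nxrA.faa,1\n"
       "seq4,TIGR03149.faa,2\n"
       "seq5,nxrA.faa,0\n"
   )


   def family_counts_per_cluster(handle):
       """Map each cluster label to a Counter of source files (families)."""
       counts = {}
       for row in csv.DictReader(handle):
           counts.setdefault(row["cluster"], Counter())[row["filename"]] += 1
       return counts


   counts = family_counts_per_cluster(sample)
   for cluster, families in sorted(counts.items()):
       print(cluster, dict(families))

To run this against real output, replace ``sample`` with ``open("output/cluster/snekmer.csv", newline="")``. A cluster dominated by a single source file suggests that the family separates cleanly at the chosen distance threshold.
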