Learn/Apply: Full Pipeline Reference

Note

New users: snekmer easy is the recommended entry point. It runs the complete pipeline with a single command and no manual setup. See the easy tutorial.

This page is for users who need direct control over snekmer learn and snekmer apply, for example:

  • Incrementally adding new training sequences to an existing model

  • Reusing a trained model against multiple query sets

  • Customising intermediate pipeline steps

When to use learn / apply directly

Situation                                             Recommendation
First time running, want results fast                 easy
Existing config.yaml and directory layout             snekmer learn then snekmer apply
Adding new training data to an existing model         snekmer learn then snekmer apply
Reusing a trained model against new query sequences   snekmer apply only (skip learn)

Demo data

The commands below use the demo data in resources/demo_sequences/learn_apply_inputs/:

resources/demo_sequences/learn_apply_inputs/
├── learn/                        ← 10 training FASTA files (10,000 proteins, 200 families)
│   ├── training_sequences_1.fasta
│   ├── ...
│   └── training_sequences_10.fasta
├── apply/
│   └── test_sequences_1.fasta    ← 3,000 query proteins
└── annotations/
    └── TIGRFAMs_annotation.ann   ← id/family TSV

All commands assume you are running from the root of the Snekmer repository.

Directory layout

snekmer learn and snekmer apply each require their own working directory with a specific structure. easy creates these automatically; when using the modes directly you build them yourself.

learn workspace

learn/
├── config.yaml
├── annotations/
│   └── annotations.ann       ← tab-separated id / family file
└── input/
    ├── training_sequences_1.fasta
    └── ...

apply workspace

apply/
├── config.yaml
├── input/
│   └── test_sequences_1.fasta
├── counts/
│   └── kmer_counts_total.csv      ← copied from learn output
├── confidence/
│   └── global_confidence_scores.csv   ← copied from learn output
└── stats/
    └── family_summary_stats.csv   ← copied from learn output

Step 1: Set up the learn workspace

mkdir -p learn/input learn/annotations

# Copy training sequences into the workspace
cp resources/demo_sequences/learn_apply_inputs/learn/*.fasta learn/input/

# Copy annotation file
cp resources/demo_sequences/learn_apply_inputs/annotations/TIGRFAMs_annotation.ann \
   learn/annotations/annotations.ann

Note

On Linux/macOS you can use symlinks instead of copying to save disk space:

ln -s "$(pwd)/resources/demo_sequences/learn_apply_inputs/learn/"*.fasta learn/input/

On Windows, creating symlinks typically requires administrator rights or Developer Mode; copy the files there instead.
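Before running learn, it can help to confirm that the workspace matches the layout described above. The following is a minimal sketch, not part of Snekmer: check_learn_workspace is a hypothetical helper name, and it assumes that training files end in .fasta and that each sequence starts with a '>' header line.

```shell
# check_learn_workspace [DIR]
# Hypothetical helper (not part of Snekmer): checks the learn workspace
# layout and reports how many FASTA files and sequences were found.
check_learn_workspace() {
    ws="${1:-learn}"
    status=0

    # The annotation file must exist and be non-empty.
    if [ ! -s "$ws/annotations/annotations.ann" ]; then
        echo "missing or empty: $ws/annotations/annotations.ann" >&2
        status=1
    fi

    n_files=0
    n_seqs=0
    for f in "$ws"/input/*.fasta; do
        # Skip an unmatched glob pattern or a dangling symlink.
        if [ ! -e "$f" ]; then continue; fi
        n_files=$((n_files + 1))
        # Every FASTA record starts with a '>' header line.
        n=$(grep -c '^>' "$f" || true)
        n_seqs=$((n_seqs + n))
    done

    if [ "$n_files" -eq 0 ]; then
        echo "no .fasta files in $ws/input" >&2
        status=1
    fi

    echo "$n_files FASTA file(s), $n_seqs sequence(s)"
    return "$status"
}
```

Calling check_learn_workspace learn from the repository root prints the counts and returns a non-zero status if the annotation file or the FASTA files are missing. For the demo data, the counts should correspond to the 10 training files listed under Demo data.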

Step 2: Run snekmer learn

If you have a config.yaml in the learn/ directory:

snekmer learn -d learn

Without a config file (uses built-in defaults):

snekmer learn --no-default-configfile -d learn

To preview what will run without executing:

snekmer learn --no-default-configfile --dry-run -d learn

Tip

Use the same --k and --alphabet values for both learn and apply. Mismatched encoding parameters will produce incorrect results.
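One way to follow this tip is to define the encoding parameters once and reuse them for both modes. The sketch below is an illustration under assumptions: run_snekmer and the SNEKMER variable are hypothetical names, the values are the defaults from Key parameters, and it assumes --k and --alphabet can be combined with -d as shown.

```shell
# Encoding parameters, defined once and shared by learn and apply.
# The values are the defaults listed under "Key parameters".
K=8
ALPHABET=2

# Executable to call; override (e.g. SNEKMER=echo) to print the
# arguments instead of running the pipeline.
SNEKMER="${SNEKMER:-snekmer}"

# run_snekmer MODE WORKDIR
# Hypothetical wrapper: passes the same --k and --alphabet to either mode.
run_snekmer() {
    "$SNEKMER" "$1" --k "$K" --alphabet "$ALPHABET" -d "$2"
}

# Usage:
#   run_snekmer learn learn
#   run_snekmer apply apply
```

Because both calls read the same two variables, changing K or ALPHABET in one place changes it for both modes.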

learn writes its outputs to learn/output/ and creates a convenience learn/apply_inputs/ directory alongside it:

learn/
├── apply_inputs/           ← ready-to-use handoff files for snekmer apply
│   ├── counts/kmer_counts_total.csv
│   ├── stats/family_summary_stats.csv
│   └── confidence/global_confidence_scores.csv
└── output/
    ├── kmerize/    ← per-file k-mer labels (.kmers)
    ├── vector/     ← per-file k-mer vectors (.npz)
    ├── learn/      ← per-file and merged k-mer count matrices
    └── eval_conf/  ← confidence scores and family statistics

Step 3: Copy learn outputs into the apply workspace

mkdir -p apply/input apply/counts apply/confidence apply/stats

# Query sequences (copied here; on Linux/macOS a symlink works too)
cp resources/demo_sequences/learn_apply_inputs/apply/test_sequences_1.fasta apply/input/

# Handoff files from learn (apply_inputs/ is at the root of the learn workspace)
cp learn/apply_inputs/counts/kmer_counts_total.csv             apply/counts/
cp learn/apply_inputs/confidence/global_confidence_scores.csv  apply/confidence/
cp learn/apply_inputs/stats/family_summary_stats.csv           apply/stats/
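A missing handoff file is easy to overlook, so a quick check of the apply workspace before running apply may save a failed run. This is a minimal sketch: check_apply_workspace is a hypothetical helper, not a Snekmer command, and it only tests that the three files named above and at least one .fasta query file are present.

```shell
# check_apply_workspace [DIR]
# Hypothetical helper (not part of Snekmer): verifies that the three
# handoff files and at least one query FASTA file are in place.
check_apply_workspace() {
    ws="${1:-apply}"
    missing=0

    for rel in \
        counts/kmer_counts_total.csv \
        confidence/global_confidence_scores.csv \
        stats/family_summary_stats.csv
    do
        if [ ! -s "$ws/$rel" ]; then
            echo "missing or empty: $ws/$rel" >&2
            missing=$((missing + 1))
        fi
    done

    found=0
    for f in "$ws"/input/*.fasta; do
        if [ -e "$f" ]; then found=1; fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no .fasta files in $ws/input" >&2
        missing=$((missing + 1))
    fi

    if [ "$missing" -ne 0 ]; then return 1; fi
    echo "apply workspace looks complete: $ws"
}
```

Running check_apply_workspace apply returns a non-zero status and names each missing file on stderr when the workspace is incomplete.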

Step 4: Run snekmer apply

snekmer apply -d apply

Or without a config file:

snekmer apply --no-default-configfile -d apply

Results are written to apply/snekmer_results.csv (one row per query sequence):

Sequence                          Prediction   Score    delta   Confidence
tr|A0A2S8EUS7|A0A2S8EUS7_9RHOB   TIGR01783    0.199    0.00    0.383
tr|A0A401ZGP4|A0A401ZGP4_9CHLR   TIGR00757    0.316    0.02    0.922
tr|A0A427BXE3|A0A427BXE3_9GAMM   TIGR01023    0.198    0.08    1.000
...

See Snekmer easy Tutorial for a description of each output column and guidance on filtering by confidence score.
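As an illustration of confidence filtering, the sketch below keeps only rows at or above a chosen confidence value. It rests on assumptions not confirmed on this page: that the results file is comma-separated, has a header row containing a column named Confidence, and contains no quoted fields with embedded commas. filter_by_confidence is a hypothetical helper name.

```shell
# filter_by_confidence FILE MIN
# Hypothetical helper: prints the header plus every row whose Confidence
# value is at least MIN. Assumes a simple comma-separated file.
filter_by_confidence() {
    awk -F',' -v min="$2" '
        # Strip a trailing carriage return (files written on Windows).
        { sub(/\r$/, "") }
        NR == 1 {
            # Locate the Confidence column by name rather than position.
            for (i = 1; i <= NF; i++) if ($i == "Confidence") col = i
            if (!col) {
                print "no Confidence column found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        ($col + 0) >= (min + 0)
    ' "$1"
}

# Usage:
#   filter_by_confidence apply/snekmer_results.csv 0.9 > high_confidence.csv
```

The cutoff of 0.9 in the usage line is an arbitrary example, not a recommended threshold.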

Key parameters

The most commonly adjusted parameters are listed below. Pass them as CLI flags or set them in config.yaml; see Setting up User Configuration (config.yaml) for the full reference.

Parameter                       Default               Description
--k / k                         8                     K-mer length
--alphabet / alphabet           2 (solvacc)           Amino acid reduction alphabet (0–5 or name)
--selection / selection         top_hit               Annotation selection method: top_hit, greatest_distance, combined_distance
--threshold / threshold         Median                Family score threshold for filtering: Median, Mean, 90th Percentile, None
--apply-output / apply_output   snekmer_results.csv   Output filename for the results CSV

For the full list of options run snekmer learn --help or snekmer apply --help, or see All Options in the CLI reference.

Reusing a trained model

If you already have a learn/ workspace with valid apply_inputs/ (e.g., from a previous run), you can skip snekmer learn and run snekmer apply against any new set of query sequences. Replace the files in apply/input/ with the new query sequences and re-run snekmer apply.

To add new training families to an existing model, re-run snekmer learn pointing to the expanded training set. The merged counts matrix accumulates across runs.
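When one trained model is reused against several query sets, an alternative to updating apply/input/ in place is to build a separate apply workspace per query set, so that each run writes its own snekmer_results.csv. The sketch below repeats the copy steps from Step 3 inside a function. new_apply_workspace and the query_sets/ directory are hypothetical names used only for illustration.

```shell
# new_apply_workspace LEARN_DIR QUERY_FASTA APPLY_DIR
# Hypothetical helper: creates an apply workspace for one query file and
# copies in the three handoff files from an existing learn workspace.
new_apply_workspace() {
    learn_dir="$1"
    query="$2"
    apply_dir="$3"
    handoff="$learn_dir/apply_inputs"

    mkdir -p "$apply_dir/input" "$apply_dir/counts" \
             "$apply_dir/confidence" "$apply_dir/stats" || return 1

    cp "$query" "$apply_dir/input/" || return 1
    cp "$handoff/counts/kmer_counts_total.csv" "$apply_dir/counts/" || return 1
    cp "$handoff/confidence/global_confidence_scores.csv" \
       "$apply_dir/confidence/" || return 1
    cp "$handoff/stats/family_summary_stats.csv" "$apply_dir/stats/" || return 1
}

# Usage (query_sets/ is a placeholder directory of query FASTA files):
#   for q in query_sets/*.fasta; do
#       ws="apply_$(basename "$q" .fasta)"
#       new_apply_workspace learn "$q" "$ws" && snekmer apply -d "$ws"
#   done
```

The function stops at the first failed copy, so snekmer apply in the usage loop only runs for workspaces that were populated completely.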

Deep-dive notebook

For a step-by-step walkthrough of every internal pipeline rule (vectorization, k-mer count matrix construction, reversed-sequence decoy evaluation, and confidence calibration), see the companion notebook:

docs/source/tutorial/snekmer_learn_apply_tutorial.ipynb

This notebook exposes the Python code behind each Snakemake rule and is intended for users who want to understand the method in detail or adapt intermediate outputs.