Learn/Apply: Full Pipeline Reference
Note
New users: snekmer easy is the recommended entry point.
It runs the complete pipeline with a single command and no manual setup.
See the easy tutorial.
This page is for users who need direct control over snekmer learn
and snekmer apply, for example:
- Incrementally adding new training sequences to an existing model
- Reusing a trained model against multiple query sets
- Customising intermediate pipeline steps
When to use learn / apply directly
| Situation | Recommendation |
|---|---|
| First time running, want results fast | snekmer easy |
| Existing | |
| Adding new training data to an existing model | snekmer learn |
| Reusing a trained model against new query sequences | snekmer apply |
Demo data
The commands below use the demo data in resources/demo_sequences/learn_apply_inputs/:
resources/demo_sequences/learn_apply_inputs/
├── learn/ ← 10 training FASTA files (10,000 proteins, 200 families)
│ ├── training_sequences_1.fasta
│ ├── ...
│ └── training_sequences_10.fasta
├── apply/
│ └── test_sequences_1.fasta ← 3,000 query proteins
└── annotations/
└── TIGRFAMs_annotation.ann ← id/family TSV
All commands assume you are running from the root of the Snekmer repository.
Directory layout
snekmer learn and snekmer apply each require their own working directory
with a specific structure. snekmer easy creates these automatically;
when you run the modes directly, you build them yourself.
learn workspace
learn/
├── config.yaml
├── annotations/
│ └── annotations.ann ← tab-separated id / family file
└── input/
├── training_sequences_1.fasta
└── ...
apply workspace
apply/
├── config.yaml
├── input/
│ └── test_sequences_1.fasta
├── counts/
│ └── kmer_counts_total.csv ← copied from learn output
├── confidence/
│ └── global_confidence_scores.csv ← copied from learn output
└── stats/
└── family_summary_stats.csv ← copied from learn output
Step 1: Set up the learn workspace
mkdir -p learn/input learn/annotations
# Copy training sequences into the workspace
cp resources/demo_sequences/learn_apply_inputs/learn/*.fasta learn/input/
# Copy annotation file
cp resources/demo_sequences/learn_apply_inputs/annotations/TIGRFAMs_annotation.ann \
learn/annotations/annotations.ann
Note
On Linux/macOS you can use symlinks instead of copying to save disk space:
ln -s "$(pwd)/resources/demo_sequences/learn_apply_inputs/learn/"*.fasta learn/input/
Symlinks on Windows usually require administrator rights or Developer Mode; copy the files there instead.
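Before running learn, it can help to confirm that the annotation file really is tab-separated. The sketch below assumes the two-column id / family layout described above (a header line, if present, would also have two columns); check_annotations is a hypothetical helper, not part of Snekmer.

```shell
# Sketch: sanity-check a tab-separated id/family annotation file.
# Assumption: every non-blank line has exactly two tab-separated fields.
# check_annotations is a hypothetical helper name, not a Snekmer command.
check_annotations() {
    ann_file="$1"
    if [ ! -s "$ann_file" ]; then
        echo "missing or empty: $ann_file" >&2
        return 1
    fi
    # Report each malformed line; exit non-zero if any was found.
    awk -F '\t' '
        /^[[:space:]]*$/ { next }    # ignore blank lines
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected id<TAB>family\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$ann_file"
}
```

For the demo workspace this would be called as check_annotations learn/annotations/annotations.ann. If your annotation file carries additional columns, adjust the NF test accordingly.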
Step 2: Run snekmer learn
If you have a config.yaml in the learn/ directory:
snekmer learn -d learn
Without a config file (uses built-in defaults):
snekmer learn --no-default-configfile -d learn
To preview what will run without executing:
snekmer learn --no-default-configfile --dry-run -d learn
Tip
Use the same --k and --alphabet values for both learn and apply.
Mismatched encoding parameters will produce incorrect results.
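When both workspaces carry a config.yaml, a quick comparison can catch mismatched encoding parameters before apply runs. This is a minimal sketch that assumes k and alphabet are stored as top-level "key: value" lines in config.yaml; config_value and same_encoding are hypothetical helpers. It does not cover values passed as CLI flags or taken from built-in defaults.

```shell
# Sketch: compare k and alphabet between two config.yaml files.
# Assumption: both are top-level "key: value" lines (not nested YAML).
# config_value and same_encoding are hypothetical helper names.

# Print the value of the first top-level "KEY: value" line in FILE.
config_value() {
    sed -n "s/^$2:[[:space:]]*//p" "$1" |
        sed 's/[[:space:]]*#.*$//; s/[[:space:]]*$//' |
        head -n 1
}

# Return 0 if k and alphabet match in both files, 1 otherwise.
same_encoding() {
    learn_cfg="$1"
    apply_cfg="$2"
    status=0
    for key in k alphabet; do
        a=$(config_value "$learn_cfg" "$key")
        b=$(config_value "$apply_cfg" "$key")
        if [ "$a" != "$b" ]; then
            echo "mismatch for $key: learn=$a apply=$b" >&2
            status=1
        fi
    done
    return $status
}
```

A typical call would be same_encoding learn/config.yaml apply/config.yaml, run once the apply workspace exists.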
learn writes its outputs to learn/output/ and creates a convenience
learn/apply_inputs/ directory alongside it:
learn/
├── apply_inputs/ ← ready-to-use handoff files for snekmer apply
│ ├── counts/kmer_counts_total.csv
│ ├── stats/family_summary_stats.csv
│ └── confidence/global_confidence_scores.csv
└── output/
├── kmerize/ ← per-file k-mer labels (.kmers)
├── vector/ ← per-file k-mer vectors (.npz)
├── learn/ ← per-file and merged k-mer count matrices
└── eval_conf/ ← confidence scores and family statistics
Step 3: Copy learn outputs into the apply workspace
mkdir -p apply/input apply/counts apply/confidence apply/stats
# Query sequences (copied here; on Linux/macOS a symlink also works)
cp resources/demo_sequences/learn_apply_inputs/apply/test_sequences_1.fasta apply/input/
# Handoff files from learn (apply_inputs/ is at the root of the learn workspace)
cp learn/apply_inputs/counts/kmer_counts_total.csv apply/counts/
cp learn/apply_inputs/confidence/global_confidence_scores.csv apply/confidence/
cp learn/apply_inputs/stats/family_summary_stats.csv apply/stats/
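A missing handoff file is an easy mistake to make at this point, so a short pre-flight check may be worthwhile. The sketch below tests for the three files listed in the apply workspace layout plus at least one query file; check_apply_workspace is a hypothetical helper, and the .fasta extension is an assumption taken from the demo data.

```shell
# Sketch: verify an apply workspace before running snekmer apply.
# check_apply_workspace is a hypothetical helper name.
# Assumption: query files use the .fasta extension, as in the demo data.
check_apply_workspace() {
    ws="${1:-apply}"
    missing=0
    # The three handoff files copied from the learn workspace.
    for f in counts/kmer_counts_total.csv \
             confidence/global_confidence_scores.csv \
             stats/family_summary_stats.csv; do
        if [ ! -s "$ws/$f" ]; then
            echo "missing or empty: $ws/$f" >&2
            missing=1
        fi
    done
    # At least one query FASTA must be present in input/.
    if ! ls "$ws"/input/*.fasta >/dev/null 2>&1; then
        echo "no .fasta files in $ws/input/" >&2
        missing=1
    fi
    return $missing
}
```

Calling check_apply_workspace apply returns a non-zero status and names each missing file if the workspace is incomplete.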
Step 4: Run snekmer apply
snekmer apply -d apply
Or without a config file:
snekmer apply --no-default-configfile -d apply
Results are written to apply/snekmer_results.csv (one row per query sequence):
Sequence Prediction Score delta Confidence
tr|A0A2S8EUS7|A0A2S8EUS7_9RHOB TIGR01783 0.199 0.00 0.383
tr|A0A401ZGP4|A0A401ZGP4_9CHLR TIGR00757 0.316 0.02 0.922
tr|A0A427BXE3|A0A427BXE3_9GAMM TIGR01023 0.198 0.08 1.000
...
See Snekmer easy Tutorial for a description of each output column and guidance on filtering by confidence score.
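As an illustration of confidence filtering, the following sketch keeps only rows whose Confidence value meets a chosen minimum. It assumes the results file is comma-separated with a header row containing a column named Confidence, and that no field contains a quoted comma; filter_by_confidence is a hypothetical helper and any cutoff you pass is your own choice, not a recommended value.

```shell
# Sketch: keep result rows with Confidence >= a minimum value.
# Assumptions: comma-separated file, header row with a "Confidence" column,
# no quoted commas inside fields. filter_by_confidence is a hypothetical name.
filter_by_confidence() {
    results="$1"
    min_conf="$2"
    awk -F ',' -v min="$min_conf" '
        NR == 1 {
            # Locate the Confidence column by name rather than by position.
            for (i = 1; i <= NF; i++) if ($i == "Confidence") col = i
            if (!col) {
                print "no Confidence column found" > "/dev/stderr"
                exit 1
            }
            print    # keep the header
            next
        }
        $col + 0 >= min + 0 { print }
    ' "$results"
}
```

For example, filter_by_confidence apply/snekmer_results.csv 0.9 > high_confidence.csv writes the header and the qualifying rows to a new file (high_confidence.csv is an arbitrary name).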
Key parameters
The most commonly adjusted parameters. Pass them as CLI flags or set them in
config.yaml; see Setting up User Configuration (config.yaml) for the full reference.
| Parameter | Default | Description |
|---|---|---|
| --k | | K-mer length |
| --alphabet | | Amino acid reduction alphabet (0–5 or name) |
| | | Annotation selection method: |
| | | Family score threshold for filtering: |
| | | Output filename for the results CSV |
For the full list of options run snekmer learn --help or snekmer apply --help,
or see All Options in the CLI reference.
Reusing a trained model
If you already have a learn/ workspace with valid apply_inputs/ (e.g., from a
previous run), you can skip snekmer learn and run snekmer apply against any
new set of query sequences. Just update apply/input/ and re-run apply.
To add new training families to an existing model, re-run snekmer learn pointing
to the expanded training set. The merged counts matrix accumulates across runs.
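Reusing one trained model against several query sets amounts to repeating Step 3 for each query file. The sketch below wraps those copy commands in a function that builds one apply workspace per query FASTA from an existing learn workspace; make_apply_workspace and the queries/ directory in the commented loop are hypothetical names.

```shell
# Sketch: build a fresh apply workspace from an existing learn workspace.
# make_apply_workspace is a hypothetical helper name.
# Usage: make_apply_workspace LEARN_DIR QUERY_FASTA DEST_DIR
make_apply_workspace() {
    learn_dir="$1"
    query="$2"
    dest="$3"
    mkdir -p "$dest/input" "$dest/counts" "$dest/confidence" "$dest/stats" || return 1
    # Copy the query file and the three handoff files; stop at the first failure.
    cp "$query" "$dest/input/" &&
    cp "$learn_dir/apply_inputs/counts/kmer_counts_total.csv" "$dest/counts/" &&
    cp "$learn_dir/apply_inputs/confidence/global_confidence_scores.csv" "$dest/confidence/" &&
    cp "$learn_dir/apply_inputs/stats/family_summary_stats.csv" "$dest/stats/"
}

# Example loop (queries/ is a hypothetical directory of query FASTA files):
# for q in queries/*.fasta; do
#     ws="apply_$(basename "$q" .fasta)"
#     make_apply_workspace learn "$q" "$ws" && snekmer apply -d "$ws"
# done
```

Keeping one workspace per query set means each run writes its own results file instead of overwriting the previous one.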
Deep-dive notebook
For a step-by-step walkthrough of every internal pipeline rule (vectorization, k-mer count matrix construction, reversed-sequence decoy evaluation, and confidence calibration), see the companion notebook:
docs/source/tutorial/snekmer_learn_apply_tutorial.ipynb
This notebook exposes the Python code behind each Snakemake rule and is intended for users who want to understand the method in detail or adapt intermediate outputs.