Command Line Interface
To run any Snekmer operation mode, call:
snekmer {mode}
where {mode} is one of easy, learn, apply,
cluster, model, or search.
easy is the recommended entry point for new users. It runs
the full Learn/Apply pipeline with a single command and no manual directory
setup. See the Quick Start to get started immediately.
General usage follows the pattern:
snekmer <mode> [snakemake arguments] [snekmer parameter overrides]
Snakemake arguments are passed through directly to Snakemake (they are not
Snekmer-specific). Snekmer parameters can be provided via config.yaml /
--configfile, or overridden via Snekmer parameter flags on the command line.
For an overview of Snekmer usage, see the output of the help command (snekmer --help).
$ snekmer --help
Snekmer 1.4.1 — protein sequence fingerprinting via amino acid reduction.
Usage:
snekmer <mode> [options]
snekmer <mode> --help
Modes:
cluster Unsupervised clustering workflow.
model Train supervised models + cross-validation reports.
search Score sequences against trained models.
learn Build annotation-associated k-mer distributions + confidence evaluation.
apply Predict annotations using outputs from learn.
easy Guided front-end that runs learn then apply end-to-end.
Global options (accepted by all modes):
--k N K-mer length (default: 8).
--alphabet N Reduced alphabet encoding 0-5 or name (default: 2 = solvacc).
--cores N CPU cores to use (default: all).
--dry-run Show what would be done without executing.
--configfile Path(s) to YAML/JSON config file(s).
-v, --version Print version and exit.
-h, --help Show this help message and exit.
Run 'snekmer <mode> --help' for full options for a specific mode.
Tailored references for the individual operation modes can be accessed
via snekmer {mode} --help. Each subcommand includes only the Snekmer
parameter sections relevant to that mode.
Configuration
Config Precedence
Snekmer resolves configuration using the following precedence order (lowest to highest):
- Default configfile (auto): ./config.yaml (or <DIR>/config.yaml when using -d/--directory).
- Explicit configfiles: Any --configfile PATH values, applied in the order given.
- Snekmer parameter flags: Any Snekmer-specific flags you explicitly provide on the command line (e.g. --k 10, --alphabet hydro).
- Key=Value overrides: Any -C/--config KEY=VALUE overrides (highest precedence).
The defaults shown for Snekmer parameter flags match the template
config.yaml defaults. These defaults are applied automatically only when
no config file is in use, or when a flag is explicitly provided on the command
line.
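The precedence order can be pictured as a layered dictionary merge in which later layers overwrite earlier ones. The following Python sketch is illustrative only: the function name resolve_config, the config keys k and alphabet, and the shallow merge are assumptions made for the example and are not taken from Snekmer's implementation.

```python
def resolve_config(default_file=None, explicit_files=(), cli_flags=None,
                   key_value_overrides=None):
    """Merge config layers from lowest to highest precedence (hypothetical helper)."""
    layers = [
        default_file or {},         # auto-loaded ./config.yaml (lowest)
        *explicit_files,            # --configfile PATH values, in the order given
        cli_flags or {},            # explicitly provided Snekmer parameter flags
        key_value_overrides or {},  # -C/--config KEY=VALUE (highest)
    ]
    resolved = {}
    for layer in layers:
        resolved.update(layer)  # a later layer overwrites keys set by earlier ones
    return resolved


config = resolve_config(
    default_file={"k": 8, "alphabet": 2},
    explicit_files=[{"k": 10}, {"alphabet": "hydro"}],
    cli_flags={"k": 12},
    key_value_overrides={"alphabet": "miqs"},
)
print(config)  # {'k': 12, 'alphabet': 'miqs'}
```

In this example the flag value for k wins over both config files, and the KEY=VALUE override for alphabet wins over everything else.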
Config File
To run Snekmer with a config file, create a config.yaml file containing
the desired parameters. A template is included in the repository.
By default, Snekmer auto-loads ./config.yaml (or <DIR>/config.yaml
when using -d/--directory). You can specify one or more explicit config
files with --configfile, or suppress the default auto-load with
--no-default-configfile.
Running Without a Config File
A config file is no longer strictly required. You can use
--no-default-configfile and rely on built-in defaults, providing any
needed overrides via Snekmer parameter flags or -C KEY=VALUE.
Directory Structure
Snekmer assumes that input files are stored in the input directory
(configurable via --input-dir), and automatically creates an output
directory to save all output files. Snekmer also assumes background files,
if any, are stored in input/background. An example of the assumed
directory structure is shown for each execution mode of Snekmer.
Snekmer cluster, model, and search
.
├── config.yaml (optional with --no-default-configfile)
├── input/
│ ├── background/
│ │ ├── X.fasta
│ │ ├── Y.fasta
│ │ └── etc.
│ ├── A.fasta
│ ├── B.fasta
│ └── etc.
├── output/
│ ├── ...
│ └── ...
Snekmer learn
.
├── config.yaml (optional with --no-default-configfile)
├── input/
│ ├── A.fasta
│ ├── B.fasta
│ ├── etc.
│ └── base/ (optional)
│     └── base-kmer-counts.csv
├── annotations/
│ └── annotations.ann
├── output/
│ ├── ...
│ └── ...
Snekmer apply
.
├── config.yaml (optional with --no-default-configfile)
├── input/
│ ├── A.fasta
│ ├── B.fasta
│ └── etc.
├── counts/
│ └── kmer_counts_total.csv
├── confidence/
│ └── global_confidence_scores.csv
├── stats/
│ └── family_summary_stats.csv
├── output/
│ ├── ...
│ └── ...
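Before running apply, it can be useful to confirm that the working directory matches the layout above. The sketch below is a hypothetical pre-flight check, not part of Snekmer; the helper name missing_apply_inputs is an assumption, and the relative paths are copied from the apply tree shown above.

```python
from pathlib import Path

# Relative paths taken from the "Snekmer apply" layout above.
APPLY_REQUIRED_FILES = (
    "counts/kmer_counts_total.csv",
    "confidence/global_confidence_scores.csv",
    "stats/family_summary_stats.csv",
)


def missing_apply_inputs(workdir):
    """Return a list of missing items in an apply working directory (hypothetical helper)."""
    root = Path(workdir)
    # Collect every required handoff file that is not present.
    problems = [rel for rel in APPLY_REQUIRED_FILES if not (root / rel).is_file()]
    # The input directory must exist and contain at least one entry.
    input_dir = root / "input"
    if not input_dir.is_dir() or not any(input_dir.iterdir()):
        problems.append("input/ (no sequence files found)")
    return problems
```

An empty list means the expected files are in place; for example, missing_apply_inputs("apply") could be called just before launching snekmer apply -d apply.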
Alphabets
Snekmer supports several reduced amino acid alphabets for k-mer recoding.
You may pass an integer (0–5), an alphabet name (e.g. hydro), or None to
the --alphabet flag.
| ID | Name | Size | Description |
|---|---|---|---|
| 0 | hydro | 2 | 2-value hydrophobicity alphabet |
| 1 | standard | 7 | “Standard” reduction alphabet |
| 2 | solvacc | 3 | Solvent accessibility alphabet |
| 3 | hydrocharge | 3 | 2-value hydrophobicity with charged residues as a third category |
| 4 | hydrostruct | 3 | 2-value hydrophobicity with structural-breakers as a third category |
| 5 | miqs | 10 | MIQS alphabet |
| None | None | 20 | No reduced alphabet |
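The alphabet size sets an upper bound on how many distinct k-mers can occur: an alphabet with n letters admits at most n^k k-mers of length k. The Python sketch below encodes the table above and computes that bound. The helper names alphabet_size and max_kmers are hypothetical, and the lookup is only an illustration of how an ID, a name, or None could be accepted; it is not Snekmer's own parsing code.

```python
# Alphabet IDs, names, and sizes copied from the table above.
ALPHABETS = {
    0: ("hydro", 2),
    1: ("standard", 7),
    2: ("solvacc", 3),
    3: ("hydrocharge", 3),
    4: ("hydrostruct", 3),
    5: ("miqs", 10),
}
FULL_ALPHABET_SIZE = 20  # "None": no reduced alphabet


def alphabet_size(value):
    """Return the alphabet size for an ID, a name, or None (hypothetical helper)."""
    if value is None or str(value) == "None":
        return FULL_ALPHABET_SIZE
    text = str(value)
    for alphabet_id, (name, size) in ALPHABETS.items():
        if text == str(alphabet_id) or text == name:
            return size
    raise ValueError(f"unknown alphabet: {value!r}")


def max_kmers(alphabet, k):
    """Upper bound on the number of distinct k-mers: size ** k."""
    return alphabet_size(alphabet) ** k


print(max_kmers("solvacc", 8))  # 3 ** 8 = 6561
print(max_kmers(0, 8))          # 2 ** 8 = 256
```

With the default settings shown in this document (k = 8, alphabet 2), the bound is 6561 possible k-mers, compared with 20^8 when no reduced alphabet is used.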
Example: Learn→Apply Without a Config File
The following walkthrough demonstrates a complete learn then apply
workflow using only command line arguments with no config.yaml required.
The --no-default-configfile flag tells Snekmer to skip auto-loading a
config file, so all parameters come from built-in defaults and any explicit
CLI flags.
Step 1: Prepare the learn directory
Create a working directory for the learn step with the expected layout:
mkdir -p learn/input learn/annotations
Copy your training FASTA files and annotation file into place:
cp training_sequences_*.fasta learn/input/
cp annotations.ann learn/annotations/
Your directory should look like:
learn/
├── annotations/
│   └── annotations.ann
└── input/
    ├── training_sequences_1.fasta
    ├── training_sequences_2.fasta
    └── ...
Step 2: Run snekmer learn
snekmer learn \
--no-default-configfile \
--k 8 \
--alphabet 2 \
--input-dir input \
--input-file-exts fasta fna faa fa \
--input-file-regex ".*" \
--no-nested-output \
--no-save-apply-associations \
--conf-weight-modifier 20 \
--selection top_hit \
--threshold Median \
--apply-output snekmer_results.csv \
-d learn
Note
The values shown above match the built-in defaults. In practice, you only
need to pass --no-default-configfile plus whichever parameters you want
to change. For example, a minimal invocation relying entirely on defaults:
snekmer learn --no-default-configfile -d learn
Advanced options such as sequence fragmentation are available via
config.yaml. See Setting up User Configuration (config.yaml) for the full parameter reference.
Step 3: Copy learn outputs into the apply directory
After learn completes, create the apply directory and copy the
handoff files:
mkdir -p apply/input apply/counts apply/confidence apply/stats
cp test_sequences.fasta apply/input/
cp learn/apply_inputs/counts/kmer_counts_total.csv apply/counts/
cp learn/apply_inputs/confidence/global_confidence_scores.csv apply/confidence/
cp learn/apply_inputs/stats/family_summary_stats.csv apply/stats/
Your apply directory should look like:
apply/
├── confidence/
│   └── global_confidence_scores.csv
├── counts/
│   └── kmer_counts_total.csv
├── input/
│   └── test_sequences.fasta
└── stats/
    └── family_summary_stats.csv
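For scripted pipelines, the same staging step can be written in Python. This is a minimal sketch that mirrors the mkdir and cp commands in Step 3; the function name stage_apply_dir is hypothetical, and the apply_inputs paths are taken from those commands.

```python
import shutil
from pathlib import Path

# Handoff files produced by learn, relative to <learn_dir>/apply_inputs/.
HANDOFF_FILES = (
    "counts/kmer_counts_total.csv",
    "confidence/global_confidence_scores.csv",
    "stats/family_summary_stats.csv",
)


def stage_apply_dir(learn_dir, apply_dir, sequence_files):
    """Copy learn outputs and query sequences into an apply layout (hypothetical helper)."""
    learn_root = Path(learn_dir) / "apply_inputs"
    apply_root = Path(apply_dir)
    (apply_root / "input").mkdir(parents=True, exist_ok=True)
    for rel in HANDOFF_FILES:
        target = apply_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(learn_root / rel, target)  # raises FileNotFoundError if learn did not finish
    for seq in sequence_files:
        shutil.copy2(seq, apply_root / "input" / Path(seq).name)
    return apply_root
```

Calling stage_apply_dir("learn", "apply", ["test_sequences.fasta"]) would reproduce the layout shown above, assuming the learn step has already written its outputs.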
Step 4: Run snekmer apply
snekmer apply \
--no-default-configfile \
--k 8 \
--alphabet 2 \
--input-dir input \
--input-file-exts fasta fna faa fa \
--input-file-regex ".*" \
--no-nested-output \
--no-save-apply-associations \
--conf-weight-modifier 20 \
--selection top_hit \
--threshold Median \
--apply-output snekmer_results.csv \
-d apply
Important
Use the same --k and --alphabet values for both learn and
apply. Mismatched encoding parameters will produce incorrect results.
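One way to keep the two steps in sync is to build both command lines from a single set of shared parameters. The sketch below only assembles the argument lists; the helper name build_command is hypothetical, while the flags themselves are the ones used in Steps 2 and 4.

```python
import shlex


def build_command(mode, workdir, k, alphabet):
    """Assemble a snekmer command line as an argument list (hypothetical helper)."""
    return [
        "snekmer", mode,
        "--no-default-configfile",
        "--k", str(k),
        "--alphabet", str(alphabet),
        "-d", workdir,
    ]


# Define the encoding parameters once and reuse them for both modes.
shared = {"k": 8, "alphabet": 2}
learn_cmd = build_command("learn", "learn", **shared)
apply_cmd = build_command("apply", "apply", **shared)

print(shlex.join(learn_cmd))  # snekmer learn --no-default-configfile --k 8 --alphabet 2 -d learn
print(shlex.join(apply_cmd))  # snekmer apply --no-default-configfile --k 8 --alphabet 2 -d apply
```

Each list could then be passed to subprocess.run(cmd, check=True) to execute the step, which stops the script if either command fails.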
Step 5: Inspect results
The final predictions are written to apply/snekmer_results.csv. You can
preview them with:
head apply/snekmer_results.csv
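The same preview can be done from Python. Because the column layout of the results file is not described here, this sketch treats it as a generic CSV with a header row; the helper name preview_csv is hypothetical.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return the header and the first n_rows data rows of a CSV file (hypothetical helper)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # empty list if the file has no rows
        rows = list(islice(reader, n_rows)) # read at most n_rows further rows
    return header, rows
```

For example, header, rows = preview_csv("apply/snekmer_results.csv") returns the column names and up to five rows without loading the whole file.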
Partial Workflow
To execute only part of the workflow, use the --until option.
For instance, to run the workflow only through the k-mer vector generation
step, run:
snekmer {mode} --until vectorize
Snakemake Pass-Through Arguments
The following arguments are passed through directly to Snakemake and are not Snekmer-specific:
- -n, --dry-run, --dryrun
Do not execute anything; display what would be done.
- --configfile PATH [PATH ...]
Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order.
- -C, --config KEY=VALUE [KEY=VALUE ...]
Set or overwrite values in the workflow config object.
- --unlock
Unlock the working directory.
- -U, --until TARGET [TARGET ...]
Run the workflow until the specified rules or files.
- -k, --keepgoing, --keep-going
Continue with independent jobs if a job fails.
- -w, --latency, --latency-wait, --output-wait SECONDS
Wait given seconds for output files to appear after job completion (default: 30).
- -t, --touch
Touch output files instead of running commands.
- -c, --cores N
Use at most N CPU cores/jobs in parallel (default: all available).
- --count N
Number of files to process (limits DAG size).
- --countstart IDX
Starting file index for use with --count (default: 0).
- --verbose
Show additional debug output.
- -q, --quiet [progress|rules|all]
Reduce Snakemake output.
- -d, --directory DIR
Specify working directory.
- -R, --forcerun [TARGET ...]
Force re-execution/creation of the given rules or files.
- --list-code-changes, --lc
List output files for which the rule body changed.
- --list-params-changes, --lp
List output files for which defined params changed.
- --no-default-configfile
Do not auto-load ./config.yaml (or <DIR>/config.yaml with -d).
- --clust PATH [PATH ...]
Path to cluster execution YAML configuration file (e.g., for SLURM).
- -j, --jobs N
Number of simultaneous jobs to submit to the scheduler (default: 1000).
- --scheduler [greedy|ilp]
Specify whether Snakemake uses the greedy or ILP scheduler.
- --scheduler-ilp-solver [SOLVER]
Specify the MILP solver to be used when using the ILP scheduler.
- --scheduler-ilp-solver-path [PATH]
PATH to search for ILP scheduler solver binaries.
All Options (full argparse reference)
Snekmer: A scalable pipeline for protein sequence fingerprinting using amino acid reduction (AAR).
- Modes:
cluster Unsupervised clustering workflow.
model Train supervised models + cross-validation reports.
search Score sequences against trained models.
learn Build annotation-associated k-mer distributions + confidence evaluation.
apply Predict annotations using outputs from learn.
easy Guided front-end that runs learn then apply end-to-end.
- General usage:
snekmer <mode> [snakemake arguments] [snekmer parameter overrides]
- Important:
The “Snakemake arguments” below are passed through to Snakemake (they are not Snekmer-specific). Snekmer parameters can be provided via config.yaml / --configfile, or overridden via the Snekmer parameter flags shown in the section(s) relevant to your mode.
- Defaults:
The defaults shown for Snekmer parameter flags match the template config.yaml defaults. These defaults are applied automatically only when no config file is in use, or when a flag is explicitly provided.
- More help:
- Get help for any subcommand with:
snekmer <mode> -h
- Config precedence:
Default configfile (auto): ./config.yaml (or <DIR>/config.yaml with -d/--directory)
Any explicit --configfile PATH values (in order)
Any Snekmer parameter flags you explicitly provide
Any -C/--config KEY=VALUE overrides (highest)
- Running without a config file:
Use --no-default-configfile (optional) and rely on defaults and/or provide overrides.
usage: snekmer [-h] [--dry-run] [--configfile PATH [PATH ...]]
[-C [KEY=VALUE ...]] [--unlock] [--until TARGET [TARGET ...]]
[--keepgoing] [--latency SECONDS] [--touch] [--cores N]
[--count N] [--countstart IDX] [--verbose]
[--quiet [{progress,rules,all} ...]] [--directory DIR]
[--forcerun [TARGET ...]] [--list-code-changes]
[--list-params-changes] [--scheduler {greedy,ilp}]
[--scheduler-ilp-solver SCHEDULER_ILP_SOLVER]
[--scheduler-ilp-solver-path PATH] [--no-default-configfile]
[--clust PATH [PATH ...]] [-j N] [--k] [--alphabet]
[--input-dir] [--input-file-exts [...]] [--input-file-regex]
[--nested-output | --no-nested-output]
[--score-scaler | --no-score-scaler] [--score-scaler-n]
[--score-labels] [--score-lname] [--cluster-method]
[--cluster-n-clusters] [--cluster-linkage]
[--cluster-distance-threshold]
[--cluster-compute-full-tree | --no-cluster-compute-full-tree]
[--cluster-plots | --no-cluster-plots] [--cluster-min-rep]
[--cluster-max-rep]
[--cluster-save-matrix | --no-cluster-save-matrix]
[--cluster-dist-thresh] [--model-cv] [--model-random-state]
[--model-dir] [--basis-dir] [--score-dir]
[--save-apply-associations | --no-save-apply-associations]
[--conf-weight-modifier] [--fragmentation] [--fragment-version]
[--frag-length] [--min-length] [--fragment-location] [--seed]
[--selection] [--threshold] [--weight-top] [--weight-distance]
[--apply-output] [-v]
{cluster,model,search,learn,apply,easy} ...
Named Arguments
- --nested-output
Enable nested output directory structure: {save_dir}/{alphabet}/{k}.
Default: False
- --no-nested-output
Disable nested output directory structure (flat output layout).
Default: False
- --score-scaler
Enable k-mer score scaling (applies configured scaler to family scores).
Default: True
- --no-score-scaler
Disable k-mer score scaling.
Default: True
- --cluster-compute-full-tree
Compute full tree for hierarchical clustering (agglomerative).
Default: True
- --no-cluster-compute-full-tree
Do not compute full tree for hierarchical clustering (agglomerative).
Default: True
- --cluster-plots
Generate plots illustrating clustering results.
Default: False
- --no-cluster-plots
Do not generate clustering plots.
Default: False
- --cluster-save-matrix
Save distance matrices (BSF). Not recommended for large datasets.
Default: False
- --no-cluster-save-matrix
Do not save distance matrices (BSF).
Default: False
- --save-apply-associations
Save large optional outputs containing all cosine similarity scores (increases storage substantially).
Default: False
- --no-save-apply-associations
Do not save large optional cosine similarity outputs.
Default: False
- -v, --version
Print version and exit.
Snakemake arguments (passed through to Snakemake)
- --dry-run, --dryrun, -n
Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.
Default: False
- --configfile
Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.
- -C, --config
Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).
- --unlock
Unlock the working directory.
Default: False
- --until, -U
Run the workflow until it reaches the specified rules or files.
- --keepgoing, --keep-going, -k
Continue with independent jobs if a job fails.
Default: False
- --latency, -w, --output-wait, --latency-wait
Wait given seconds for output files to appear after job completion (filesystem latency).
Default: 30
- --touch, -t
Touch output files instead of running commands (mark as up-to-date).
Default: False
- --cores, -c
Use at most N CPU cores/jobs in parallel.
Default: 2
- --count
Number of files to process (limits DAG size).
- --countstart
Starting file index (for use with --count).
Default: 0
- --verbose
Show additional debug output.
Default: False
- --quiet, -q
Possible choices: progress, rules, all
Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.
- --directory, -d
Specify working directory (relative paths in the Snakefile use this origin).
- --forcerun, -R
Force re-execution/creation of the given rules or files.
- --list-code-changes, --lc
List output files for which the rule body changed in the Snakefile.
Default: False
- --list-params-changes, --lp
List output files for which defined params changed in the Snakefile.
Default: False
- --scheduler
Possible choices: greedy, ilp
Snakemake scheduler plugin to use.
- --scheduler-ilp-solver
MILP solver to use with the ILP scheduler.
- --scheduler-ilp-solver-path
PATH to search for ILP scheduler solver binaries.
Snekmer configfile behavior
- --no-default-configfile
Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).
Default: False
Snakemake cluster execution (passed through)
- --clust
Path to cluster execution YAML configuration file (e.g., for SLURM).
- -j, --jobs
Number of simultaneous jobs to submit to the scheduler.
Default: 1000
Snekmer parameters (all modes; defaults match config.yaml)
- --k
K-mer length.
Default: 8
- --alphabet
Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.
Default: “2”
- --input-dir
Directory containing input sequence files.
Default: “input”
- --input-file-exts
File extensions to consider valid input sequence files (space-separated).
Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]
- --input-file-regex
Regular expression for parsing family/annotation identifiers from filenames.
Default: “.*”
Snekmer Model and Search Parameters: scoring
- --score-scaler-n
Scaler keyword argument ‘n’ (passed to the k-mer scaler).
Default: 0.25
- --score-labels
If None, uses default k-mer label set for scaler. Otherwise uses provided value (string or JSON).
- --score-lname
Label name (e.g., “family”).
Snekmer Cluster Parameters: clustering
- --cluster-method
Clustering method (options include “kmeans”, “agglomerative”, “correlation”, “density”, “birch”, “optics”, “hdbscan”).
Default: “agglomerative-jaccard”
- --cluster-n-clusters
Number of clusters (int) or ‘None’ (method-dependent).
- --cluster-linkage
Linkage method for agglomerative clustering (e.g. “average”).
Default: “average”
- --cluster-distance-threshold
Distance threshold for agglomerative clustering (method-dependent).
Default: 0.92
- --cluster-min-rep
Minimum repetition threshold for kmers (int) or ‘None’. Kmers below this are discarded.
- --cluster-max-rep
Maximum repetition threshold for kmers (int) or ‘None’. Kmers above this are discarded.
- --cluster-dist-thresh
Distance threshold for BSF matrix.
Default: 100
Snekmer Model Parameters: model training
- --model-cv
Number of cross-validation folds for model evaluation.
Default: 5
- --model-random-state
Random state for model evaluation (int) or ‘None’.
Snekmer Search Parameters: search inputs
- --model-dir
Directory containing model object(s) (.model).
Default: “output/model/”
- --basis-dir
Directory containing k-mer basis set(s) (.kmers).
Default: “output/example-model/”
- --score-dir
Directory containing scoring object(s) (.scorer).
Default: “output/scoring/”
Snekmer Learn and Apply Parameters:
- --conf-weight-modifier
Weighting modifier for updating confidence when adding data to an existing k-mer count matrix.
Default: 20
- --fragmentation
Enable training-data fragmentation (default False).
Default: False
- --fragment-version
Fragment length interpretation: ‘absolute’ or ‘percent’.
Default: “absolute”
- --frag-length
Fragment length (units depend on --fragment-version).
Default: 50
- --min-length
Minimum fragment length to retain; shorter fragments are discarded.
Default: 50
- --fragment-location
Fragment location: ‘start’, ‘end’, or ‘random’.
Default: “random”
- --seed
Random seed for reproducible fragmentation.
Default: 999
- --selection
Possible choices: top_hit, greatest_distance, combined_distance
Annotation selection method.
Default: “top_hit”
- --threshold
Family-specific threshold used for prediction filtering (e.g. ‘Median’, ‘Mean’, ‘90th Percentile’, or ‘None’).
Default: “Median”
- --weight-top
Weight for ‘top_hit’ when selection method is ‘combined_distance’.
Default: 0.7
- --weight-distance
Weight for ‘greatest_distance’ when selection method is ‘combined_distance’.
Default: 0.3
- --apply-output
Output filename for apply results in single-file format.
Default: “snekmer_results.csv”
mode
Snekmer mode (cluster, model, search, learn, apply, easy).
- mode
Possible choices: cluster, model, search, learn, apply, easy
Sub-commands
cluster
Unsupervised clustering workflow.
snekmer cluster [options]
Named Arguments
- --nested-output
Enable nested output directory structure: {save_dir}/{alphabet}/{k}.
Default: False
- --no-nested-output
Disable nested output directory structure (flat output layout).
Default: False
- --cluster-compute-full-tree
Compute full tree for hierarchical clustering (agglomerative).
Default: True
- --no-cluster-compute-full-tree
Do not compute full tree for hierarchical clustering (agglomerative).
Default: True
- --cluster-plots
Generate plots illustrating clustering results.
Default: False
- --no-cluster-plots
Do not generate clustering plots.
Default: False
- --cluster-save-matrix
Save distance matrices (BSF). Not recommended for large datasets.
Default: False
- --no-cluster-save-matrix
Do not save distance matrices (BSF).
Default: False
Snakemake arguments (passed through to Snakemake)
- --dry-run, --dryrun, -n
Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.
Default: False
- --configfile
Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.
- -C, --config
Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).
- --unlock
Unlock the working directory.
Default: False
- --until, -U
Run the workflow until it reaches the specified rules or files.
- --keepgoing, --keep-going, -k
Continue with independent jobs if a job fails.
Default: False
- --latency, -w, --output-wait, --latency-wait
Wait given seconds for output files to appear after job completion (filesystem latency).
Default: 30
- --touch, -t
Touch output files instead of running commands (mark as up-to-date).
Default: False
- --cores, -c
Use at most N CPU cores/jobs in parallel.
Default: 2
- --count
Number of files to process (limits DAG size).
- --countstart
Starting file index (for use with --count).
Default: 0
- --verbose
Show additional debug output.
Default: False
- --quiet, -q
Possible choices: progress, rules, all
Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.
- --directory, -d
Specify working directory (relative paths in the Snakefile use this origin).
- --forcerun, -R
Force re-execution/creation of the given rules or files.
- --list-code-changes, --lc
List output files for which the rule body changed in the Snakefile.
Default: False
- --list-params-changes, --lp
List output files for which defined params changed in the Snakefile.
Default: False
- --scheduler
Possible choices: greedy, ilp
Snakemake scheduler plugin to use.
- --scheduler-ilp-solver
MILP solver to use with the ILP scheduler.
- --scheduler-ilp-solver-path
PATH to search for ILP scheduler solver binaries.
Snekmer configfile behavior
- --no-default-configfile
Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).
Default: False
Snakemake cluster execution (passed through)
- --clust
Path to cluster execution YAML configuration file (e.g., for SLURM).
- -j, --jobs
Number of simultaneous jobs to submit to the scheduler.
Default: 1000
Snekmer parameters (all modes; defaults match config.yaml)
- --k
K-mer length.
Default: 8
- --alphabet
Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.
Default: “2”
- --input-dir
Directory containing input sequence files.
Default: “input”
- --input-file-exts
File extensions to consider valid input sequence files (space-separated).
Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]
- --input-file-regex
Regular expression for parsing family/annotation identifiers from filenames.
Default: “.*”
Snekmer Cluster Parameters: clustering
- --cluster-method
Clustering method (options include “kmeans”, “agglomerative”, “correlation”, “density”, “birch”, “optics”, “hdbscan”).
Default: “agglomerative-jaccard”
- --cluster-n-clusters
Number of clusters (int) or ‘None’ (method-dependent).
- --cluster-linkage
Linkage method for agglomerative clustering (e.g. “average”).
Default: “average”
- --cluster-distance-threshold
Distance threshold for agglomerative clustering (method-dependent).
Default: 0.92
- --cluster-min-rep
Minimum repetition threshold for kmers (int) or ‘None’. Kmers below this are discarded.
- --cluster-max-rep
Maximum repetition threshold for kmers (int) or ‘None’. Kmers above this are discarded.
- --cluster-dist-thresh
Distance threshold for BSF matrix.
Default: 100
model
Train supervised models + cross-validation reports.
snekmer model [options]
Named Arguments
- --nested-output
Enable nested output directory structure: {save_dir}/{alphabet}/{k}.
Default: False
- --no-nested-output
Disable nested output directory structure (flat output layout).
Default: False
- --score-scaler
Enable k-mer score scaling (applies configured scaler to family scores).
Default: True
- --no-score-scaler
Disable k-mer score scaling.
Default: True
Snakemake arguments (passed through to Snakemake)
- --dry-run, --dryrun, -n
Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.
Default: False
- --configfile
Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.
- -C, --config
Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).
- --unlock
Unlock the working directory.
Default: False
- --until, -U
Run the workflow until it reaches the specified rules or files.
- --keepgoing, --keep-going, -k
Continue with independent jobs if a job fails.
Default: False
- --latency, -w, --output-wait, --latency-wait
Wait given seconds for output files to appear after job completion (filesystem latency).
Default: 30
- --touch, -t
Touch output files instead of running commands (mark as up-to-date).
Default: False
- --cores, -c
Use at most N CPU cores/jobs in parallel.
Default: 2
- --count
Number of files to process (limits DAG size).
- --countstart
Starting file index (for use with --count).
Default: 0
- --verbose
Show additional debug output.
Default: False
- --quiet, -q
Possible choices: progress, rules, all
Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.
- --directory, -d
Specify working directory (relative paths in the Snakefile use this origin).
- --forcerun, -R
Force re-execution/creation of the given rules or files.
- --list-code-changes, --lc
List output files for which the rule body changed in the Snakefile.
Default: False
- --list-params-changes, --lp
List output files for which defined params changed in the Snakefile.
Default: False
- --scheduler
Possible choices: greedy, ilp
Snakemake scheduler plugin to use.
- --scheduler-ilp-solver
MILP solver to use with the ILP scheduler.
- --scheduler-ilp-solver-path
PATH to search for ILP scheduler solver binaries.
Snekmer configfile behavior
- --no-default-configfile
Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).
Default: False
Snakemake cluster execution (passed through)
- --clust
Path to cluster execution YAML configuration file (e.g., for SLURM).
- -j, --jobs
Number of simultaneous jobs to submit to the scheduler.
Default: 1000
Snekmer parameters (all modes; defaults match config.yaml)
- --k
K-mer length.
Default: 8
- --alphabet
Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.
Default: “2”
- --input-dir
Directory containing input sequence files.
Default: “input”
- --input-file-exts
File extensions to consider valid input sequence files (space-separated).
Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]
- --input-file-regex
Regular expression for parsing family/annotation identifiers from filenames.
Default: “.*”
Snekmer Model and Search Parameters: scoring
- --score-scaler-n
Scaler keyword argument ‘n’ (passed to the k-mer scaler).
Default: 0.25
- --score-labels
If None, uses default k-mer label set for scaler. Otherwise uses provided value (string or JSON).
- --score-lname
Label name (e.g., “family”).
Snekmer Model Parameters: model training
- --model-cv
Number of cross-validation folds for model evaluation.
Default: 5
- --model-random-state
Random state for model evaluation (int) or ‘None’.
search
Score sequences against trained models.
snekmer search [options]
Named Arguments
- --nested-output
Enable nested output directory structure: {save_dir}/{alphabet}/{k}.
Default: False
- --no-nested-output
Disable nested output directory structure (flat output layout).
Default: False
- --score-scaler
Enable k-mer score scaling (applies configured scaler to family scores).
Default: True
- --no-score-scaler
Disable k-mer score scaling.
Default: True
Snakemake arguments (passed through to Snakemake)
- --dry-run, --dryrun, -n
Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.
Default: False
- --configfile
Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.
- -C, --config
Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).
- --unlock
Unlock the working directory.
Default: False
- --until, -U
Run the workflow until it reaches the specified rules or files.
- --keepgoing, --keep-going, -k
Continue with independent jobs if a job fails.
Default: False
- --latency, -w, --output-wait, --latency-wait
Wait given seconds for output files to appear after job completion (filesystem latency).
Default: 30
- --touch, -t
Touch output files instead of running commands (mark as up-to-date).
Default: False
- --cores, -c
Use at most N CPU cores/jobs in parallel.
Default: 2
- --count
Number of files to process (limits DAG size).
- --countstart
Starting file index (for use with --count).
Default: 0
- --verbose
Show additional debug output.
Default: False
- --quiet, -q
Possible choices: progress, rules, all
Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.
- --directory, -d
Specify working directory (relative paths in the Snakefile use this origin).
- --forcerun, -R
Force re-execution/creation of the given rules or files.
- --list-code-changes, --lc
List output files for which the rule body changed in the Snakefile.
Default: False
- --list-params-changes, --lp
List output files for which defined params changed in the Snakefile.
Default: False
- --scheduler
Possible choices: greedy, ilp
Snakemake scheduler plugin to use.
- --scheduler-ilp-solver
MILP solver to use with the ILP scheduler.
- --scheduler-ilp-solver-path
PATH to search for ILP scheduler solver binaries.
Snekmer configfile behavior
- --no-default-configfile
Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).
Default: False
Snakemake cluster execution (passed through)
- --clust
Path to cluster execution YAML configuration file (e.g., for SLURM).
- -j, --jobs
Number of simultaneous jobs to submit to the scheduler.
Default: 1000
Snekmer parameters (all modes; defaults match config.yaml)
- --k
K-mer length.
Default: 8
- --alphabet
Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.
Default: “2”
- --input-dir
Directory containing input sequence files.
Default: “input”
- --input-file-exts
File extensions to consider valid input sequence files (space-separated).
Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]
- --input-file-regex
Regular expression for parsing family/annotation identifiers from filenames.
Default: “.*”
Snekmer Model and Search Parameters: scoring
- --score-scaler-n
Scaler keyword argument ‘n’ (passed to the k-mer scaler).
Default: 0.25
- --score-labels
If None, uses default k-mer label set for scaler. Otherwise uses provided value (string or JSON).
- --score-lname
Label name (e.g., “family”).
Snekmer Search Parameters: search inputs
- --model-dir
Directory containing model object(s) (.model).
Default: “output/model/”
- --basis-dir
Directory containing k-mer basis set(s) (.kmers).
Default: “output/example-model/”
- --score-dir
Directory containing scoring object(s) (.scorer).
Default: “output/scoring/”
learn
Build annotation-associated k-mer distributions + confidence evaluation.
snekmer learn [options]
Named Arguments
- --nested-output
Enable nested output directory structure: {save_dir}/{alphabet}/{k}.
Default: False
- --no-nested-output
Disable nested output directory structure (flat output layout).
Default: False
- --save-apply-associations
Save large optional outputs containing all cosine similarity scores (increases storage substantially).
Default: False
- --no-save-apply-associations
Do not save large optional cosine similarity outputs.
Default: False
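The nested layout described above can be pictured with a short sketch. The helper name `output_dir` is hypothetical, and whether Snekmer fills `{alphabet}` with the numeric code or the alphabet name is an assumption here; the helper simply uses whatever value is passed in.

```python
from pathlib import Path


def output_dir(save_dir, alphabet, k, nested=False):
    """Return the directory a run would write to (illustrative only).

    Flat layout:   {save_dir}
    Nested layout: {save_dir}/{alphabet}/{k}
    """
    base = Path(save_dir)
    if nested:
        # Assumption: alphabet and k are used verbatim as directory names.
        return base / str(alphabet) / str(k)
    return base


print(output_dir("output", 2, 8, nested=True).as_posix())
print(output_dir("output", 2, 8).as_posix())
```

A nested layout keeps runs with different k or alphabet settings from overwriting each other under one save directory.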
Snakemake arguments (passed through to Snakemake)
- --dry-run, --dryrun, -n
Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.
Default: False
- --configfile
Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.
- -C, --config
Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).
- --unlock
Unlock the working directory.
Default: False
- --until, -U
Run the workflow until it reaches the specified rules or files.
- --keepgoing, --keep-going, -k
Continue with independent jobs if a job fails.
Default: False
- --latency, -w, --output-wait, --latency-wait
Wait given seconds for output files to appear after job completion (filesystem latency).
Default: 30
- --touch, -t
Touch output files instead of running commands (mark as up-to-date).
Default: False
- --cores, -c
Use at most N CPU cores/jobs in parallel.
Default: 2
- --count
Number of files to process (limits DAG size).
- --countstart
Starting file index (for use with --count).
Default: 0
- --verbose
Show additional debug output.
Default: False
- --quiet, -q
Possible choices: progress, rules, all
Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.
- --directory, -d
Specify working directory (relative paths in the Snakefile use this origin).
- --forcerun, -R
Force re-execution/creation of the given rules or files.
- --list-code-changes, --lc
List output files for which the rule body changed in the Snakefile.
Default: False
- --list-params-changes, --lp
List output files for which defined params changed in the Snakefile.
Default: False
- --scheduler
Possible choices: greedy, ilp
Snakemake scheduler plugin to use.
- --scheduler-ilp-solver
MILP solver to use with the ILP scheduler.
- --scheduler-ilp-solver-path
PATH to search for ILP scheduler solver binaries.
Snekmer configfile behavior
- --no-default-configfile
Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).
Default: False
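As a rough illustration of the auto-load rule above, the sketch below returns the path that would be treated as the default configfile. The function `default_configfile` is hypothetical and is not part of Snekmer; it only restates the documented behavior of `--no-default-configfile` and `-d`.

```python
from pathlib import Path


def default_configfile(directory=None, no_default_configfile=False):
    """Return the candidate default config path, or None if disabled.

    directory             -- value given to -d/--directory, if any
    no_default_configfile -- True when --no-default-configfile is passed
    """
    if no_default_configfile:
        return None
    if directory:
        # With -d, the default is looked up inside the working directory.
        return Path(directory) / "config.yaml"
    return Path("config.yaml")


print(default_configfile())
print(default_configfile("runs/experiment1"))
```

The sketch does not check whether the file exists; it only shows which location is considered.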
Snakemake cluster execution (passed through)
- --clust
Path to cluster execution YAML configuration file (e.g., for SLURM).
- -j, --jobs
Number of simultaneous jobs to submit to the scheduler.
Default: 1000
Snekmer parameters (all modes; defaults match config.yaml)
- --k
K-mer length.
Default: 8
- --alphabet
Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.
Default: “2”
- --input-dir
Directory containing input sequence files.
Default: “input”
- --input-file-exts
File extensions to consider valid input sequence files (space-separated).
Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]
- --input-file-regex
Regular expression for parsing family/annotation identifiers from filenames.
Default: “.*”
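To see how `--input-file-exts` and `--input-file-regex` could interact, here is a minimal sketch. It assumes the regular expression is searched against the filename without its extension and that the whole match is taken as the family identifier; Snekmer's actual matching rules may differ. The function `select_inputs` and the example filenames are hypothetical.

```python
import re
from pathlib import Path

# Default extensions as listed for --input-file-exts.
DEFAULT_EXTS = ("fasta", "fna", "faa", "fa")


def select_inputs(filenames, exts=DEFAULT_EXTS, regex=".*"):
    """Map each accepted input filename to the identifier parsed from it."""
    pattern = re.compile(regex)
    selected = {}
    for name in filenames:
        path = Path(name)
        if path.suffix.lstrip(".") not in exts:
            continue  # extension not listed as a valid input type
        match = pattern.search(path.stem)
        if match:
            # Assumption: the full match is the family/annotation identifier.
            selected[path.name] = match.group(0)
    return selected


files = ["input/PF00001_seqs.fasta", "input/PF00002_seqs.faa", "input/notes.txt"]
print(select_inputs(files))
print(select_inputs(files, regex=r"PF\d+"))
```

With the default pattern `.*` the whole stem is kept, so a narrower pattern is only needed when filenames carry extra text around the identifier.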
Snekmer Learn and Apply Parameters:
- --conf-weight-modifier
Weighting modifier for updating confidence when adding data to an existing k-mer count matrix.
Default: 20
- --fragmentation
Enable training-data fragmentation.
Default: False
- --fragment-version
Fragment length interpretation: ‘absolute’ or ‘percent’.
Default: “absolute”
- --frag-length
Fragment length (units depend on --fragment-version).
Default: 50
- --min-length
Minimum fragment length to retain; shorter fragments are discarded.
Default: 50
- --fragment-location
Fragment location: ‘start’, ‘end’, or ‘random’.
Default: “random”
- --seed
Random seed for reproducible fragmentation.
Default: 999
- --selection
Possible choices: top_hit, greatest_distance, combined_distance
Annotation selection method.
Default: “top_hit”
- --threshold
Family-specific threshold used for prediction filtering (e.g. ‘Median’, ‘Mean’, ‘90th Percentile’, or ‘None’).
Default: “Median”
- --weight-top
Weight for ‘top_hit’ when selection method is ‘combined_distance’.
Default: 0.7
- --weight-distance
Weight for ‘greatest_distance’ when selection method is ‘combined_distance’.
Default: 0.3
- --apply-output
Output filename for apply results in single-file format.
Default: “snekmer_results.csv”
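The fragmentation options above can be read together as in the following sketch. It is not Snekmer's implementation: the function `fragment` is hypothetical, 'percent' is assumed to mean a percentage of the sequence length, and one fragment per sequence is assumed.

```python
import random


def fragment(seq, frag_length=50, version="absolute", location="random",
             min_length=50, seed=999):
    """Return one fragment of seq, or None if it is shorter than min_length."""
    if version == "absolute":
        length = int(frag_length)
    elif version == "percent":
        # Assumption: frag_length is a percentage of the full sequence.
        length = int(len(seq) * frag_length / 100)
    else:
        raise ValueError(f"unknown fragment version: {version!r}")

    length = min(length, len(seq))
    if length < min_length:
        return None  # fragments below the minimum length are discarded

    if location == "start":
        start = 0
    elif location == "end":
        start = len(seq) - length
    elif location == "random":
        # A fixed seed makes the random start position reproducible.
        start = random.Random(seed).randint(0, len(seq) - length)
    else:
        raise ValueError(f"unknown fragment location: {location!r}")
    return seq[start:start + length]


sequence = "ACDEFGHIKL" * 10  # toy 100-residue sequence
print(fragment(sequence, location="start"))
print(fragment(sequence, frag_length=60, version="percent", location="end"))
```

Because the random generator is seeded per call, repeated calls with the same arguments return the same fragment.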
apply
Predict annotations using outputs from learn.
snekmer apply [options]
Named Arguments
- --nested-output
Enable nested output directory structure: {save_dir}/{alphabet}/{k}.
Default: False
- --no-nested-output
Disable nested output directory structure (flat output layout).
Default: False
- --save-apply-associations
Save large optional outputs containing all cosine similarity scores (increases storage substantially).
Default: False
- --no-save-apply-associations
Do not save large optional cosine similarity outputs.
Default: False
Snakemake arguments (passed through to Snakemake)
- --dry-run, --dryrun, -n
Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.
Default: False
- --configfile
Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.
- -C, --config
Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).
- --unlock
Unlock the working directory.
Default: False
- --until, -U
Run the workflow until it reaches the specified rules or files.
- --keepgoing, --keep-going, -k
Continue with independent jobs if a job fails.
Default: False
- --latency, -w, --output-wait, --latency-wait
Wait given seconds for output files to appear after job completion (filesystem latency).
Default: 30
- --touch, -t
Touch output files instead of running commands (mark as up-to-date).
Default: False
- --cores, -c
Use at most N CPU cores/jobs in parallel.
Default: 2
- --count
Number of files to process (limits DAG size).
- --countstart
Starting file index (for use with --count).
Default: 0
- --verbose
Show additional debug output.
Default: False
- --quiet, -q
Possible choices: progress, rules, all
Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.
- --directory, -d
Specify working directory (relative paths in the Snakefile use this origin).
- --forcerun, -R
Force re-execution/creation of the given rules or files.
- --list-code-changes, --lc
List output files for which the rule body changed in the Snakefile.
Default: False
- --list-params-changes, --lp
List output files for which defined params changed in the Snakefile.
Default: False
- --scheduler
Possible choices: greedy, ilp
Snakemake scheduler plugin to use.
- --scheduler-ilp-solver
MILP solver to use with the ILP scheduler.
- --scheduler-ilp-solver-path
PATH to search for ILP scheduler solver binaries.
Snekmer configfile behavior
- --no-default-configfile
Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).
Default: False
Snakemake cluster execution (passed through)
- --clust
Path to cluster execution YAML configuration file (e.g., for SLURM).
- -j, --jobs
Number of simultaneous jobs to submit to the scheduler.
Default: 1000
Snekmer parameters (all modes; defaults match config.yaml)
- --k
K-mer length.
Default: 8
- --alphabet
Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.
Default: “2”
- --input-dir
Directory containing input sequence files.
Default: “input”
- --input-file-exts
File extensions to consider valid input sequence files (space-separated).
Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]
- --input-file-regex
Regular expression for parsing family/annotation identifiers from filenames.
Default: “.*”
Snekmer Learn and Apply Parameters:
- --conf-weight-modifier
Weighting modifier for updating confidence when adding data to an existing k-mer count matrix.
Default: 20
- --fragmentation
Enable training-data fragmentation.
Default: False
- --fragment-version
Fragment length interpretation: ‘absolute’ or ‘percent’.
Default: “absolute”
- --frag-length
Fragment length (units depend on --fragment-version).
Default: 50
- --min-length
Minimum fragment length to retain; shorter fragments are discarded.
Default: 50
- --fragment-location
Fragment location: ‘start’, ‘end’, or ‘random’.
Default: “random”
- --seed
Random seed for reproducible fragmentation.
Default: 999
- --selection
Possible choices: top_hit, greatest_distance, combined_distance
Annotation selection method.
Default: “top_hit”
- --threshold
Family-specific threshold used for prediction filtering (e.g. ‘Median’, ‘Mean’, ‘90th Percentile’, or ‘None’).
Default: “Median”
- --weight-top
Weight for ‘top_hit’ when selection method is ‘combined_distance’.
Default: 0.7
- --weight-distance
Weight for ‘greatest_distance’ when selection method is ‘combined_distance’.
Default: 0.3
- --apply-output
Output filename for apply results in single-file format.
Default: “snekmer_results.csv”
easy
Guided front-end that runs learn then apply end-to-end.
Prompts for training sequences, query sequences, and annotation style, then builds a self-contained workspace and runs both pipeline steps. All prompts can be skipped by supplying the corresponding flags.
snekmer easy [options]
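For scripted runs, a non-interactive invocation can be assembled from the flags documented in this section. The sketch below only builds and prints the command; the helper `easy_command` and the file names are hypothetical, and it is an assumption that supplying these flags is enough to skip every prompt.

```python
import shlex


def easy_command(train, query, output, ann=None, k=8, alphabet="2",
                 cores=2, dry_run=False):
    """Build an argument list for a non-interactive `snekmer easy` run."""
    cmd = ["snekmer", "easy",
           "--train", train, "--query", query, "--output", output]
    # Annotation: use an existing .ann file, or derive one from FASTA headers.
    cmd += ["--ann", ann] if ann else ["--create-ann"]
    cmd += ["--k", str(k), "--alphabet", str(alphabet), "--cores", str(cores)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = easy_command("train.fasta", "query.fasta", "workspace", dry_run=True)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once Snekmer is installed; starting with `dry_run=True` shows what would run without executing it.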
Input / output
- --train
Path to training sequences (FASTA file or directory of FASTA files). If omitted, the wizard will prompt for it.
- --query
Path to query sequences to annotate (FASTA file or directory). If omitted, the wizard will prompt for it.
- --output
Output directory for the workspace. If omitted, the wizard will prompt.
Annotation (choose one)
- --ann
Path to an existing annotation file (.ann). Format: tab-separated with columns ‘id’ and ‘family’.
- --create-ann
Generate annotations from training FASTA headers. Requires headers in the format: >db|FAMILY_LABEL|seqid description (the field between the first pair of | | becomes the family label).
Default: False
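The header convention required by `--create-ann` and the tab-separated `.ann` layout can be illustrated as follows. This is a sketch, not Snekmer's parser: the helper names are hypothetical, and using the first token after the second `|` as the value of the 'id' column is an assumption.

```python
import csv
import io


def family_from_header(header):
    """Split '>db|FAMILY_LABEL|seqid description' into (seqid, family)."""
    parts = header.lstrip(">").split("|")
    if len(parts) < 3 or not parts[1] or not parts[2].split():
        raise ValueError(f"header not in db|FAMILY|seqid format: {header!r}")
    seq_id = parts[2].split()[0]  # drop the free-text description
    return seq_id, parts[1]


def ann_text(headers):
    """Render a tab-separated annotation table with columns id and family."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "family"])
    for header in headers:
        writer.writerow(family_from_header(header))
    return buffer.getvalue()


print(ann_text([">sp|PF00001|P12345 example protein",
                ">sp|PF00002|Q67890 another protein"]))
```

Headers that do not follow the pattern raise an error here, which makes a malformed training file visible before any annotation is written.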
K-mer parameters
- --k
K-mer length.
Default: 8
- --alphabet
Reduced alphabet encoding (0–5, alphabet name, or ‘None’). 2 = solvacc (3-letter). See alphabets list below.
Default: “2”
Learn / apply options
- --selection
Possible choices: top_hit, greatest_distance, combined_distance
Annotation selection method.
Default: “top_hit”
- --threshold
Family-specific score threshold for prediction filtering. Options: ‘Median’, ‘Mean’, ‘90th Percentile’, ‘None’.
Default: “Median”
- --apply-output
Output filename for apply results.
Default: “snekmer_results.csv”
Fragmentation (advanced)
- --fragmentation
Split sequences into fragments before kmerization.
Default: False
- --frag-length
Fragment length in residues (default: 50).
- --min-length
Minimum sequence length to fragment (default: 50).
- --fragment-version
Fragmentation version (default: absolute).
- --fragment-location
Fragment location method (default: random).
- --seed
Random seed for fragmentation (default: 999).
Snakemake options
- --cores, -c
CPU cores to use.
Default: 2
- --dry-run, -n
Show what would be done without executing.
Default: False
- --verbose
Show additional Snakemake debug output.
Default: False
- --quiet, -q
Possible choices: progress, rules, all
Reduce Snakemake output.
Miscellaneous
- --copy-files
Copy input files into the workspace instead of symlinking them (useful when the workspace will be moved or shared).
Default: False
Alphabets (k-mer recoding):
- 0: hydro (size 2), 2-value hydrophobicity alphabet
- 1: standard (size 7), “Standard” reduction alphabet
- 2: solvacc (size 3), solvent accessibility alphabet
- 3: hydrocharge (size 3), 2-value hydrophobicity with charged residues as a third category
- 4: hydrostruct (size 3), 2-value hydrophobicity with structural-breakers as a third category
- 5: miqs (size 10), MIQS alphabet
- None: None (size 20), no reduced alphabet
You may pass an integer (0–5), the alphabet name (e.g. ‘hydro’), or ‘None’.
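The accepted forms of `--alphabet` can be summarized in a small lookup. The table below copies the names and sizes from the list above; the helpers `resolve_alphabet` and `kmer_space` are hypothetical and are not Snekmer functions. The second helper uses the fact that an alphabet of size n has n^k possible k-mers.

```python
# Numeric code -> (name, alphabet size), as listed above.
ALPHABETS = {
    0: ("hydro", 2),
    1: ("standard", 7),
    2: ("solvacc", 3),
    3: ("hydrocharge", 3),
    4: ("hydrostruct", 3),
    5: ("miqs", 10),
}


def resolve_alphabet(value):
    """Accept an integer 0-5, an alphabet name, or 'None'."""
    text = str(value).strip()
    if text == "None":
        return ("None", 20)  # no reduction: the 20 standard amino acids
    if text.isdigit():
        if int(text) in ALPHABETS:
            return ALPHABETS[int(text)]
    else:
        for name, size in ALPHABETS.values():
            if name == text.lower():
                return (name, size)
    raise ValueError(f"unknown alphabet: {value!r}")


def kmer_space(value, k=8):
    """Number of distinct k-mers possible in the chosen alphabet."""
    _, size = resolve_alphabet(value)
    return size ** k


print(resolve_alphabet("2"), kmer_space("2", k=8))
print(resolve_alphabet("None"), kmer_space("None", k=8))
```

Comparing the two printed sizes shows how strongly a reduced alphabet shrinks the k-mer space at the default k of 8.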