Command Line Interface

To run any Snekmer operation mode, call:

snekmer {mode}

where {mode} is one of easy, learn, apply, cluster, model, or search.

easy is the recommended entry point for new users. It runs the full Learn/Apply pipeline with a single command and no manual directory setup. See the Quick Start to get started immediately.

General usage follows the pattern:

snekmer <mode> [snakemake arguments] [snekmer parameter overrides]

Snakemake arguments are passed through directly to Snakemake (they are not Snekmer-specific). Snekmer parameters can be provided via config.yaml / --configfile, or overridden via Snekmer parameter flags on the command line.

For an overview of Snekmer usage, see the help command (snekmer --help).

$ snekmer --help
Snekmer 1.4.1 — protein sequence fingerprinting via amino acid reduction.

Usage:
  snekmer <mode> [options]
  snekmer <mode> --help

Modes:
  cluster            Unsupervised clustering workflow.
  model              Train supervised models + cross-validation reports.
  search             Score sequences against trained models.
  learn              Build annotation-associated k-mer distributions + confidence evaluation.
  apply              Predict annotations using outputs from learn.
  easy               Guided front-end that runs learn then apply end-to-end.

Global options (accepted by all modes):
  --k N           K-mer length (default: 8).
  --alphabet N    Reduced alphabet encoding 0-5 or name (default: 2 = solvacc).
  --cores N       CPU cores to use (default: all).
  --dry-run       Show what would be done without executing.
  --configfile    Path(s) to YAML/JSON config file(s).
  -v, --version   Print version and exit.
  -h, --help      Show this help message and exit.

Run 'snekmer <mode> --help' for full options for a specific mode.

Tailored references for the individual operation modes can be accessed via snekmer {mode} --help. Each subcommand includes only the Snekmer parameter sections relevant to that mode.

Configuration

Config Precedence

Snekmer resolves configuration using the following precedence order (lowest to highest):

  1. Default configfile (auto): ./config.yaml (or <DIR>/config.yaml when using -d/--directory).

  2. Explicit configfiles: Any --configfile PATH values, applied in the order given.

  3. Snekmer parameter flags: Any Snekmer-specific flags you explicitly provide on the command line (e.g. --k 10, --alphabet hydro).

  4. Key=Value overrides: Any -C/--config KEY=VALUE overrides (highest precedence).

The defaults shown for Snekmer parameter flags match the template config.yaml defaults. These defaults are applied automatically only when no config file is in use, or when a flag is explicitly provided on the command line.
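The precedence order above can be pictured as successive dictionary updates, applied from lowest to highest precedence. The following is a minimal sketch, assuming each layer has already been parsed into a Python dict. The function name resolve_config and the keys k, alphabet, and input_dir are hypothetical and are not taken from Snekmer's source.

```python
def resolve_config(default_file, explicit_files, cli_flags, key_value_overrides):
    """Merge config layers from lowest to highest precedence (illustrative only)."""
    resolved = {}
    # Order mirrors the list above: default configfile, explicit configfiles
    # (in the order given), Snekmer parameter flags, then -C KEY=VALUE overrides.
    layers = [default_file, *explicit_files, cli_flags, key_value_overrides]
    for layer in layers:
        resolved.update(layer)  # later layers overwrite earlier ones
    return resolved


# Hypothetical values for each layer.
resolved = resolve_config(
    default_file={"k": 8, "alphabet": 2, "input_dir": "input"},
    explicit_files=[{"k": 10}, {"alphabet": "hydro"}],
    cli_flags={"k": 12},
    key_value_overrides={"input_dir": "sequences"},
)
print(resolved)
```

In this sketch the command-line flag wins for k, the second explicit config file wins for alphabet, and the KEY=VALUE override wins for input_dir.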

Config File

To run Snekmer with a config file, create a config.yaml file containing desired parameters. A template is included in the repository.

By default, Snekmer auto-loads ./config.yaml (or <DIR>/config.yaml when using -d/--directory). You can specify one or more explicit config files with --configfile, or suppress the default auto-load with --no-default-configfile.

Running Without a Config File

A config file is no longer strictly required. You can use --no-default-configfile and rely on built-in defaults, providing any needed overrides via Snekmer parameter flags or -C KEY=VALUE.
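When a config-free run is driven from a script, it can help to assemble the command from a dictionary of overrides. The sketch below only builds the argument list and does not execute anything. build_snekmer_command is a hypothetical helper, and any key names passed via -C are assumptions. The flags themselves are the ones documented on this page.

```python
import shlex


def build_snekmer_command(mode, workdir, flags=None, config_overrides=None):
    """Assemble an argv list for a config-free Snekmer run (nothing is executed)."""
    argv = ["snekmer", mode, "--no-default-configfile", "-d", workdir]
    # Snekmer parameter flags, e.g. {"k": 10} becomes "--k 10".
    for name, value in (flags or {}).items():
        argv += [f"--{name}", str(value)]
    # -C accepts one or more KEY=VALUE pairs.
    if config_overrides:
        argv.append("-C")
        argv += [f"{key}={value}" for key, value in config_overrides.items()]
    return argv


argv = build_snekmer_command("learn", "learn", flags={"k": 10, "alphabet": "hydro"})
command_line = shlex.join(argv)
print(command_line)
```

On a machine where Snekmer is installed, the resulting list could be passed to subprocess.run(argv, check=True).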

Directory Structure

Snekmer assumes that input files are stored in the input directory (configurable via --input-dir), and automatically creates an output directory to save all output files. Snekmer also assumes background files, if any, are stored in input/background. An example of the assumed directory structure is shown for each execution mode of Snekmer.

Snekmer cluster, model, and search

.
├── config.yaml          (optional with --no-default-configfile)
├── input/
│   ├── background/
│   │   ├── X.fasta
│   │   ├── Y.fasta
│   │   └── etc.
│   ├── A.fasta
│   ├── B.fasta
│   └── etc.
├── output/
│   ├── ...
│   └── ...

Snekmer learn

.
├── config.yaml          (optional with --no-default-configfile)
├── input/
│   ├── A.fasta
│   ├── B.fasta
│   ├── etc.
│   └── base/            (optional)
│       └── base-kmer-counts.csv
├── annotations/
│   └── annotations.ann
├── output/
│   ├── ...
│   └── ...

Snekmer apply

.
├── config.yaml          (optional with --no-default-configfile)
├── input/
│   ├── A.fasta
│   ├── B.fasta
│   └── etc.
├── counts/
│   └── kmer_counts_total.csv
├── confidence/
│   └── global_confidence_scores.csv
├── stats/
│   └── family_summary_stats.csv
├── output/
│   ├── ...
│   └── ...

Alphabets

Snekmer supports several reduced amino acid alphabets for k-mer recoding. Pass an integer ID (0-5), an alphabet name (e.g. hydro), or None to the --alphabet flag.

ID    Name         Size  Description
0     hydro        2     2-value hydrophobicity alphabet
1     standard     7     “Standard” reduction alphabet
2     solvacc      3     Solvent accessibility alphabet
3     hydrocharge  3     2-value hydrophobicity with charged residues as a third category
4     hydrostruct  3     2-value hydrophobicity with structural-breakers as a third category
5     miqs         10    MIQS alphabet
None  None         20    No reduced alphabet
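To make the idea of recoding concrete, the sketch below maps each residue to a reduced-alphabet letter and then extracts overlapping k-mers. The two-group mapping is an illustrative assumption and is not Snekmer's actual definition of any alphabet listed above. The function names are likewise hypothetical.

```python
# Hypothetical two-letter grouping, used only for illustration.
HYPOTHETICAL_GROUPS = {
    "H": "AVLIMFWCGP",  # treated as hydrophobic in this sketch
    "P": "STYNQDEKRH",  # treated as polar/charged in this sketch
}

# Invert the groups into a residue -> reduced-letter lookup table.
RESIDUE_TO_GROUP = {
    residue: letter
    for letter, residues in HYPOTHETICAL_GROUPS.items()
    for residue in residues
}


def reduce_sequence(sequence):
    """Recode each residue to its reduced-alphabet letter."""
    return "".join(RESIDUE_TO_GROUP[residue] for residue in sequence.upper())


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


reduced = reduce_sequence("MKVLAT")
print(reduced)            # HPHHHP
print(kmers(reduced, 3))  # ['HPH', 'PHH', 'HHH', 'HHP']
```

A smaller alphabet shrinks the space of possible k-mers: with 2 letters and k = 3 there are 2**3 = 8 possible k-mers, compared with 20**3 = 8000 for the unreduced 20-letter alphabet.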

Example: Learn→Apply Without a Config File

The following walkthrough demonstrates a complete learn then apply workflow using only command line arguments with no config.yaml required. The --no-default-configfile flag tells Snekmer to skip auto-loading a config file, so all parameters come from built-in defaults and any explicit CLI flags.

Step 1: Prepare the learn directory

Create a working directory for the learn step with the expected layout:

mkdir -p learn/input learn/annotations

Copy your training FASTA files and annotation file into place:

cp training_sequences_*.fasta learn/input/
cp annotations.ann            learn/annotations/

Your directory should look like:

learn/
├── annotations/
│   └── annotations.ann
└── input/
    ├── training_sequences_1.fasta
    ├── training_sequences_2.fasta
    └── ...

Step 2: Run snekmer learn

snekmer learn \
    --no-default-configfile \
    --k 8 \
    --alphabet 2 \
    --input-dir input \
    --input-file-exts fasta fna faa fa \
    --input-file-regex ".*" \
    --no-nested-output \
    --no-save-apply-associations \
    --conf-weight-modifier 20 \
    --selection top_hit \
    --threshold Median \
    --apply-output snekmer_results.csv \
    -d learn

Note

The values shown above match the built-in defaults. In practice, you only need to pass --no-default-configfile plus whichever parameters you want to change. For example, a minimal invocation relying entirely on defaults:

snekmer learn --no-default-configfile -d learn

Advanced options such as sequence fragmentation are available via config.yaml. See Setting up User Configuration (config.yaml) for the full parameter reference.

Step 3: Copy learn outputs into the apply directory

After learn completes, create the apply directory and copy the handoff files:

mkdir -p apply/input apply/counts apply/confidence apply/stats

cp test_sequences.fasta                                      apply/input/

cp learn/apply_inputs/counts/kmer_counts_total.csv           apply/counts/
cp learn/apply_inputs/confidence/global_confidence_scores.csv apply/confidence/
cp learn/apply_inputs/stats/family_summary_stats.csv         apply/stats/

Your apply directory should look like:

apply/
├── confidence/
│   └── global_confidence_scores.csv
├── counts/
│   └── kmer_counts_total.csv
├── input/
│   └── test_sequences.fasta
└── stats/
    └── family_summary_stats.csv
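Before running apply, it may be worth confirming that the handoff files are in place. The sketch below checks a directory against the layout shown above and reports anything missing. missing_apply_inputs is a hypothetical helper, not part of Snekmer, and the demonstration builds a deliberately incomplete layout in a temporary directory.

```python
import tempfile
from pathlib import Path

# Files the apply step expects, relative to the apply working directory.
REQUIRED_FILES = [
    "counts/kmer_counts_total.csv",
    "confidence/global_confidence_scores.csv",
    "stats/family_summary_stats.csv",
]


def missing_apply_inputs(apply_dir):
    """Return the expected files that are absent from apply_dir."""
    root = Path(apply_dir)
    missing = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    # At least one sequence file should be present in input/.
    input_dir = root / "input"
    if not input_dir.is_dir() or not any(p.is_file() for p in input_dir.iterdir()):
        missing.append("input/<sequence files>")
    return missing


# Demonstration on a partial layout: only input/ and counts/ are populated.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "apply"
    (demo / "input").mkdir(parents=True)
    (demo / "input" / "test_sequences.fasta").write_text(">seq1\nMKV\n")
    (demo / "counts").mkdir()
    (demo / "counts" / "kmer_counts_total.csv").touch()
    missing = missing_apply_inputs(demo)

print(missing)
```

For the walkthrough directory, the call would be missing_apply_inputs("apply"); an empty list means the layout matches.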

Step 4: Run snekmer apply

snekmer apply \
    --no-default-configfile \
    --k 8 \
    --alphabet 2 \
    --input-dir input \
    --input-file-exts fasta fna faa fa \
    --input-file-regex ".*" \
    --no-nested-output \
    --no-save-apply-associations \
    --conf-weight-modifier 20 \
    --selection top_hit \
    --threshold Median \
    --apply-output snekmer_results.csv \
    -d apply

Important

Use the same --k and --alphabet values for both learn and apply. Mismatched encoding parameters will produce incorrect results.
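One way to see why the values must match: k-mers of different lengths can never be equal, so a k-mer vocabulary built with one k shares nothing with k-mers extracted using another. The sketch below illustrates this with plain Python and a toy reduced sequence. kmer_counts is a hypothetical helper and does not reflect how Snekmer stores its counts.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


sequence = "HPHHPHPPHH"  # toy sequence in a two-letter reduced alphabet

learned = kmer_counts(sequence, k=3)  # vocabulary built during the learn step
applied = kmer_counts(sequence, k=4)  # same sequence, mismatched k at apply time

# No 4-mer can equal a 3-mer, so the two vocabularies never overlap.
shared = set(learned) & set(applied)
print(len(shared))  # 0
```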

Step 5: Inspect results

The final predictions are written to apply/snekmer_results.csv. You can preview them with:

head apply/snekmer_results.csv

Partial Workflow

To execute only part of the workflow, use the --until option. For instance, to execute the workflow only through the k-mer vector generation step, run:

snekmer {mode} --until vectorize

Snakemake Pass-Through Arguments

The following arguments are passed through directly to Snakemake and are not Snekmer-specific:

-n, --dry-run, --dryrun

Do not execute anything; display what would be done.

--configfile PATH [PATH ...]

Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order.

-C, --config KEY=VALUE [KEY=VALUE ...]

Set or overwrite values in the workflow config object.

--unlock

Unlock the working directory.

-U, --until TARGET [TARGET ...]

Run the workflow until the specified rules or files.

-k, --keepgoing, --keep-going

Continue with independent jobs if a job fails.

-w, --latency, --latency-wait, --output-wait SECONDS

Wait given seconds for output files to appear after job completion (default: 30).

-t, --touch

Touch output files instead of running commands.

-c, --cores N

Use at most N CPU cores/jobs in parallel (default: all available).

--count N

Number of files to process (limits DAG size).

--countstart IDX

Starting file index for use with --count (default: 0).

--verbose

Show additional debug output.

-q, --quiet [progress|rules|all]

Reduce Snakemake output.

-d, --directory DIR

Specify working directory.

-R, --forcerun [TARGET ...]

Force re-execution/creation of the given rules or files.

--list-code-changes, --lc

List output files for which the rule body changed.

--list-params-changes, --lp

List output files for which defined params changed.

--no-default-configfile

Do not auto-load ./config.yaml (or <DIR>/config.yaml with -d).

--clust PATH [PATH ...]

Path to cluster execution YAML configuration file (e.g., for SLURM).

-j, --jobs N

Number of simultaneous jobs to submit to the scheduler (default: 1000).

--scheduler [greedy|ilp]

Specify whether Snakemake uses the greedy or ILP scheduler.

--scheduler-ilp-solver [SOLVER]

Specify the MILP solver to be used when using the ILP scheduler.

--scheduler-ilp-solver-path [PATH]

PATH to search for ILP scheduler solver binaries.

All Options (full argparse reference)

Snekmer: A scalable pipeline for protein sequence fingerprinting using amino acid reduction (AAR).

Modes:

  cluster            Unsupervised clustering workflow.
  model              Train supervised models + cross-validation reports.
  search             Score sequences against trained models.
  learn              Build annotation-associated k-mer distributions + confidence evaluation.
  apply              Predict annotations using outputs from learn.
  easy               Guided front-end that runs learn then apply end-to-end.

General usage:

snekmer <mode> [snakemake arguments] [snekmer parameter overrides]

Important:

The “Snakemake arguments” below are passed through to Snakemake (they are not Snekmer-specific). Snekmer parameters can be provided via config.yaml / --configfile, or overridden via the Snekmer parameter flags shown in the section(s) relevant to your mode.

Defaults:

The defaults shown for Snekmer parameter flags match the template config.yaml defaults. These defaults are applied automatically only when no config file is in use, or when a flag is explicitly provided.

More help:
Get help for any subcommand with:

snekmer <mode> -h

Config precedence:
  1. Default configfile (auto): ./config.yaml (or <DIR>/config.yaml with -d/--directory)

  2. Any explicit --configfile PATH values (in order)

  3. Any Snekmer parameter flags you explicitly provide

  4. Any -C/--config KEY=VALUE overrides (highest)

Running without a config file:

Use --no-default-configfile (optional) and rely on defaults and/or provide overrides.

usage: snekmer [-h] [--dry-run] [--configfile PATH [PATH ...]]
               [-C [KEY=VALUE ...]] [--unlock] [--until TARGET [TARGET ...]]
               [--keepgoing] [--latency SECONDS] [--touch] [--cores N]
               [--count N] [--countstart IDX] [--verbose]
               [--quiet [{progress,rules,all} ...]] [--directory DIR]
               [--forcerun [TARGET ...]] [--list-code-changes]
               [--list-params-changes] [--scheduler {greedy,ilp}]
               [--scheduler-ilp-solver SCHEDULER_ILP_SOLVER]
               [--scheduler-ilp-solver-path PATH] [--no-default-configfile]
               [--clust PATH [PATH ...]] [-j N] [--k] [--alphabet]
               [--input-dir] [--input-file-exts  [...]] [--input-file-regex]
               [--nested-output | --no-nested-output]
               [--score-scaler | --no-score-scaler] [--score-scaler-n]
               [--score-labels] [--score-lname] [--cluster-method]
               [--cluster-n-clusters] [--cluster-linkage]
               [--cluster-distance-threshold]
               [--cluster-compute-full-tree | --no-cluster-compute-full-tree]
               [--cluster-plots | --no-cluster-plots] [--cluster-min-rep]
               [--cluster-max-rep]
               [--cluster-save-matrix | --no-cluster-save-matrix]
               [--cluster-dist-thresh] [--model-cv] [--model-random-state]
               [--model-dir] [--basis-dir] [--score-dir]
               [--save-apply-associations | --no-save-apply-associations]
               [--conf-weight-modifier] [--fragmentation] [--fragment-version]
               [--frag-length] [--min-length] [--fragment-location] [--seed]
               [--selection] [--threshold] [--weight-top] [--weight-distance]
               [--apply-output] [-v]
               {cluster,model,search,learn,apply,easy} ...

Named Arguments

--nested-output

Enable nested output directory structure: {save_dir}/{alphabet}/{k}.

Default: False

--no-nested-output

Disable nested output directory structure (flat output layout).

Default: False

--score-scaler

Enable k-mer score scaling (applies configured scaler to family scores).

Default: True

--no-score-scaler

Disable k-mer score scaling.

Default: True

--cluster-compute-full-tree

Compute full tree for hierarchical clustering (agglomerative).

Default: True

--no-cluster-compute-full-tree

Do not compute full tree for hierarchical clustering (agglomerative).

Default: True

--cluster-plots

Generate plots illustrating clustering results.

Default: False

--no-cluster-plots

Do not generate clustering plots.

Default: False

--cluster-save-matrix

Save distance matrices (BSF). Not recommended for large datasets.

Default: False

--no-cluster-save-matrix

Do not save distance matrices (BSF).

Default: False

--save-apply-associations

Save large optional outputs containing all cosine similarity scores (increases storage substantially).

Default: False

--no-save-apply-associations

Do not save large optional cosine similarity outputs.

Default: False

-v, --version

Print version and exit.

Snakemake arguments (passed through to Snakemake)

--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.

-C, --config

Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).

--unlock

Unlock the working directory.

Default: False

--until, -U

Run the workflow until it reaches the specified rules or files.

--keepgoing, --keep-going, -k

Continue with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds for output files to appear after job completion (filesystem latency).

Default: 30

--touch, -t

Touch output files instead of running commands (mark as up-to-date).

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.

--directory, -d

Specify working directory (relative paths in the Snakefile use this origin).

--forcerun, -R

Force re-execution/creation of the given rules or files.

--list-code-changes, --lc

List output files for which the rule body changed in the Snakefile.

Default: False

--list-params-changes, --lp

List output files for which defined params changed in the Snakefile.

Default: False

--scheduler

Possible choices: greedy, ilp

Snakemake scheduler plugin to use.

--scheduler-ilp-solver

MILP solver to use with the ILP scheduler.

--scheduler-ilp-solver-path

PATH to search for ILP scheduler solver binaries.

Snekmer configfile behavior

--no-default-configfile

Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).

Default: False

Snakemake cluster execution (passed through)

--clust

Path to cluster execution YAML configuration file (e.g., for SLURM).

-j, --jobs

Number of simultaneous jobs to submit to the scheduler.

Default: 1000

Snekmer parameters (all modes; defaults match config.yaml)

--k

K-mer length.

Default: 8

--alphabet

Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.

Default: “2”

--input-dir

Directory containing input sequence files.

Default: “input”

--input-file-exts

File extensions to consider valid input sequence files (space-separated).

Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]

--input-file-regex

Regular expression for parsing family/annotation identifiers from filenames.

Default: “.*”
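As an illustration of how a filename regular expression can isolate a family identifier, the sketch below applies a pattern to each file's stem. This is an assumption about the general technique, not a description of Snekmer's internal matching. family_from_filename and the example filenames are hypothetical.

```python
import re
from pathlib import Path


def family_from_filename(path, pattern=".*"):
    """Return the part of the file stem matched by pattern, or None (assumed behavior)."""
    stem = Path(path).stem  # filename without directory or extension
    match = re.search(pattern, stem)
    return match.group(0) if match else None


# With the default pattern, the whole stem is used as the identifier.
default_id = family_from_filename("input/PF00001_seed.fasta")
# A narrower pattern keeps only the accession-like prefix.
pfam_id = family_from_filename("input/PF00001_seed.fasta", pattern=r"PF\d+")
# Files that do not match the pattern yield no identifier.
no_id = family_from_filename("input/unknown.fasta", pattern=r"PF\d+")

print(default_id, pfam_id, no_id)
```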

Snekmer Model and Search Parameters: scoring

--score-scaler-n

Scaler keyword argument ‘n’ (passed to the k-mer scaler).

Default: 0.25

--score-labels

If None, uses default k-mer label set for scaler. Otherwise uses provided value (string or JSON).

--score-lname

Label name (e.g., “family”).

Snekmer Cluster Parameters: clustering

--cluster-method

Clustering method (options include “kmeans”, “agglomerative”, “correlation”, “density”, “birch”, “optics”, “hdbscan”).

Default: “agglomerative-jaccard”

--cluster-n-clusters

Number of clusters (int) or ‘None’ (method-dependent).

--cluster-linkage

Linkage method for agglomerative clustering (e.g. “average”).

Default: “average”

--cluster-distance-threshold

Distance threshold for agglomerative clustering (method-dependent).

Default: 0.92

--cluster-min-rep

Minimum repetition threshold for kmers (int) or ‘None’. Kmers below this are discarded.

--cluster-max-rep

Maximum repetition threshold for kmers (int) or ‘None’. Kmers above this are discarded.

--cluster-dist-thresh

Distance threshold for BSF matrix.

Default: 100

Snekmer Model Parameters: model training

--model-cv

Number of cross-validation folds for model evaluation.

Default: 5

--model-random-state

Random state for model evaluation (int) or ‘None’.

Snekmer Search Parameters: search inputs

--model-dir

Directory containing model object(s) (.model).

Default: “output/model/”

--basis-dir

Directory containing k-mer basis set(s) (.kmers).

Default: “output/example-model/”

--score-dir

Directory containing scoring object(s) (.scorer).

Default: “output/scoring/”

Snekmer Learn and Apply Parameters:

--conf-weight-modifier

Weighting modifier for updating confidence when adding data to an existing k-mer count matrix.

Default: 20

--fragmentation

Enable training-data fragmentation (default False).

Default: False

--fragment-version

Fragment length interpretation: ‘absolute’ or ‘percent’.

Default: “absolute”

--frag-length

Fragment length (units depend on --fragment-version).

Default: 50

--min-length

Minimum fragment length to retain; shorter fragments are discarded.

Default: 50

--fragment-location

Fragment location: ‘start’, ‘end’, or ‘random’.

Default: “random”

--seed

Random seed for reproducible fragmentation.

Default: 999

--selection

Possible choices: top_hit, greatest_distance, combined_distance

Annotation selection method.

Default: “top_hit”

--threshold

Family-specific threshold used for prediction filtering (e.g. ‘Median’, ‘Mean’, ‘90th Percentile’, or ‘None’).

Default: “Median”

--weight-top

Weight for ‘top_hit’ when selection method is ‘combined_distance’.

Default: 0.7

--weight-distance

Weight for ‘greatest_distance’ when selection method is ‘combined_distance’.

Default: 0.3

--apply-output

Output filename for apply results in single-file format.

Default: “snekmer_results.csv”

mode

Snekmer mode.

Possible choices: cluster, model, search, learn, apply, easy

Sub-commands

cluster

Unsupervised clustering workflow.

snekmer cluster [options]

Named Arguments

--nested-output

Enable nested output directory structure: {save_dir}/{alphabet}/{k}.

Default: False

--no-nested-output

Disable nested output directory structure (flat output layout).

Default: False

--cluster-compute-full-tree

Compute full tree for hierarchical clustering (agglomerative).

Default: True

--no-cluster-compute-full-tree

Do not compute full tree for hierarchical clustering (agglomerative).

Default: True

--cluster-plots

Generate plots illustrating clustering results.

Default: False

--no-cluster-plots

Do not generate clustering plots.

Default: False

--cluster-save-matrix

Save distance matrices (BSF). Not recommended for large datasets.

Default: False

--no-cluster-save-matrix

Do not save distance matrices (BSF).

Default: False

Snakemake arguments (passed through to Snakemake)

--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.

-C, --config

Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).

--unlock

Unlock the working directory.

Default: False

--until, -U

Run the workflow until it reaches the specified rules or files.

--keepgoing, --keep-going, -k

Continue with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds for output files to appear after job completion (filesystem latency).

Default: 30

--touch, -t

Touch output files instead of running commands (mark as up-to-date).

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.

--directory, -d

Specify working directory (relative paths in the Snakefile use this origin).

--forcerun, -R

Force re-execution/creation of the given rules or files.

--list-code-changes, --lc

List output files for which the rule body changed in the Snakefile.

Default: False

--list-params-changes, --lp

List output files for which defined params changed in the Snakefile.

Default: False

--scheduler

Possible choices: greedy, ilp

Snakemake scheduler plugin to use.

--scheduler-ilp-solver

MILP solver to use with the ILP scheduler.

--scheduler-ilp-solver-path

PATH to search for ILP scheduler solver binaries.

Snekmer configfile behavior

--no-default-configfile

Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).

Default: False

Snakemake cluster execution (passed through)

--clust

Path to cluster execution YAML configuration file (e.g., for SLURM).

-j, --jobs

Number of simultaneous jobs to submit to the scheduler.

Default: 1000

Snekmer parameters (all modes; defaults match config.yaml)

--k

K-mer length.

Default: 8

--alphabet

Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.

Default: “2”

--input-dir

Directory containing input sequence files.

Default: “input”

--input-file-exts

File extensions to consider valid input sequence files (space-separated).

Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]

--input-file-regex

Regular expression for parsing family/annotation identifiers from filenames.

Default: “.*”

Snekmer Cluster Parameters: clustering

--cluster-method

Clustering method (options include “kmeans”, “agglomerative”, “correlation”, “density”, “birch”, “optics”, “hdbscan”).

Default: “agglomerative-jaccard”

--cluster-n-clusters

Number of clusters (int) or ‘None’ (method-dependent).

--cluster-linkage

Linkage method for agglomerative clustering (e.g. “average”).

Default: “average”

--cluster-distance-threshold

Distance threshold for agglomerative clustering (method-dependent).

Default: 0.92

--cluster-min-rep

Minimum repetition threshold for kmers (int) or ‘None’. Kmers below this are discarded.

--cluster-max-rep

Maximum repetition threshold for kmers (int) or ‘None’. Kmers above this are discarded.

--cluster-dist-thresh

Distance threshold for BSF matrix.

Default: 100

model

Train supervised models + cross-validation reports.

snekmer model [options]

Named Arguments

--nested-output

Enable nested output directory structure: {save_dir}/{alphabet}/{k}.

Default: False

--no-nested-output

Disable nested output directory structure (flat output layout).

Default: False

--score-scaler

Enable k-mer score scaling (applies configured scaler to family scores).

Default: True

--no-score-scaler

Disable k-mer score scaling.

Default: True

Snakemake arguments (passed through to Snakemake)

--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.

-C, --config

Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).

--unlock

Unlock the working directory.

Default: False

--until, -U

Run the workflow until it reaches the specified rules or files.

--keepgoing, --keep-going, -k

Continue with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds for output files to appear after job completion (filesystem latency).

Default: 30

--touch, -t

Touch output files instead of running commands (mark as up-to-date).

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.

--directory, -d

Specify working directory (relative paths in the Snakefile use this origin).

--forcerun, -R

Force re-execution/creation of the given rules or files.

--list-code-changes, --lc

List output files for which the rule body changed in the Snakefile.

Default: False

--list-params-changes, --lp

List output files for which defined params changed in the Snakefile.

Default: False

--scheduler

Possible choices: greedy, ilp

Snakemake scheduler plugin to use.

--scheduler-ilp-solver

MILP solver to use with the ILP scheduler.

--scheduler-ilp-solver-path

PATH to search for ILP scheduler solver binaries.

Snekmer configfile behavior

--no-default-configfile

Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).

Default: False

Snakemake cluster execution (passed through)

--clust

Path to cluster execution YAML configuration file (e.g., for SLURM).

-j, --jobs

Number of simultaneous jobs to submit to the scheduler.

Default: 1000

Snekmer parameters (all modes; defaults match config.yaml)

--k

K-mer length.

Default: 8

--alphabet

Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.

Default: “2”

--input-dir

Directory containing input sequence files.

Default: “input”

--input-file-exts

File extensions to consider valid input sequence files (space-separated).

Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]

--input-file-regex

Regular expression for parsing family/annotation identifiers from filenames.

Default: “.*”

Snekmer Model and Search Parameters: scoring

--score-scaler-n

Scaler keyword argument ‘n’ (passed to the k-mer scaler).

Default: 0.25

--score-labels

If None, uses default k-mer label set for scaler. Otherwise uses provided value (string or JSON).

--score-lname

Label name (e.g., “family”).

Snekmer Model Parameters: model training

--model-cv

Number of cross-validation folds for model evaluation.

Default: 5

--model-random-state

Random state for model evaluation (int) or ‘None’.

learn

Build annotation-associated k-mer distributions + confidence evaluation.

snekmer learn [options]

Named Arguments

--nested-output

Enable nested output directory structure: {save_dir}/{alphabet}/{k}.

Default: False

--no-nested-output

Disable nested output directory structure (flat output layout).

Default: False

--save-apply-associations

Save large optional outputs containing all cosine similarity scores (increases storage substantially).

Default: False

--no-save-apply-associations

Do not save large optional cosine similarity outputs.

Default: False

Snakemake arguments (passed through to Snakemake)

--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.

-C, --config

Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).

--unlock

Unlock the working directory.

Default: False

--until, -U

Run the workflow until it reaches the specified rules or files.

--keepgoing, --keep-going, -k

Continue with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds for output files to appear after job completion (filesystem latency).

Default: 30

--touch, -t

Touch output files instead of running commands (mark as up-to-date).

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.

--directory, -d

Specify working directory (relative paths in the Snakefile use this origin).

--forcerun, -R

Force re-execution/creation of the given rules or files.

--list-code-changes, --lc

List output files for which the rule body changed in the Snakefile.

Default: False

--list-params-changes, --lp

List output files for which defined params changed in the Snakefile.

Default: False

--scheduler

Possible choices: greedy, ilp

Snakemake scheduler plugin to use.

--scheduler-ilp-solver

MILP solver to use with the ILP scheduler.

--scheduler-ilp-solver-path

PATH to search for ILP scheduler solver binaries.

Snekmer configfile behavior

--no-default-configfile

Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).

Default: False

Snakemake cluster execution (passed through)

--clust

Path to cluster execution YAML configuration file (e.g., for SLURM).

-j, --jobs

Number of simultaneous jobs to submit to the scheduler.

Default: 1000

Snekmer parameters (all modes; defaults match config.yaml)

--k

K-mer length.

Default: 8

--alphabet

Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.

Default: “2”

--input-dir

Directory containing input sequence files.

Default: “input”

--input-file-exts

File extensions to consider valid input sequence files (space-separated).

Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]

--input-file-regex

Regular expression for parsing family/annotation identifiers from filenames.

Default: “.*”
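
The three input options interact roughly as follows: files are filtered by extension, then the regular expression is applied to each filename to extract a family or annotation identifier. The sketch below is illustrative only. label_inputs is a hypothetical helper, and matching the pattern against the filename stem with re.search is an assumption about how the expression is applied.

```python
import re
from pathlib import Path

DEFAULT_EXTS = ("fasta", "fna", "faa", "fa")


def label_inputs(filenames, exts=DEFAULT_EXTS, regex=".*"):
    """Map each accepted filename to the identifier parsed from it.

    Hypothetical helper: keeps files whose extension is in `exts`, then
    takes the first regex match in the filename stem as the identifier.
    """
    pattern = re.compile(regex)
    labels = {}
    for name in filenames:
        path = Path(name)
        if path.suffix.lstrip(".") not in exts:
            continue  # not a recognized sequence file
        match = pattern.search(path.stem)
        if match:
            labels[path.name] = match.group(0)
    return labels


names = ["TIGR00001_seqs.faa", "PF00004.fasta", "notes.txt"]
print(label_inputs(names, regex=r"[A-Z]+\d+"))
```

With the default pattern ".*" the whole stem becomes the identifier, which is why a narrower expression is useful when filenames carry extra suffixes.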

Snekmer Learn and Apply Parameters:

--conf-weight-modifier

Weighting modifier for updating confidence when adding data to an existing k-mer count matrix.

Default: 20

--fragmentation

Enable training-data fragmentation.

Default: False

--fragment-version

Fragment length interpretation: ‘absolute’ or ‘percent’.

Default: “absolute”

--frag-length

Fragment length (units depend on --fragment-version).

Default: 50

--min-length

Minimum fragment length to retain; shorter fragments are discarded.

Default: 50

--fragment-location

Fragment location: ‘start’, ‘end’, or ‘random’.

Default: “random”

--seed

Random seed for reproducible fragmentation.

Default: 999
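
To show how the fragmentation options could fit together, here is a minimal sketch under stated assumptions. It is not Snekmer's implementation: the fragment function is hypothetical, 'percent' is read as a percentage of the sequence length, and one fragment is taken per sequence.

```python
import random


def fragment(seq, frag_length=50, min_length=50, version="absolute",
             location="random", rng=None):
    """Cut one fragment from `seq`, or return None if it is too short.

    Illustrative assumptions: 'percent' means percent of len(seq), and a
    fragment shorter than `min_length` is discarded.
    """
    rng = rng or random.Random(999)
    if version == "percent":
        length = max(1, int(len(seq) * frag_length / 100))
    else:  # 'absolute': length in residues
        length = int(frag_length)
    length = min(length, len(seq))

    if location == "start":
        start = 0
    elif location == "end":
        start = len(seq) - length
    else:  # 'random': any valid offset, reproducible through the seed
        start = rng.randint(0, len(seq) - length)

    piece = seq[start:start + length]
    return piece if len(piece) >= min_length else None


seq = "ACDEFGHIKLMNPQRSTVWY" * 6  # 120 residues
print(fragment(seq, location="start"))
print(fragment(seq, frag_length=50, version="percent", location="end"))
```

Passing a generator seeded with the same value, as --seed would, makes the 'random' location repeatable between runs.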

--selection

Possible choices: top_hit, greatest_distance, combined_distance

Annotation selection method.

Default: “top_hit”

--threshold

Family-specific threshold used for prediction filtering (e.g. ‘Median’, ‘Mean’, ‘90th Percentile’, or ‘None’).

Default: “Median”
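
The named threshold options correspond to standard summary statistics. The sketch below shows how a per-family cutoff might be derived and applied. It is an assumption-laden illustration: which score distribution Snekmer summarizes is not stated here, the percentile interpolation method is a choice made for this example, and family_threshold and passes are hypothetical names.

```python
import statistics


def family_threshold(scores, method="Median"):
    """Summarize one family's scores into a cutoff (or None for no filter)."""
    if method is None or method == "None":
        return None
    if method == "Median":
        return statistics.median(scores)
    if method == "Mean":
        return statistics.fmean(scores)
    if method == "90th Percentile":
        # Last of the nine decile cut points; 'inclusive' is an assumption.
        return statistics.quantiles(scores, n=10, method="inclusive")[-1]
    raise ValueError(f"unknown threshold method: {method!r}")


def passes(score, threshold):
    """Keep a prediction when no threshold is set or the score reaches it."""
    return threshold is None or score >= threshold


family_scores = [0.2, 0.4, 0.6, 0.8]  # made-up values for illustration
cutoff = family_threshold(family_scores, "Median")
print(cutoff, passes(0.55, cutoff), passes(0.45, cutoff))
```

A stricter option such as '90th Percentile' raises the cutoff and so filters out more predictions, while 'None' disables filtering.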

--weight-top

Weight for ‘top_hit’ when selection method is ‘combined_distance’.

Default: 0.7

--weight-distance

Weight for ‘greatest_distance’ when selection method is ‘combined_distance’.

Default: 0.3

--apply-output

Output filename for apply results in single-file format.

Default: “snekmer_results.csv”

apply

Predict annotations using outputs from learn.

snekmer apply [options]

Named Arguments

--nested-output

Enable nested output directory structure: {save_dir}/{alphabet}/{k}.

Default: False

--no-nested-output

Disable nested output directory structure (flat output layout).

Default: False
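
The difference between the nested and flat layouts can be sketched as a path computation following the {save_dir}/{alphabet}/{k} pattern given above. output_dir is a hypothetical helper, and whether the alphabet appears in the path by name or by number is an assumption.

```python
from pathlib import Path


def output_dir(save_dir, alphabet, k, nested=False):
    """Return the directory results would be written to.

    Nested layout: {save_dir}/{alphabet}/{k}; flat layout: {save_dir}.
    """
    base = Path(save_dir)
    return base / str(alphabet) / str(k) if nested else base


print(output_dir("output", "solvacc", 8, nested=True))
print(output_dir("output", "solvacc", 8, nested=False))
```

The nested layout keeps runs with different alphabet and k settings from writing into the same directory.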

--save-apply-associations

Save large optional outputs containing all cosine similarity scores (increases storage substantially).

Default: False

--no-save-apply-associations

Do not save large optional cosine similarity outputs.

Default: False

Snakemake arguments (passed through to Snakemake)

--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite workflow config file(s). Multiple files overwrite each other in the given order. Values are available via Snakemake’s global config dictionary.

-C, --config

Set or overwrite values in the workflow config object (Snakemake --config KEY=VALUE).

--unlock

Unlock the working directory.

Default: False

--until, -U

Run the workflow until it reaches the specified rules or files.

--keepgoing, --keep-going, -k

Continue with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds for output files to appear after job completion (filesystem latency).

Default: 30

--touch, -t

Touch output files instead of running commands (mark as up-to-date).

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Reduce Snakemake output (progress/rules/all). If used without args, quiets progress and rules.

--directory, -d

Specify working directory (relative paths in the Snakefile use this origin).

--forcerun, -R

Force re-execution/creation of the given rules or files.

--list-code-changes, --lc

List output files for which the rule body changed in the Snakefile.

Default: False

--list-params-changes, --lp

List output files for which defined params changed in the Snakefile.

Default: False

--scheduler

Possible choices: greedy, ilp

Snakemake scheduler plugin to use.

--scheduler-ilp-solver

MILP solver to use with the ILP scheduler.

--scheduler-ilp-solver-path

PATH to search for ILP scheduler solver binaries.

Snekmer configfile behavior

--no-default-configfile

Do not auto-load ./config.yaml (or <DIR>/config.yaml when using -d).

Default: False

Snakemake cluster execution (passed through)

--clust

Path to cluster execution YAML configuration file (e.g., for SLURM).

-j, --jobs

Number of simultaneous jobs to submit to the scheduler.

Default: 1000

Snekmer parameters (all modes; defaults match config.yaml)

--k

K-mer length.

Default: 8

--alphabet

Reduced alphabet encoding (0–5, alphabet name, or ‘None’). See alphabets list below.

Default: “2”

--input-dir

Directory containing input sequence files.

Default: “input”

--input-file-exts

File extensions to consider valid input sequence files (space-separated).

Default: [‘fasta’, ‘fna’, ‘faa’, ‘fa’]

--input-file-regex

Regular expression for parsing family/annotation identifiers from filenames.

Default: “.*”

Snekmer Learn and Apply Parameters:

--conf-weight-modifier

Weighting modifier for updating confidence when adding data to an existing k-mer count matrix.

Default: 20

--fragmentation

Enable training-data fragmentation.

Default: False

--fragment-version

Fragment length interpretation: ‘absolute’ or ‘percent’.

Default: “absolute”

--frag-length

Fragment length (units depend on --fragment-version).

Default: 50

--min-length

Minimum fragment length to retain; shorter fragments are discarded.

Default: 50

--fragment-location

Fragment location: ‘start’, ‘end’, or ‘random’.

Default: “random”

--seed

Random seed for reproducible fragmentation.

Default: 999

--selection

Possible choices: top_hit, greatest_distance, combined_distance

Annotation selection method.

Default: “top_hit”

--threshold

Family-specific threshold used for prediction filtering (e.g. ‘Median’, ‘Mean’, ‘90th Percentile’, or ‘None’).

Default: “Median”

--weight-top

Weight for ‘top_hit’ when selection method is ‘combined_distance’.

Default: 0.7

--weight-distance

Weight for ‘greatest_distance’ when selection method is ‘combined_distance’.

Default: 0.3

--apply-output

Output filename for apply results in single-file format.

Default: “snekmer_results.csv”

easy

Guided front-end that runs learn then apply end-to-end.

Prompts for training sequences, query sequences, and annotation style, then builds a self-contained workspace and runs both pipeline steps. All prompts can be skipped by supplying the corresponding flags.

snekmer easy [options]

Input / output

--train

Path to training sequences (FASTA file or directory of FASTA files). If omitted, the wizard will prompt for it.

--query

Path to query sequences to annotate (FASTA file or directory). If omitted, the wizard will prompt for it.

--output

Output directory for the workspace. If omitted, the wizard will prompt.

Annotation (choose one)

--ann

Path to an existing annotation file (.ann). Format: tab-separated with columns ‘id’ and ‘family’.

--create-ann

Generate annotations from training FASTA headers. Requires headers in the format: >db|FAMILY_LABEL|seqid description (the field between the first pair of | | becomes the family label).

Default: False
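
The header convention for --create-ann and the two-column .ann layout can be illustrated with a short parser. This is a sketch, not Snekmer's code: the function names are hypothetical, and using the first token after the second | as the 'id' is an assumption.

```python
import csv
import io


def annotations_from_fasta(lines):
    """Collect (id, family) pairs from headers like >db|FAMILY_LABEL|seqid desc."""
    rows = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].strip().split("|", 2)
        if len(fields) < 3:
            continue  # header does not follow the expected format
        tokens = fields[2].split()
        if not tokens:
            continue  # no sequence id present
        rows.append((tokens[0], fields[1]))
    return rows


def write_ann(rows, handle):
    """Write tab-separated rows with the 'id' and 'family' header columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "family"])
    writer.writerows(rows)


fasta = [">sp|TIGR00001|P12345 example protein", "MKTAYIAKQR", ">malformed header"]
rows = annotations_from_fasta(fasta)
buffer = io.StringIO()
write_ann(rows, buffer)
print(buffer.getvalue())
```

Headers that do not contain two | separators are skipped here; a real run would need every training header to carry a family label.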

K-mer parameters

--k

K-mer length.

Default: 8

--alphabet

Reduced alphabet encoding (0–5, alphabet name, or ‘None’). 2 = solvacc (3-letter). See alphabets list below.

Default: “2”

Learn / apply options

--selection

Possible choices: top_hit, greatest_distance, combined_distance

Annotation selection method {top_hit, greatest_distance, combined_distance}.

Default: “top_hit”

--threshold

Family-specific score threshold for prediction filtering. Options: ‘Median’, ‘Mean’, ‘90th Percentile’, ‘None’.

Default: “Median”

--apply-output

Output filename for apply results.

Default: “snekmer_results.csv”

Fragmentation (advanced)

--fragmentation

Split sequences into fragments before kmerization.

Default: False

--frag-length

Fragment length in residues (default: 50).

--min-length

Minimum sequence length to fragment (default: 50).

--fragment-version

Fragmentation version (default: absolute).

--fragment-location

Fragment location method (default: random).

--seed

Random seed for fragmentation (default: 999).

Snakemake options

--cores, -c

CPU cores to use.

Default: 2

--dry-run, -n

Show what would be done without executing.

Default: False

--verbose

Show additional Snakemake debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Reduce Snakemake output.

Miscellaneous

--copy-files

Copy input files into the workspace instead of symlinking them (useful when the workspace will be moved or shared).

Default: False
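
Because every wizard prompt can be replaced by a flag, a non-interactive call can be assembled programmatically. The sketch below only builds and prints the command line without running it. The paths are placeholders, easy_command is a hypothetical helper, and only flags listed above are used.

```python
import shlex


def easy_command(train, query, output, ann=None, cores=2, dry_run=False):
    """Build the argument list for a prompt-free `snekmer easy` call."""
    argv = ["snekmer", "easy",
            "--train", train, "--query", query, "--output", output]
    if ann is not None:
        argv += ["--ann", ann]        # use an existing annotation file
    else:
        argv.append("--create-ann")   # derive annotations from FASTA headers
    argv += ["--cores", str(cores)]
    if dry_run:
        argv.append("--dry-run")
    return argv


# Placeholder paths for illustration only.
argv = easy_command("data/train", "data/query.faa", "workspace", dry_run=True)
print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run once the dry run shows the expected plan.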

Alphabets (k-mer recoding):

0: hydro (size 2) - 2-value hydrophobicity alphabet
1: standard (size 7) - “Standard” reduction alphabet
2: solvacc (size 3) - Solvent accessibility alphabet
3: hydrocharge (size 3) - 2-value hydrophobicity with charged residues as a third category
4: hydrostruct (size 3) - 2-value hydrophobicity with structural-breakers as a third category
5: miqs (size 10) - MIQS alphabet3
None: None (size 20) - No reduced alphabet

You may pass either an integer (0–5) or the alphabet name (e.g. ‘hydro’), or ‘None’.
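
The accepted forms of --alphabet can be sketched as a small lookup built from the list above. resolve_alphabet and kmer_space are hypothetical helpers, not part of Snekmer. The second one shows why the alphabet choice matters: the number of possible k-mers is the alphabet size raised to the power k.

```python
# Index -> (name, size), copied from the alphabet list above.
ALPHABETS = {
    0: ("hydro", 2),
    1: ("standard", 7),
    2: ("solvacc", 3),
    3: ("hydrocharge", 3),
    4: ("hydrostruct", 3),
    5: ("miqs", 10),
}


def resolve_alphabet(value):
    """Accept an integer 0-5, an alphabet name, or 'None'; return (name, size)."""
    text = str(value).strip()
    if text.lower() == "none":
        return ("None", 20)  # full 20-letter amino acid alphabet
    if text.isdigit() and int(text) in ALPHABETS:
        return ALPHABETS[int(text)]
    for name, size in ALPHABETS.values():
        if name == text.lower():
            return (name, size)
    raise ValueError(f"unknown alphabet: {value!r}")


def kmer_space(value, k):
    """Number of distinct k-mers possible under the chosen alphabet."""
    return resolve_alphabet(value)[1] ** k


print(resolve_alphabet(2), resolve_alphabet("hydro"))
print(kmer_space(2, 8))       # 3**8 = 6561
print(kmer_space("None", 8))  # 20**8 = 25600000000
```

With the defaults (alphabet 2, k = 8) the feature space has 6,561 possible k-mers, compared with 25.6 billion for unreduced sequences at the same k.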
