Command Line Interface

To run any of the five Snekmer operation modes, simply call:

snekmer {mode}

Each mode has its own mode-specific options and parameters, specified on the command line or in the config.yaml file.

For an overview of Snekmer usage, reference the help command (snekmer --help).

$ snekmer --help
usage: snekmer [-h] [-v] {cluster,model,search,learn,apply} ...

Snekmer: A tool for kmer-based sequence analysis using amino acid reduction (AAR)

options:
-h, --help            show this help message and exit
-v, --version         print version and exit

mode:
Snekmer mode

{cluster,model,search,learn,apply}

Tailored references for the individual operation modes can be accessed via snekmer {mode} --help.

Configuration

To run Snekmer, create a config.yaml file containing the desired parameters. A template is included in the repository. Note that a config file must be present in the same directory as the input directory for Snekmer to operate.
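As a rough illustration, a config.yaml might contain entries like the fragment below. The key names shown are assumptions for illustration only; consult the template shipped with the repository for the authoritative parameter list.

```yaml
# Illustrative sketch only -- key names are assumptions; see the
# repository's config.yaml template for the real parameters.
k: 14          # kmer length
alphabet: 0    # amino acid reduction (AAR) alphabet to apply
```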

Snekmer assumes that input files are stored in the input directory, and automatically creates an output directory to save all output files. Snekmer also assumes background files, if any, are stored in input/background. An example of the assumed directory structure is shown below:

Snekmer cluster, model, and search input

.
├── config.yaml
├── input/
│   ├── background/
│   │   ├── X.fasta
│   │   ├── Y.fasta
│   │   └── etc.
│   ├── A.fasta
│   ├── B.fasta
│   └── etc.
├── output/
│   ├── ...
│   └── ...
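To stand up this layout from scratch, a shell sketch along these lines works; the FASTA file names are placeholders for your own sequence files.

```shell
# Create the directory layout Snekmer expects for cluster/model/search.
# The FASTA names below are placeholders for your own sequence files.
mkdir -p input/background
touch config.yaml                  # fill in from the repository template
touch input/A.fasta input/B.fasta  # query sequence files
touch input/background/X.fasta     # optional background sequences
```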

Snekmer learn input

.
├── config.yaml
├── input/
│   ├── A.fasta # known sequences to "learn" kmer counts matrix from
│   ├── B.fasta # known sequences to "learn" kmer counts matrix from
│   ├── etc.
│   └── base/  # optional
│       └── base-kmer-counts.csv # optional existing kmer counts to merge additively
├── annotations/
│   └── annotations.ann # annotation files used for predicting future sequences
├── output/
│   ├── ...
│   └── ...

Snekmer apply input

.
├── config.yaml
├── input/
│   ├── A.fasta # unknown sequences to "apply" kmer counts matrix on
│   ├── B.fasta # unknown sequences to "apply" kmer counts matrix on
│   └── etc.
├── counts/
│   └── kmer-counts-total.csv # kmer counts matrix generated in ``learn``
├── confidence/
│   └── global-confidence-scores.csv # global confidence distribution generated in ``learn``
├── output/
│   ├── ...
│   └── ...
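When moving from ``learn`` to ``apply``, the counts/ and confidence/ directories must be created and populated with the artifacts that ``learn`` produced. A sketch follows; the copy commands are placeholders, since the exact ``learn`` output paths depend on your run.

```shell
# Create the extra directories ``apply`` reads from. The copy commands are
# placeholders: point them at the files your ``learn`` run actually produced.
mkdir -p input counts confidence
# cp <learn-output-dir>/kmer-counts-total.csv counts/
# cp <learn-output-dir>/global-confidence-scores.csv confidence/
```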

Partial Workflow

To execute only a part of the workflow, the --until option can be invoked. For instance, to execute the workflow only through the kmer vector generation step, run:

snekmer {mode} --until vectorize

All Options

Snekmer: A tool for kmer-based sequence analysis using amino acid reduction (AAR).

usage: snekmer [-h] [-v] {cluster,model,search,learn,apply} ...

Named Arguments

-v, --version

Print version and exit.

mode

Snekmer mode (cluster, model, search, learn, or apply).

mode

Possible choices: cluster, model, search, learn, apply

Sub-commands

cluster

Apply unsupervised clustering via Snekmer.

snekmer cluster [-h] [--dry-run] [--configfile PATH [PATH ...]]
                [-C [KEY=VALUE ...]] [--unlock] [--until TARGET [TARGET ...]]
                [--keepgoing] [--latency SECONDS] [--touch] [--cores N]
                [--count N] [--countstart IDX] [--verbose]
                [--quiet [{progress,rules,all} ...]] [--directory DIR]
                [--forcerun [TARGET ...]] [--list-code-changes]
                [--list-params-changes] [--clust PATH [PATH ...]] [-j N]
Named Arguments
--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. Multiple files overwrite each other in the given order. Thereby missing keys in previous config files are extended by following configfiles. Note that this order also includes a config file defined in the workflow definition itself (which will come first).
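The override order described above behaves like a key-by-key dictionary merge, which can be sketched in Python; the keys and values here are hypothetical.

```python
# Sketch of the --configfile merge order: later files override earlier
# ones key by key, so keys missing from later files survive from earlier
# ones. The keys and values here are hypothetical.
workflow_config = {"k": 8, "alphabet": 0}   # config file named in the workflow (applied first)
cli_config = {"k": 14}                      # file passed via --configfile (applied later)

merged = {**workflow_config, **cli_config}  # later files win on conflicting keys
print(merged)                               # {'k': 14, 'alphabet': 0}
```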

-C, --config

Set or overwrite values in the workflow config object. The workflow config object is accessible as variable config inside the workflow. Default values can be set by providing a JSON file.

--unlock

Unlock the working directory.

Default: False

--until, -U

Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rules or files; sibling DAGs are not run.

--keepgoing, --keep-going, -k

Go on with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds if an output file of a job is not present after the job finished. This helps if your filesystem suffers from latency (default 30).

Default: 30

--touch, -t

Touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of Snakemake. Fails if a file does not yet exist. Note that this will only touch files that would otherwise be recreated by Snakemake (e.g. because their input files are newer). For enforcing a touch, combine this with --force, --forceall, or --forcerun. Note, however, that you lose the provenance information when the files have been created in reality. Hence, this should be used only as a last resort.

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel. If N is omitted or ‘all’, the limit is set to the number of available CPU cores. In case of cluster/cloud execution, this argument sets the maximum number of cores requested from the cluster or cloud scheduler (see https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources-remote-execution for more info). This number is available to rules via workflow.cores.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Do not output certain information. If used without arguments, do not output any progress or rule information. Defining ‘all’ results in no information being printed at all.

--directory, -d

Specify working directory (relative paths in the snakefile will use this as their origin).

--forcerun, -R

Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.

--list-code-changes, --lc

List all output files for which the rule body (run or shell) have changed in the Snakefile.

Default: False

--list-params-changes, --lp

List all output files for which the defined params have changed in the Snakefile.

Default: False

Cluster Execution Arguments
--clust

Path to the cluster execution YAML configuration file.

-j, --jobs

Number of simultaneous jobs to submit to a Slurm queue.

Default: 1000

model

Train supervised models via Snekmer.

snekmer model [-h] [--dry-run] [--configfile PATH [PATH ...]]
              [-C [KEY=VALUE ...]] [--unlock] [--until TARGET [TARGET ...]]
              [--keepgoing] [--latency SECONDS] [--touch] [--cores N]
              [--count N] [--countstart IDX] [--verbose]
              [--quiet [{progress,rules,all} ...]] [--directory DIR]
              [--forcerun [TARGET ...]] [--list-code-changes]
              [--list-params-changes] [--clust PATH [PATH ...]] [-j N]
Named Arguments
--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. Multiple files overwrite each other in the given order. Thereby missing keys in previous config files are extended by following configfiles. Note that this order also includes a config file defined in the workflow definition itself (which will come first).

-C, --config

Set or overwrite values in the workflow config object. The workflow config object is accessible as variable config inside the workflow. Default values can be set by providing a JSON file.

--unlock

Unlock the working directory.

Default: False

--until, -U

Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rules or files; sibling DAGs are not run.

--keepgoing, --keep-going, -k

Go on with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds if an output file of a job is not present after the job finished. This helps if your filesystem suffers from latency (default 30).

Default: 30

--touch, -t

Touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of Snakemake. Fails if a file does not yet exist. Note that this will only touch files that would otherwise be recreated by Snakemake (e.g. because their input files are newer). For enforcing a touch, combine this with --force, --forceall, or --forcerun. Note, however, that you lose the provenance information when the files have been created in reality. Hence, this should be used only as a last resort.

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel. If N is omitted or ‘all’, the limit is set to the number of available CPU cores. In case of cluster/cloud execution, this argument sets the maximum number of cores requested from the cluster or cloud scheduler (see https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources-remote-execution for more info). This number is available to rules via workflow.cores.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Do not output certain information. If used without arguments, do not output any progress or rule information. Defining ‘all’ results in no information being printed at all.

--directory, -d

Specify working directory (relative paths in the snakefile will use this as their origin).

--forcerun, -R

Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.

--list-code-changes, --lc

List all output files for which the rule body (run or shell) have changed in the Snakefile.

Default: False

--list-params-changes, --lp

List all output files for which the defined params have changed in the Snakefile.

Default: False

Cluster Execution Arguments
--clust

Path to the cluster execution YAML configuration file.

-j, --jobs

Number of simultaneous jobs to submit to a Slurm queue.

Default: 1000

learn

Learn kmer-annotation associations via Snekmer.

snekmer learn [-h] [--dry-run] [--configfile PATH [PATH ...]]
              [-C [KEY=VALUE ...]] [--unlock] [--until TARGET [TARGET ...]]
              [--keepgoing] [--latency SECONDS] [--touch] [--cores N]
              [--count N] [--countstart IDX] [--verbose]
              [--quiet [{progress,rules,all} ...]] [--directory DIR]
              [--forcerun [TARGET ...]] [--list-code-changes]
              [--list-params-changes] [--clust PATH [PATH ...]] [-j N]
Named Arguments
--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. Multiple files overwrite each other in the given order. Thereby missing keys in previous config files are extended by following configfiles. Note that this order also includes a config file defined in the workflow definition itself (which will come first).

-C, --config

Set or overwrite values in the workflow config object. The workflow config object is accessible as variable config inside the workflow. Default values can be set by providing a JSON file.

--unlock

Unlock the working directory.

Default: False

--until, -U

Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rules or files; sibling DAGs are not run.

--keepgoing, --keep-going, -k

Go on with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds if an output file of a job is not present after the job finished. This helps if your filesystem suffers from latency (default 30).

Default: 30

--touch, -t

Touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of Snakemake. Fails if a file does not yet exist. Note that this will only touch files that would otherwise be recreated by Snakemake (e.g. because their input files are newer). For enforcing a touch, combine this with --force, --forceall, or --forcerun. Note, however, that you lose the provenance information when the files have been created in reality. Hence, this should be used only as a last resort.

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel. If N is omitted or ‘all’, the limit is set to the number of available CPU cores. In case of cluster/cloud execution, this argument sets the maximum number of cores requested from the cluster or cloud scheduler (see https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources-remote-execution for more info). This number is available to rules via workflow.cores.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Do not output certain information. If used without arguments, do not output any progress or rule information. Defining ‘all’ results in no information being printed at all.

--directory, -d

Specify working directory (relative paths in the snakefile will use this as their origin).

--forcerun, -R

Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.

--list-code-changes, --lc

List all output files for which the rule body (run or shell) have changed in the Snakefile.

Default: False

--list-params-changes, --lp

List all output files for which the defined params have changed in the Snakefile.

Default: False

Cluster Execution Arguments
--clust

Path to the cluster execution YAML configuration file.

-j, --jobs

Number of simultaneous jobs to submit to a Slurm queue.

Default: 1000

apply

Apply kmer-annotation associations via Snekmer.

snekmer apply [-h] [--dry-run] [--configfile PATH [PATH ...]]
              [-C [KEY=VALUE ...]] [--unlock] [--until TARGET [TARGET ...]]
              [--keepgoing] [--latency SECONDS] [--touch] [--cores N]
              [--count N] [--countstart IDX] [--verbose]
              [--quiet [{progress,rules,all} ...]] [--directory DIR]
              [--forcerun [TARGET ...]] [--list-code-changes]
              [--list-params-changes] [--clust PATH [PATH ...]] [-j N]
Named Arguments
--dry-run, --dryrun, -n

Do not execute anything, and display what would be done. If you have a very large workflow, use --dry-run --quiet to just print a summary of the DAG of jobs.

Default: False

--configfile

Specify or overwrite the config file of the workflow (see the docs). Values specified in JSON or YAML format are available in the global config dictionary inside the workflow. Multiple files overwrite each other in the given order. Thereby missing keys in previous config files are extended by following configfiles. Note that this order also includes a config file defined in the workflow definition itself (which will come first).

-C, --config

Set or overwrite values in the workflow config object. The workflow config object is accessible as variable config inside the workflow. Default values can be set by providing a JSON file.

--unlock

Unlock the working directory.

Default: False

--until, -U

Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rules or files; sibling DAGs are not run.

--keepgoing, --keep-going, -k

Go on with independent jobs if a job fails.

Default: False

--latency, -w, --output-wait, --latency-wait

Wait given seconds if an output file of a job is not present after the job finished. This helps if your filesystem suffers from latency (default 30).

Default: 30

--touch, -t

Touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of Snakemake. Fails if a file does not yet exist. Note that this will only touch files that would otherwise be recreated by Snakemake (e.g. because their input files are newer). For enforcing a touch, combine this with --force, --forceall, or --forcerun. Note, however, that you lose the provenance information when the files have been created in reality. Hence, this should be used only as a last resort.

Default: False

--cores, -c

Use at most N CPU cores/jobs in parallel. If N is omitted or ‘all’, the limit is set to the number of available CPU cores. In case of cluster/cloud execution, this argument sets the maximum number of cores requested from the cluster or cloud scheduler (see https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#resources-remote-execution for more info). This number is available to rules via workflow.cores.

Default: 2

--count

Number of files to process (limits DAG size).

--countstart

Starting file index (for use with --count).

Default: 0

--verbose

Show additional debug output.

Default: False

--quiet, -q

Possible choices: progress, rules, all

Do not output certain information. If used without arguments, do not output any progress or rule information. Defining ‘all’ results in no information being printed at all.

--directory, -d

Specify working directory (relative paths in the snakefile will use this as their origin).

--forcerun, -R

Force the re-execution or creation of the given rules or files. Use this option if you changed a rule and want to have all its output in your workflow updated.

--list-code-changes, --lc

List all output files for which the rule body (run or shell) have changed in the Snakefile.

Default: False

--list-params-changes, --lp

List all output files for which the defined params have changed in the Snakefile.

Default: False

Cluster Execution Arguments
--clust

Path to the cluster execution YAML configuration file.

-j, --jobs

Number of simultaneous jobs to submit to a Slurm queue.

Default: 1000