Setting up User Configuration (config.yaml)

To run Snekmer, specify parameters either in a YAML configuration file or as command line arguments using -C or --config. A template config.yaml file is included in the resources directory.

The example YAML files included are:

  • config.yaml: Configuration file for running Snekmer

  • clust.yaml: (optional) Cluster configuration file for deploying Snekmer on a high-performance computing (HPC) cluster

Parameter Descriptions for config.yaml

The base config.yaml file contains the parameters required to run snekmer model or snekmer cluster. These parameters may instead be given as command line arguments, and the two approaches can be mixed: some parameters can be set in a YAML file (either config.yaml or a file passed with --configfile when invoking Snekmer) and others with -C or --config.

Required Parameters

Parameters that the user must specify in order to use Snekmer.

Parameter  Type        Description
k          int         K-mer length
alphabet   str or int  Reduced alphabet encoding (see Alphabets for more details). Alphabets may be specified by numbers 0-5 or by their names.
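
As a minimal sketch, the two required parameters might appear in config.yaml as shown here. The values are illustrative assumptions rather than recommended settings, and the placement as top-level keys is also assumed; the bundled template in the resources directory is the authoritative reference.

```yaml
# Illustrative values only; choose k and alphabet to suit your own data.
k: 14         # k-mer length (int); 14 is an arbitrary example
alphabet: 0   # reduced alphabet, given by number (0-5) or by name
```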

Input/Output Parameters

General parameters related to input and output sequences and/or files.

Parameter         Type         Description
input_dir         str          Directory containing input FASTA files (default: input)
input_file_exts   list         File extensions to be considered as valid for input sequence files
input_file_regex  str or None  Regular expression for parsing family/annotation identifiers from filenames
nested_output     bool         If True, saves files into nested directory structure, i.e. {save_dir}/{alphabet}/{k}
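
A sketch of how the input/output parameters could be written in config.yaml. Only the input_dir default comes from the table above. The extension list, the regular expression, and the top-level placement of these keys are assumptions made for illustration; compare against the bundled template before relying on them.

```yaml
input_dir: input                      # default directory for input FASTA files
input_file_exts: ["fasta", "faa"]     # assumed example extensions
input_file_regex: ".*"                # assumed pattern; adapt to your filename scheme
nested_output: False                  # True would write to {save_dir}/{alphabet}/{k}
```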

Score Parameters

General parameters related to how Snekmer calculates family scores for k-mers.

Parameter      Type         Description
scaler         bool         If True, applies k-mer frequency scaling before scoring
scaler_kwargs  dict         Keyword arguments to pass to k-mer scaler object (e.g. {"n": 0.25})
labels         str or None  If None, uses the default k-mer set for the scaler; otherwise, uses the specified k-mers
lname          str or None  Label name (e.g. "family")
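
The score parameters might be grouped in config.yaml roughly as follows. Nesting them under a score key is an assumption, made by analogy with the cluster.method notation used elsewhere in this document. The scaler_kwargs and lname values are the examples given in the table. The labels parameter is left out of this sketch; the bundled template shows how to leave it unset.

```yaml
score:                        # assumed section name
  scaler: True                # apply k-mer frequency scaling before scoring
  scaler_kwargs: {"n": 0.25}  # example keyword arguments from the table above
  lname: "family"             # example label name from the table above
```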

Model Parameters

General parameters related to Snekmer’s model mode (snekmer model), wherein supervised models are trained via the workflow.

Parameter     Type         Description
cv            int          Number of cross-validation folds for model evaluation
random_state  int or None  Random state for model evaluation
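
A sketch of the model parameters in config.yaml. The model section name and both values are assumptions chosen for illustration. Fixing random_state to an integer is a common way to make evaluation repeatable across runs.

```yaml
model:               # assumed section name
  cv: 5              # number of cross-validation folds (illustrative)
  random_state: 42   # fixed integer for repeatable evaluation (illustrative)
```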

Cluster Parameters

General parameters related to Snekmer’s cluster mode (snekmer cluster), wherein unsupervised clusters are produced via the workflow.

Parameter      Type         Description
method         str          Clustering algorithm. See table below for all options.
params         dict         Parameters to pass to the clustering algorithm
cluster_plots  bool         If True, generates figures illustrating clustering results (t-SNE, UMAP, PCA)
min_rep        int or None  Discard k-mers with fewer than this many occurrences across the input set
max_rep        int or None  Discard k-mers with more than this many occurrences across the input set
save_matrix    bool         If True, saves the pairwise distance matrix (large files; not recommended for large datasets)
dist_thresh    int          Distance threshold used when computing the BSF Jaccard matrix

Clustering methods (cluster.method)

Value                  Description
agglomerative-jaccard  Agglomerative clustering using Jaccard distance (default; uses BSF if installed, otherwise falls back to scipy)
density-jaccard        DBSCAN density clustering using Jaccard distance (uses BSF if installed, otherwise falls back to scipy)
hdensity-jaccard       HDBSCAN density clustering using Jaccard distance (uses BSF if installed, otherwise falls back to scipy)
agglomerative          Agglomerative clustering using Euclidean distance
kmeans                 Mini-batch k-means clustering
correlation            Hierarchical clustering using correlation distance
density                DBSCAN density-based clustering
birch                  Birch incremental clustering
optics                 OPTICS density-based clustering
hdbscan                HDBSCAN hierarchical density clustering

The three -jaccard methods use the Blazing Signature Filter (BSF) when installed, and fall back to scipy.spatial.distance.pdist automatically when BSF is not available.
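
Putting the cluster parameters together, a cluster section of config.yaml might look like the sketch below. The cluster key follows the cluster.method notation used above. Every value is an illustrative assumption, and the contents of params in particular depend on which algorithm is selected, so the linkage entry is hypothetical.

```yaml
cluster:
  method: "agglomerative-jaccard"   # default method listed in the table above
  params: {"linkage": "average"}    # hypothetical; valid keys depend on the algorithm
  cluster_plots: False              # True would generate t-SNE, UMAP, and PCA figures
  min_rep: 2                        # illustrative: drop k-mers seen fewer than 2 times
  max_rep: 1000                     # illustrative: drop k-mers seen more than 1000 times
  save_matrix: False                # distance matrices can be large
  dist_thresh: 100                  # illustrative threshold for the BSF Jaccard matrix
```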

Parameter Descriptions for clust.yaml

clust.yaml is an optional configuration file used to deploy Snekmer jobs on a high-performance computing (HPC) cluster via SLURM (or another scheduler supported by Snakemake). It is not required for local runs.

A typical clust.yaml specifies resource requests per rule, for example:

__default__:
  partition: normal
  time: "04:00:00"
  mem: "16G"
  ntasks: 1
  cpus-per-task: 4

vectorize:
  time: "01:00:00"
  mem: "8G"

Pass it to Snekmer with the --clust flag:

snekmer cluster --clust clust.yaml

See the Snakemake cluster execution documentation and SLURM sbatch documentation for the full list of supported fields.