Setting up User Configuration

To run Snekmer, the user must specify parameters in a YAML configuration file (.yaml). A template config.yaml file is included in the resources directory.

The example YAML files included are:

  • config.yaml: Configuration file for running Snekmer

  • clust.yaml: (optional) Cluster configuration file for deploying Snekmer on a high-performance computing (HPC) cluster

Parameter Descriptions for config.yaml

The base config.yaml file is required in order to run snekmer model or snekmer cluster.

Required Parameters

Parameters that the user must specify in order to use Snekmer.

Parameter  Type        Description
k          int         K-mer length
alphabet   str or int  Reduced alphabet encoding (see Alphabets for more details). Alphabets may be specified by numbers 0-5 or by their names.
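
As a minimal sketch, the required parameters might be written in config.yaml as shown below. The values are illustrative only, and placing both keys at the top level of the file is an assumption; the template config.yaml in the resources directory remains the authoritative reference.

```yaml
# Illustrative values only (assumptions, not recommendations).
k: 14          # k-mer length (int)
alphabet: 0    # reduced alphabet, given by number (0-5) or by name
```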

Input/Output Parameters

General parameters related to input and output sequences and/or files.

Parameter         Type         Description
input_file_exts   list         File extensions to be considered as valid for input sequence files
input_file_regex  str or None  Regular expression for parsing family/annotation identifiers from filenames
nested_output     bool         If True, saves files into nested directory structure, i.e. {save_dir}/{alphabet}/{k}
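
The input/output parameters above could be set along the following lines. This is a sketch under stated assumptions: the extension list and the regular expression are hypothetical values chosen for illustration, and the top-level placement of the keys should be checked against the template config.yaml.

```yaml
# Illustrative values only (assumptions, not Snekmer defaults).
input_file_exts: ["fasta", "faa"]   # extensions accepted as input sequence files
input_file_regex: ".*"              # hypothetical pattern; matches the entire filename
nested_output: true                 # save under {save_dir}/{alphabet}/{k}
```

With nested_output enabled, results for different alphabet and k combinations land in separate directories, which may help when comparing several runs.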

Score Parameters

General parameters related to how Snekmer calculates family scores for k-mers.

Parameter      Type         Description
scaler_kwargs  dict         Keyword arguments to pass to the k-mer scaler object
labels         str or None  If None, uses the default k-mer set for the scaler; otherwise, uses the specified set
lname          str or None  Label name (e.g. "family")
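
A hedged sketch of the score parameters is given below. Grouping them under a "score" key is an assumption about the file layout, and None is written as YAML null (which YAML loaders map to Python None); if the template config.yaml spells either differently, follow the template.

```yaml
# Sketch only: the enclosing "score" key is an assumed layout.
score:
  scaler_kwargs: {}    # empty dict: no extra keyword arguments for the scaler
  labels: null         # None: use the default k-mer set for the scaler
  lname: "family"      # label name
```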

Model Parameters

General parameters related to Snekmer’s model mode (snekmer model), wherein supervised models are trained via the workflow.

Parameter     Type         Description
cv            int          Number of cross-validation folds for model evaluation
random_state  int or None  Random state for model evaluation
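
The model parameters might be configured as in the sketch below. The enclosing "model" key and the values are assumptions for illustration; None is written as YAML null here, so defer to the template config.yaml if it uses a different spelling.

```yaml
# Sketch only: the enclosing "model" key is an assumed layout.
model:
  cv: 5                # number of cross-validation folds (illustrative)
  random_state: null   # None: no fixed random state; use an int for repeatable runs
```

Setting random_state to an integer is the usual way to make evaluation results repeatable across runs.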

Cluster Parameters

General parameters related to Snekmer’s cluster mode (snekmer cluster), wherein unsupervised clusters are produced via the workflow.

Parameter      Type         Description
method         str          Clustering method (options: "kmeans", "agglomerative", "correlation", "density", "birch", "optics", or "hdbscan")
params         dict         Parameters to pass to the clustering algorithm
cluster_plots  bool         If True, generates plots illustrating clustering results
min_rep        int or None  Threshold for the minimum number of repetitions of a k-mer within a set. K-mers that occur fewer times than this threshold are discarded.
max_rep        int or None  Threshold for the maximum number of repetitions of a k-mer within a set. K-mers that occur more times than this threshold are discarded.
save_matrix    bool         If True, saves distance matrices (BSF). Not recommended for large datasets.
dist_thresh    int          Distance threshold for BSF matrix
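
For illustration, the cluster parameters could be combined as in the following sketch. The enclosing "cluster" key, the null spelling for None, and every value shown are assumptions; in particular, the keys accepted inside params depend on the chosen clustering method and are not listed here.

```yaml
# Sketch only: the enclosing "cluster" key and all values are assumptions.
cluster:
  method: "kmeans"       # one of the documented clustering methods
  params: {}             # method-specific settings; valid keys depend on the method
  cluster_plots: false   # skip plots of the clustering results
  min_rep: null          # None: no minimum repetition threshold
  max_rep: null          # None: no maximum repetition threshold
  save_matrix: false     # avoid saving distance matrices (BSF) for large datasets
  dist_thresh: 100       # illustrative distance threshold for the BSF matrix
```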

Parameter Descriptions for clust.yaml

See the SLURM documentation for more information on cluster parameters.