Setting up User Configuration (config.yaml)
To run Snekmer, specify parameters either in a YAML configuration file or as
command-line arguments using -C or --config. A template config.yaml file is included in the
resources directory.
The example YAML files included are:
- config.yaml: Configuration file for running Snekmer
- clust.yaml: (optional) Cluster configuration file for deploying Snekmer on a high-performance computing (HPC) cluster
Parameter Descriptions for config.yaml
The base config.yaml file contains the parameters required to run snekmer model or snekmer cluster. These parameters may instead be given as command-line arguments. Snekmer also supports a mix of the two: some parameters in a .yaml file (either config.yaml or a file specified with --configfile when invoking Snekmer) and others via -C or --config.
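The mixing of file-based and command-line parameters can be pictured as a dictionary merge. The following is a minimal sketch, not Snekmer's own code: it assumes that command-line values take precedence over file values (as Snakemake's --config does), and the keys `k` and `alphabet` and their values are hypothetical placeholders.

```python
def parse_overrides(pairs):
    """Turn ["k=14", "alphabet=2"] into {"k": "14", "alphabet": "2"}.

    Values are kept as strings in this sketch; no type conversion is attempted.
    """
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        overrides[key] = value
    return overrides


def merge_config(file_config, overrides):
    """Return a new dict in which command-line overrides win over file values."""
    merged = dict(file_config)  # copy so the file values are left untouched
    merged.update(overrides)
    return merged


# Hypothetical values standing in for a parsed config.yaml.
file_config = {"k": 14, "alphabet": 2}
merged = merge_config(file_config, parse_overrides(["alphabet=hydro"]))
```

Here `merged` keeps `k` from the file and takes `alphabet` from the command line.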
Required Parameters
Parameters that the user must specify in order to use Snekmer.
| Parameter | Type | Description |
|---|---|---|
|  |  | K-mer length |
|  |  | Reduced alphabet encoding (see Alphabets for more details). Alphabets may be specified by numbers 0-5 or by their names. |
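To make the two required parameters concrete, here is a minimal sketch of how a sequence could be recoded into a reduced alphabet and then split into k-mers. The two-letter mapping below is a toy invented for illustration and is not one of Snekmer's alphabets; the function names are likewise hypothetical.

```python
# Toy mapping for illustration only: a few hydrophobic residues map to "H",
# everything else maps to "P". This is NOT one of Snekmer's built-in alphabets.
TOY_ALPHABET = {residue: "H" for residue in "AVLIMFWC"}


def reduce_sequence(sequence, mapping, default="P"):
    """Recode each residue using the mapping, falling back to a default symbol."""
    return "".join(mapping.get(residue, default) for residue in sequence)


def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order of appearance."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


reduced = reduce_sequence("MKVLA", TOY_ALPHABET)
reduced_kmers = kmers(reduced, 3)
```

A smaller alphabet means fewer possible k-mers for a given k, which is the usual motivation for reducing the alphabet before k-mer counting.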
Input/Output Parameters
General parameters related to input and output sequences and/or files.
| Parameter | Type | Description |
|---|---|---|
|  |  | Directory containing input FASTA files (default: |
|  |  | File extensions to be considered as valid for input sequence files |
|  |  | Regular expression for parsing family/annotation identifiers from filenames |
|  |  | If True, saves files into nested directory structure, i.e. |
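The extension filter and the filename regular expression described in the table above could work along the following lines. This is a sketch under stated assumptions: the extension set, the pattern, the example filenames, and the fallback to the bare filename stem are all hypothetical choices, not Snekmer defaults.

```python
import re
from pathlib import Path

# Hypothetical settings; substitute the values from your own config.yaml.
VALID_EXTENSIONS = {".fasta", ".faa", ".fa"}
FAMILY_PATTERN = re.compile(r"[A-Za-z]+\d+")


def is_input_file(path):
    """Accept a file only if its extension is in the allowed set."""
    return Path(path).suffix.lower() in VALID_EXTENSIONS


def family_from_filename(path, pattern=FAMILY_PATTERN):
    """Extract a family identifier from the filename stem.

    Falls back to the whole stem when the pattern does not match
    (an assumption made for this sketch).
    """
    stem = Path(path).stem
    match = pattern.search(stem)
    return match.group(0) if match else stem


family = family_from_filename("input/PF00001_seed.fasta")
```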
Score Parameters
General parameters related to how Snekmer calculates family scores for k-mers.
| Parameter | Type | Description |
|---|---|---|
|  |  | If True, applies k-mer frequency scaling before scoring |
|  |  | Keyword arguments to pass to k-mer scaler object (e.g. |
|  |  | If None, uses the default k-mer set for the scaler. Otherwise, uses the ones specified |
|  |  | Label name (e.g. |
Model Parameters
General parameters related to Snekmer’s model mode (snekmer model), wherein supervised models are trained via the workflow.
| Parameter | Type | Description |
|---|---|---|
|  |  | Number of cross-validation folds for model evaluation |
|  |  | Random state for model evaluation |
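The interaction between the number of folds and the random state can be illustrated with a small, dependency-free sketch. It is not how Snekmer performs cross-validation internally; the function name and arguments are hypothetical, and the point is only that a fixed seed yields the same folds on every run.

```python
import random


def kfold_indices(n_samples, n_folds, random_state):
    """Shuffle sample indices with a fixed seed and deal them into folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)  # seeded, so reproducible
    # Fold i takes every n_folds-th index starting at position i.
    return [indices[i::n_folds] for i in range(n_folds)]


folds = kfold_indices(n_samples=10, n_folds=5, random_state=42)
```

Each fold serves once as the held-out evaluation set while the remaining folds are used for training.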
Cluster Parameters
General parameters related to Snekmer’s cluster mode (snekmer cluster), wherein unsupervised clusters are produced via the workflow.
| Parameter | Type | Description |
|---|---|---|
|  |  | Clustering algorithm. See table below for all options. |
|  |  | Parameters to pass to the clustering algorithm |
|  |  | If True, generates figures illustrating clustering results (t-SNE, UMAP, PCA) |
|  |  | Discard k-mers with fewer than this many occurrences across the input set |
|  |  | Discard k-mers with more than this many occurrences across the input set |
|  |  | If True, saves the pairwise distance matrix (large files; not recommended for large datasets) |
|  |  | Distance threshold used when computing the BSF Jaccard matrix |
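The two occurrence thresholds in the table above act as a band-pass filter on k-mers. A minimal sketch follows, assuming that "occurrences" means the total count across all input sequences (the source does not say whether repeats within one sequence count separately); the function and argument names are hypothetical.

```python
from collections import Counter


def filter_kmers(kmer_lists, min_count, max_count):
    """Keep k-mers whose total count lies within [min_count, max_count]."""
    counts = Counter(kmer for kmers in kmer_lists for kmer in kmers)
    return {kmer for kmer, n in counts.items() if min_count <= n <= max_count}


# Toy input: one list of k-mers per sequence.
kmer_lists = [["AAA", "AAB"], ["AAA", "ABB"], ["AAA", "AAB"]]
kept = filter_kmers(kmer_lists, min_count=2, max_count=2)
```

In this toy input, "AAA" is dropped for being too common and "ABB" for being too rare.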
Clustering methods (cluster.method)
| Value | Description |
|---|---|
|  | Agglomerative clustering using Jaccard distance (default; requires BSF or falls back to scipy) |
|  | DBSCAN density clustering using Jaccard distance (requires BSF or falls back to scipy) |
|  | HDBSCAN density clustering using Jaccard distance (requires BSF or falls back to scipy) |
|  | Agglomerative clustering using Euclidean distance |
|  | Mini-batch k-means clustering |
|  | Hierarchical clustering using correlation distance |
|  | DBSCAN density-based clustering |
|  | Birch incremental clustering |
|  | OPTICS density-based clustering |
|  | HDBSCAN hierarchical density clustering |
The three -jaccard methods use the
Blazing Signature Filter (BSF) when installed, and fall back
to scipy.spatial.distance.pdist automatically when BSF is not available.
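For readers unfamiliar with the metric, the sketch below computes pairwise Jaccard distances between k-mer sets in pure Python. It illustrates the quantity that the -jaccard methods cluster on, not the BSF or scipy implementation; the results are listed in the same pair order that a condensed distance matrix uses, namely (0,1), (0,2), (1,2), and so on. Treating two empty sets as distance 0 is a choice made for this sketch.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two sets of k-mers."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical in this sketch
    return 1.0 - len(a & b) / len(union)


def pairwise_jaccard(kmer_sets):
    """Distances for every unordered pair, in condensed-matrix order."""
    return [jaccard_distance(a, b) for a, b in combinations(kmer_sets, 2)]


kmer_sets = [{"AAA", "AAB"}, {"AAA", "ABB"}, {"CCC"}]
distances = pairwise_jaccard(kmer_sets)
```

The number of pairs grows quadratically with the number of sequences, which is why a saved distance matrix becomes large for big datasets.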
Parameter Descriptions for clust.yaml
clust.yaml is an optional configuration file used to deploy Snekmer jobs
on a high-performance computing (HPC) cluster via SLURM (or another scheduler
supported by Snakemake). It is not required for local runs.
A typical clust.yaml specifies resource requests per rule, for example:
__default__:
  partition: normal
  time: "04:00:00"
  mem: "16G"
  ntasks: 1
  cpus-per-task: 4
vectorize:
  time: "01:00:00"
  mem: "8G"
Pass it to Snekmer with the --clust flag:
snekmer cluster --clust clust.yaml
See the Snakemake cluster execution documentation and SLURM sbatch documentation for the full list of supported fields.
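The way per-rule entries relate to the `__default__` block can be sketched as a dictionary lookup. This assumes the usual Snakemake cluster-configuration convention, in which a rule's own entries override the defaults and any field it omits is inherited; `resources_for` is a hypothetical helper, not part of Snekmer or Snakemake.

```python
def resources_for(rule, cluster_config):
    """Start from the __default__ block, then apply the rule's own entries."""
    resolved = dict(cluster_config.get("__default__", {}))
    resolved.update(cluster_config.get(rule, {}))
    return resolved


# The example clust.yaml above, written out as the dict a YAML parser would give.
cluster_config = {
    "__default__": {
        "partition": "normal",
        "time": "04:00:00",
        "mem": "16G",
        "ntasks": 1,
        "cpus-per-task": 4,
    },
    "vectorize": {"time": "01:00:00", "mem": "8G"},
}

vectorize_resources = resources_for("vectorize", cluster_config)
```

Under this convention, the vectorize rule requests one hour and 8G of memory while inheriting the partition, task count, and CPU count from the defaults.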
Required Parameters for Snekmer Search
The following parameters must be specified when running snekmer search.
| Parameter | Type | Description |
|---|---|---|
|  |  | Directory containing model object(s) (.model) |
|  |  | Directory containing k-mer basis set(s) (.kmers) |
|  |  | Directory containing scoring object(s) (.scorer) |
Learn/Apply Parameters
General parameters related to Snekmer’s learn and apply modes (snekmer learn, snekmer apply), wherein supervised models are trained via the workflow.
| Parameter | Type | Description |
|---|---|---|
|  |  | Save large optional output files containing all generated cosine similarity scores. |
|  |  | Weighting modifier for updating confidence when adding data to an existing k-mer count matrix. |
|  |  | Option to fragment training data, with multiple sub-options listed below. |
|  |  | Choose ‘absolute’ or ‘percent’. An absolute length of 50 would be 50 amino acids long. |
|  |  | Length of fragment. Depending on “version”, this is a percent or absolute length. |
|  |  | Minimum length of fragment that should be retained. Fragments shorter than this are discarded. |
|  |  | Choose ‘start’, ‘end’, or ‘random’. This is where on a sequence a fragment is taken from. |
|  |  | Choose any (random) seed for reproducible fragmentation. |
|  |  | The method for selecting an annotation: ‘top_hit’, ‘greatest_distance’, or ‘combined_distance’. |
|  |  | A family-specific threshold used for prediction filtering: None, ‘Median’, ‘Mean’, ‘90th Percentile’, etc. |
|  |  | When selection method is ‘combined_distance’, this is the weight for the top_hit method. |
|  |  | When selection method is ‘combined_distance’, this is the weight for the greatest_distance method. |
|  |  | The output name for the apply results in single file format. |
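The fragmentation sub-options in the table above combine as sketched below. This is an illustration under assumptions, not Snekmer's implementation: the function and argument names are hypothetical, a percent length is taken to be on a 0-100 scale, and a fragment shorter than the minimum is signalled by returning None.

```python
import random


def fragment(sequence, version, frag_length, min_length, location, rng):
    """Cut one fragment from a sequence, or return None if it is too short."""
    if version == "absolute":
        length = int(frag_length)
    elif version == "percent":
        length = int(len(sequence) * frag_length / 100)  # assumes 0-100 scale
    else:
        raise ValueError("version must be 'absolute' or 'percent'")

    length = min(length, len(sequence))  # cannot exceed the sequence itself
    if length < min_length:
        return None  # fragment too short to retain

    if location == "start":
        start = 0
    elif location == "end":
        start = len(sequence) - length
    elif location == "random":
        start = rng.randint(0, len(sequence) - length)  # inclusive bounds
    else:
        raise ValueError("location must be 'start', 'end', or 'random'")
    return sequence[start:start + length]


rng = random.Random(7)  # fixed seed so random fragments are reproducible
seq = "MKVLAAGIVG"
head = fragment(seq, "absolute", 4, 3, "start", rng)
tail = fragment(seq, "percent", 50, 3, "end", rng)
```

Passing a seeded `random.Random` instance, as opposed to using the module-level generator, keeps the fragmentation reproducible without affecting other random draws in the program.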