Setting up User Configuration
To run Snekmer, the user must specify parameters in a YAML configuration file (.yaml). A template config.yaml file is included in the resources directory.
The example YAML files included are:
config.yaml
: Configuration file for running Snekmer

clust.yaml
: (optional) Cluster configuration file for deploying Snekmer on a high-performance computing (HPC) cluster
Parameter Descriptions for config.yaml
The base config.yaml file is required in order to run snekmer model or snekmer cluster.
Required Parameters
Parameters that the user must specify in order to use Snekmer.
| Parameter | Type | Description |
|---|---|---|
| | | K-mer length |
| | | Reduced alphabet encoding (see Alphabets for more details). Alphabets may be specified by numbers 0-5 or by their names. |
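As a rough illustration of how a k-mer length and a reduced alphabet interact, the sketch below recodes a protein sequence into a smaller alphabet and then extracts overlapping k-mers. The mapping `TOY_ALPHABET` and the functions `reduce_sequence` and `kmerize` are hypothetical names invented for this example. They are not Snekmer's built-in alphabets or API; see Alphabets for the actual encodings.

```python
# Hypothetical two-symbol reduced alphabet, for illustration only.
# The grouping is arbitrary and is NOT one of Snekmer's alphabets.
TOY_ALPHABET = {
    "S": "AVLIMFWC",
    "P": "GPSTYNQDEKRH",
}


def reduce_sequence(sequence, alphabet):
    """Recode each residue as the symbol of the group that contains it."""
    lookup = {
        residue: symbol
        for symbol, residues in alphabet.items()
        for residue in residues
    }
    return "".join(lookup[residue] for residue in sequence)


def kmerize(sequence, k):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


reduced = reduce_sequence("MKVLA", TOY_ALPHABET)
kmers = kmerize(reduced, 3)
```

A smaller alphabet collapses many distinct amino acid k-mers into the same reduced k-mer, which is why the two parameters are usually chosen together.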
Input/Output Parameters
General parameters related to input and output sequences and/or files.
| Parameter | Type | Description |
|---|---|---|
| | | File extensions to be considered as valid for input sequence files |
| | | Regular expression for parsing family/annotation identifiers from filenames |
| | | If True, saves files into nested directory structure, i.e. {save_dir}/{alphabet}/{k} |
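The following sketch shows one way the three input/output settings could behave: filtering files by extension, extracting a family identifier from a filename with a regular expression, and building the nested output path {save_dir}/{alphabet}/{k}. All names and values here (`VALID_EXTENSIONS`, `FAMILY_REGEX`, `is_valid_input`, `parse_family`, `output_dir`) are assumptions made for illustration and do not reflect Snekmer's internal implementation or defaults.

```python
import re
from pathlib import Path

# Hypothetical example values; real values come from config.yaml.
VALID_EXTENSIONS = {"fasta", "fa", "faa"}
FAMILY_REGEX = r"[a-zA-Z0-9]+"  # assumed pattern, not a Snekmer default


def is_valid_input(filename):
    """Accept a file only if its extension is in the allowed set."""
    return Path(filename).suffix.lstrip(".").lower() in VALID_EXTENSIONS


def parse_family(filename, pattern=FAMILY_REGEX):
    """Return the first regex match in the filename stem, or None."""
    match = re.search(pattern, Path(filename).stem)
    return match.group(0) if match else None


def output_dir(save_dir, alphabet, k, nested):
    """Build {save_dir}/{alphabet}/{k} when nested, else just save_dir."""
    base = Path(save_dir)
    return base / str(alphabet) / str(k) if nested else base
```

With these assumed values, a file named `TIGR03149_seed.fasta` would be accepted and assigned the identifier `TIGR03149`, because the pattern stops at the underscore.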
Score Parameters
General parameters related to how Snekmer calculates family scores for k-mers.
| Parameter | Type | Description |
|---|---|---|
| | | Keyword arguments to pass to k-mer scaler object |
| | | If None, uses default kmer set for scaler. Otherwise, uses the ones specified. |
| | | Label name (e.g. |
Model Parameters
General parameters related to Snekmer’s model mode (snekmer model), wherein supervised models are trained via the workflow.
| Parameter | Type | Description |
|---|---|---|
| | | Number of cross-validation folds for model evaluation |
| | | Random state for model evaluation |
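To show why the fold count and random state matter together, here is a minimal, standard-library sketch of a seeded k-fold split. It is a plain, unstratified split written for illustration. The function name `kfold_indices` is hypothetical, and the splitter Snekmer actually uses is not described in this section.

```python
import random


def kfold_indices(n_samples, n_folds, random_state=None):
    """Return a list of (train_indices, test_indices) pairs.

    Indices are shuffled once with a seeded generator, so the same
    random_state always reproduces the same folds.
    """
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    # Deal shuffled indices round-robin into n_folds test folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = kfold_indices(n_samples=10, n_folds=5, random_state=42)
```

Fixing the random state makes evaluation results comparable across runs, since every run scores the model on identical held-out subsets.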
Cluster Parameters
General parameters related to Snekmer’s cluster mode (snekmer cluster), wherein unsupervised clusters are produced via the workflow.
| Parameter | Type | Description |
|---|---|---|
| | | Clustering method (options: |
| | | Parameters to pass to the clustering algorithm |
| | | If True, generates plots illustrating clustering results |
| | | Threshold for the minimum number of repetitions of a kmer within a set. Kmers that do not meet this threshold are discarded. |
| | | Threshold for the maximum number of repetitions of a kmer within a set. Kmers that exceed this threshold are discarded. |
| | | If True, saves distance matrices (BSF). Not recommended for large datasets. |
| | | Distance threshold for BSF matrix |
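The minimum and maximum repetition thresholds can be pictured as a simple count filter. The sketch below assumes that "repetitions" means the number of times a kmer occurs within the set and that both bounds are inclusive. Both points are assumptions, and the function name `filter_kmers_by_repetition` and its arguments are hypothetical rather than part of Snekmer.

```python
from collections import Counter


def filter_kmers_by_repetition(kmers, min_rep=None, max_rep=None):
    """Keep kmers whose occurrence count lies within [min_rep, max_rep].

    A bound set to None is not applied.
    """
    counts = Counter(kmers)
    kept = {}
    for kmer, n in counts.items():
        if min_rep is not None and n < min_rep:
            continue  # too rare: discarded
        if max_rep is not None and n > max_rep:
            continue  # too common: discarded
        kept[kmer] = n
    return kept


observed = ["AAA", "AAA", "AAA", "AAB", "ABB", "ABB"]
kept = filter_kmers_by_repetition(observed, min_rep=2, max_rep=2)
```

Filtering out very rare and very common kmers before clustering reduces the size of the feature space, which matters most for large datasets.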
Parameter Descriptions for clust.yaml
See SLURM documentation for more information on cluster parameters.
Required Parameters for Snekmer Search
The following parameters are required in your config file for snekmer search.
| Parameter | Type | Description |
|---|---|---|
| | | |
| | | |
| | | Directory containing model object(s) (.model) |
| | | Directory containing k-mer basis set(s) (.kmers) |
| | | Directory containing scoring object(s) (.scorer) |
| | | |
| | | |
| | | |
Learn/Apply Parameters
General parameters related to Snekmer’s learn and apply modes (snekmer learn, snekmer apply), wherein supervised models are trained via the workflow.
| Parameter | Type | Description |
|---|---|---|
| | | Save large optional output files containing all generated cosine similarity scores. |
| | | Weighting modifier for updating confidence when adding data to an existing kmer count matrix. |
| | | Option to fragment training data, with multiple sub-options listed below. |
| | | Choose ‘absolute’ or ‘percent’. For example, an absolute length of 50 yields fragments 50 amino acids long. |
| | | Length of fragment. Depending on “version”, this is a percent or absolute length. |
| | | Minimum length of fragment that should be retained. Fragments shorter than this are discarded. |
| | | Choose ‘start’, ‘end’, or ‘random’. Determines where on a sequence the fragment is taken from. |
| | | Choose any (random) seed for reproducible fragmentation. |
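The fragmentation sub-options can be read together as in the sketch below. It is only an interpretation of the descriptions above: the function `fragment` and its argument names are hypothetical, a percent length is assumed to be on a 0-100 scale, and an absolute length longer than the sequence is assumed to be clamped to the full sequence. Snekmer's actual behavior may differ on these points.

```python
import random


def fragment(sequence, version, length, min_length, location, rng):
    """Return one fragment of a sequence, or None if it is too short."""
    if version == "absolute":
        size = int(length)
    elif version == "percent":
        # Assumption: length is a percentage between 0 and 100.
        size = int(len(sequence) * length / 100)
    else:
        raise ValueError("version must be 'absolute' or 'percent'")

    # Assumption: a fragment cannot be longer than the sequence itself.
    size = min(size, len(sequence))
    if size < min_length:
        return None  # shorter than the minimum: discarded

    if location == "start":
        start = 0
    elif location == "end":
        start = len(sequence) - size
    elif location == "random":
        start = rng.randint(0, len(sequence) - size)
    else:
        raise ValueError("location must be 'start', 'end', or 'random'")
    return sequence[start:start + size]


rng = random.Random(7)  # a fixed seed makes 'random' fragments reproducible
piece = fragment("ABCDEFGHIJ", "percent", 50, 2, "start", rng)
```

Passing a seeded generator rather than calling the global `random` functions keeps fragmentation reproducible without affecting other random state in the program.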