Using Snekmer
=============

Snekmer has four modeling operations: ``cluster`` (unsupervised clustering),
``model`` (supervised modeling), ``search`` (application of model
to new sequences), and ``motif`` (feature selection). We will call the
first two modes **learning modes** due to their utility in learning
relationships between protein family input files. Users may choose a
mode to best suit their specific use case.

Snekmer also has two non-modeling operations: ``learn`` (kmer counts matrix
generation) and ``apply`` (cosine similarity between sequences and the kmer
counts matrix). The Learn/Apply pipeline can (and should) be used with large
training datasets to quickly predict annotations for new sequences.

The mode must be specified on the command line. For example, to run the
``model`` mode, call:

.. code-block:: bash

    snekmer model [--options]

In the `resources `_, an example configuration file is included:

- `config.yaml `_: Configuration file for snekmer execution.

To preview the files that a run will create, perform a dry run first:

.. code-block:: bash

    snekmer {mode} --dryrun

(For instance, in supervised mode, run ``snekmer model --dryrun``.)

The output of the dry run shows you the files that will be created by
the pipeline. If no files are generated, double-check that your
directory structure matches the format specified above.

When you are ready to process your files, run:

.. code-block:: bash

    snekmer {mode}

.. _usage-results:

Accessing Results
-----------------

Summary Reports
:::::::::::::::

Each step in the Snekmer modeling pipeline will generate a report in HTML format.
Users can find these reports, entitled **Snekmer_\_Report.html**, in the output directory.

Snekmer Output Files
::::::::::::::::::::

All operation modes will preprocess input files and kmerize sequences.
The associated output files can be found in the respective directories.

The following output directories and files will always be created:

.. code-block:: console

    .
    ├── input/
    │   ├── A.fasta
    │   └── B.fasta
    ├── output/
    │   ├── kmerize/
    │   │   ├── A.kmers     # kmer labels for A
    │   │   └── B.kmers     # kmer labels for B
    │   ├── vector/
    │   │   ├── A.npz       # sequences, sequence IDs, and kmer vectors for A
    │   │   └── B.npz       # sequences, sequence IDs, and kmer vectors for B
    │   ├── ...

Mode-Specific Output Files
--------------------------

The steps in the Snekmer pipeline generate their own associated output files.

Snekmer Cluster Output Files
::::::::::::::::::::::::::::

Snekmer's cluster mode produces the following output files and directories in addition to the files described previously.

.. code-block:: console

    .
    └── output/
        ├── ...
        └── cluster/
            ├── snekmer.csv     # Summary of clustering results
            └── figures/        # Clustering figures
                ├── pca_explained_variance_curve.png
                ├── tsne.png
                └── umap.png

Snekmer Model Output Files
::::::::::::::::::::::::::

Snekmer's model mode produces the following output files and directories in addition to the files described previously.

.. code-block:: console

    .
    ├── output/
    │   ├── ...
    │   ├── scoring/
    │   │   ├── A.matrix        # Similarity matrix for A seqs
    │   │   ├── B.matrix        # Similarity matrix for B seqs
    │   │   ├── A.scorer        # Object to apply A scoring model
    │   │   ├── B.scorer        # Object to apply B scoring model
    │   │   └── weights/
    │   │       ├── A.csv.gz    # Kmer score weights in A kmer space
    │   │       └── B.csv.gz    # Kmer score weights in B kmer space
    │   ├── model/
    │   │   ├── A.model         # (A/not A) classification model
    │   │   ├── B.model         # (B/not B) classification model
    │   │   ├── results/        # Cross-validation results tables
    │   │   │   ├── A.csv
    │   │   │   └── B.csv
    │   │   └── figures/        # Cross-validation results figures
    │   │       ├── A/
    │   │       └── B/

Snekmer Search Output Files
:::::::::::::::::::::::::::

The ``snekmer search`` mode assumes that the user has pre-generated family models using the ``snekmer model`` workflow, and thus operates as an independent workflow.
The location of the basis sets, scorers, and models must be specified in the configuration file (see the search params section in the provided `example `_).

For instance, say that the above output examples have already been produced. The user would then like to search a set of unknown sequences against the above families.

In a separate directory, the user should place files in an input directory with the appropriate YAML file. The assumed input file structure is as follows:

.. code-block:: console

    .
    ├── search.yaml
    ├── input/
    │   ├── unknown_1.fasta
    │   ├── unknown_2.fasta
    │   └── etc.
    ├── output/
    │   ├── ...
    │   └── ...

The user should then modify their configuration file to point towards the appropriate basis set, scorer, and model directories.

Executing ``snekmer search --configfile search.yaml`` produces the following output files and directories in addition to the files described previously.

.. code-block:: console

    .
    └── output/
        ├── kmers/
        │   └── common.basis    # Common kmer basis set for queried families
        └── search/
            ├── A               # A probabilities and predictions for unknown sequences
            │   ├── unknown_1.csv
            │   ├── unknown_2.csv
            │   └── ...
            └── B               # B probabilities and predictions for unknown sequences
                ├── unknown_1.csv
                ├── unknown_2.csv
                └── ...

Snekmer Learn Output Files
::::::::::::::::::::::::::

Snekmer's learn mode produces the following output files and directories in addition to the files described previously.

.. code-block:: console

    .
    ├── output/
    │   ├── kmerize/
    │   │   ├── A.kmers     # kmer labels for A
    │   │   └── B.kmers     # kmer labels for B
    │   ├── vector/
    │   │   ├── A.npz       # sequences, sequence IDs, and kmer vectors for A
    │   │   └── B.npz       # sequences, sequence IDs, and kmer vectors for B
    │   ├── vector_frag/
    │   │   ├── A.npz       # Conditional output for vector when the fragmentation option is True.
    │   │   └── B.npz       # Conditional output for vector when the fragmentation option is True.
    │   ├── learn/
    │   │   ├── kmer-counts-A.csv       # Kmer Counts matrix for A seqs
    │   │   ├── kmer-counts-B.csv       # Kmer Counts matrix for B seqs
    │   │   └── kmer-counts-total.csv   # Kmer Counts matrix for merged (total) database.
    │   ├── eval_apply_sequences/
    │   │   ├── seq-annotation-scores-A.model   # Self-assessed sequence-annotation cosine similarity scores for A seqs
    │   │   └── seq-annotation-scores-B.model   # Self-assessed sequence-annotation cosine similarity scores for B seqs
    │   ├── eval_apply_frag/
    │   │   ├── seq-annotation-scores-A.model   # Conditional output for eval_apply when the fragmentation option is True.
    │   │   └── seq-annotation-scores-B.model   # Conditional output for eval_apply when the fragmentation option is True.
    │   ├── eval_conf/
    │   │   ├── global-confidence-scores.csv    # Global confidence score distribution
    │   │   ├── confidence_matrix.csv           # Confidence distribution Matrix for each annotation
    │   │   ├── family_summary_stats.csv        # Statistics of Apply results for all reversed sequences
    │   │   └── family_stats_checkpoint.csv     # Checkpoint file containing statistics of Apply results for reversed sequences, used to update thresholds when adding new sequences to a family model
    │   ├── eval_apply_reversed/
    │   │   ├── seq-annotation-scores-A.csv.gz  # Self-assessed sequence-annotation cosine similarity scores for reversed A sequences
    │   │   └── seq-annotation-scores-B.csv.gz  # Self-assessed sequence-annotation cosine similarity scores for reversed B sequences
    │   ├── apply_inputs/
    │   │   ├── kmer-counts-total.csv
    │   │   ├── family_summary_stats.csv
    │   │   └── global-confidence-scores.csv

Snekmer Apply Output Files
::::::::::::::::::::::::::

Snekmer's apply mode produces the following output files and directories in addition to the files described previously.
Predictions are stored in the ``kmer-summary-x.csv`` files, which are 5-column CSV files that contain one line (and prediction) per sequence, along with the cosine similarity of each sequence to its predicted family, the difference between the top two scores for each sequence, and the confidence predicted from this difference.
The (optional and potentially very large) ``Seq-Annotation-Scores-x.csv`` files contain all of the cosine similarity scores calculated, with one row per sequence and one column for each family.

.. code-block:: console

    .
    ├── output/
    │   ├── ...
    │   ├── apply/
    │   │   ├── Seq-Annotation-Scores-C.csv     # (optional) Sequence-annotation cosine similarity scores for C seqs
    │   │   ├── Seq-Annotation-Scores-D.csv     # (optional) Sequence-annotation cosine similarity scores for D seqs
    │   │   ├── kmer-summary-C.csv              # Results with annotation predictions and confidence for C seqs
    │   │   └── kmer-summary-D.csv              # Results with annotation predictions and confidence for D seqs

Snekmer Motif Output Files
::::::::::::::::::::::::::

Snekmer's motif mode produces the following output files and directories in addition to the files described previously.

.. code-block:: console

    .
    ├── output/
    │   ├── ...
    │   ├── motif/
    │   │   ├── kmers/
    │   │   │   ├── A.csv       # kmers retained for A after recursive feature elimination
    │   │   │   └── B.csv       # kmers retained for B after recursive feature elimination
    │   │   ├── preselection/
    │   │   │   ├── A.csv       # kmer weights learned for A after recursive feature elimination
    │   │   │   ├── B.csv       # kmer weights learned for B after recursive feature elimination
    │   │   │   ├── A.model     # last (A/not A) classification model trained during RFE
    │   │   │   └── B.model     # last (B/not B) classification model trained during RFE
    │   │   ├── sequences/
    │   │   │   ├── A.csv       # Sequence vectors for A using the kmer subset retained after recursive feature elimination
    │   │   │   └── B.csv       # Sequence vectors for B using the kmer subset retained after recursive feature elimination
    │   │   ├── scores/
    │   │   │   ├── A.csv       # kmer weight learned for A on each permute/rescore iteration
    │   │   │   └── B.csv       # kmer weight learned for B on each permute/rescore iteration
    │   │   ├── p_values/
    │   │   │   ├── A.csv       # Tabulated results for A
    │   │   │   └── B.csv       # Tabulated results for B
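
Checking Output Files
:::::::::::::::::::::

Whatever the mode, a quick sanity check after a run is to confirm that the files every mode creates (``output/kmerize/`` and ``output/vector/``) exist for each input family. The Python sketch below is not part of Snekmer. The function names are hypothetical, and it assumes that input files use the ``.fasta`` extension shown in the examples above and that the family name is the file stem.

```python
from pathlib import Path


def expected_common_outputs(workdir):
    """List the kmerize/vector files expected for each input FASTA file.

    Assumption: inputs live in <workdir>/input and end in ".fasta", so that
    input/A.fasta corresponds to family "A".
    """
    workdir = Path(workdir)
    expected = []
    for fasta in sorted((workdir / "input").glob("*.fasta")):
        family = fasta.stem
        # kmer labels for the family
        expected.append(workdir / "output" / "kmerize" / f"{family}.kmers")
        # sequences, sequence IDs, and kmer vectors for the family
        expected.append(workdir / "output" / "vector" / f"{family}.npz")
    return expected


def missing_common_outputs(workdir):
    """Return the expected common output files that do not exist yet."""
    return [p for p in expected_common_outputs(workdir) if not p.exists()]
```

Calling ``missing_common_outputs(".")`` from the directory that contains ``input/`` and ``output/`` returns an empty list when every expected file is present, and otherwise returns the paths that are still missing. The same pattern could be extended to the mode-specific directories listed above.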