Using Snekmer
=============

Snekmer has four modeling operations: ``cluster`` (unsupervised clustering),
``model`` (supervised modeling), ``search`` (application
of model to new sequences), and ``motif`` (feature selection). We will call the first two modes
**learning modes** due to their utility in learning relationships
between protein family input files. Users may choose a mode to best
suit their specific use case.

Snekmer also has two non-modeling operations: ``learn`` (kmer counts matrix generation)
and ``apply`` (cosine similarity between sequences and the kmer counts matrix). The Learn/Apply
pipeline can (and should) be used with large training datasets to quickly predict
annotations for new sequences.

The mode must be specified on the command line. For example, to run
``model`` mode, call:

.. code-block:: bash

    snekmer model [--options]

In the `resources <https://github.com/PNNL-CompBio/Snekmer/tree/main/resources>`_,
an example configuration file is included:

  - `config.yaml <https://github.com/PNNL-CompBio/Snekmer/blob/main/resources/config.yaml>`_: Configuration file for Snekmer execution.

Before processing any files, you can perform a dry run to preview what the
workflow will do:

.. code-block:: bash

    snekmer {mode} --dryrun

(For instance, in supervised mode, run ``snekmer model --dryrun``.)

The output of the dry run shows you the files that will be created by the
pipeline. If no files are generated, double-check that your input files are
arranged as Snekmer expects, i.e. placed in an ``input/`` directory as shown
in the file trees below.

When you are ready to process your files, run:

.. code-block:: bash

    snekmer {mode}

.. _usage-results:

Accessing Results
-----------------

Summary Reports
:::::::::::::::

Each step in the Snekmer modeling pipeline will generate a report in HTML format.
Users can find these reports, entitled **Snekmer_\<MODE\>_Report.html**,
in the output directory.

Snekmer Model Output Files
::::::::::::::::::::::::::

All operation modes will preprocess input files and kmerize sequences.
The associated output files can be found in the respective directories.
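
As a rough illustration of what "kmerizing" a sequence means, the sketch below
counts overlapping kmers in a single string. It is a simplified stand-in under
stated assumptions, not Snekmer's implementation; the function name
``count_kmers`` is hypothetical, and no preprocessing of the sequence is shown.

.. code-block:: python

    from collections import Counter


    def count_kmers(sequence: str, k: int) -> Counter:
        """Count overlapping kmers of length k in one sequence (illustrative only)."""
        if k <= 0:
            raise ValueError("k must be a positive integer")
        # Slide a window of width k across the sequence, one residue at a time.
        # Sequences shorter than k yield an empty Counter.
        return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


    # "MKVLMKV" contains five overlapping 3-mers; "MKV" occurs twice.
    counts = count_kmers("MKVLMKV", k=3)
    print(counts)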

The following output directories and files will always be created:

.. code-block:: console

    .
    ├── input/
    │   ├── A.fasta
    │   └── B.fasta
    ├── output/
    │   ├── kmerize/
    │   │   ├── A.kmers  # kmer labels for A
    │   │   └── B.kmers  # kmer labels for B
    │   ├── vector/
    │   │   ├── A.npz    # sequences, sequence IDs, and kmer vectors for A
    │   │   └── B.npz    # sequences, sequence IDs, and kmer vectors for B
    │   ├── ...
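
The ``.npz`` files are NumPy archives, so their contents can be inspected with
``numpy.load``. The sketch below lists the arrays stored in an archive without
assuming their names, since the array keys are not documented here. The helper
name ``describe_npz`` is hypothetical, and ``allow_pickle=True`` is an
assumption made in case sequences or IDs are stored as object arrays; only use
it on files you trust.

.. code-block:: python

    import numpy as np


    def describe_npz(path):
        """Return {array_name: shape} for every array in an .npz archive."""
        # allow_pickle=True is an assumption (object arrays may be present);
        # never enable it for archives from untrusted sources.
        with np.load(path, allow_pickle=True) as archive:
            return {name: archive[name].shape for name in archive.files}

For example, ``describe_npz("output/vector/A.npz")`` would report which arrays
the file contains and their shapes.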

Mode-Specific Output Files
--------------------------

Each Snekmer mode also generates its own associated output files.

Snekmer Cluster Output Files
::::::::::::::::::::::::::::

Snekmer's cluster mode produces the following output files
and directories in addition to the files described previously.

.. code-block:: console

    .
    └── output/
        ├── ...
        └── cluster/
            ├── snekmer.csv     # Summary of clustering results
            └── figures/        # Clustering figures
                ├── pca_explained_variance_curve.png
                ├── tsne.png
                └── umap.png

Snekmer Model Output Files
::::::::::::::::::::::::::

Snekmer's model mode produces the following output files
and directories in addition to the files described previously.

.. code-block:: console

    .
    ├── output/
    │   ├── ...
    │   ├── scoring/
    │   │   ├── A.matrix    # Similarity matrix for A seqs
    │   │   ├── B.matrix    # Similarity matrix for B seqs
    │   │   ├── A.scorer    # Object to apply A scoring model
    │   │   ├── B.scorer    # Object to apply B scoring model
    │   │   └── weights/
    │   │       ├── A.csv.gz    # Kmer score weights in A kmer space
    │   │       └── B.csv.gz    # Kmer score weights in B kmer space
    │   ├── model/
    │   │   ├── A.model     # (A/not A) classification model
    │   │   ├── B.model     # (B/not B) classification model
    │   │   ├── results/    # Cross-validation results tables
    │   │   │   ├── A.csv
    │   │   │   └── B.csv
    │   │   └── figures/      # Cross-validation results figures
    │   │       ├── A/
    │   │       └── B/

Snekmer Search Output Files
:::::::::::::::::::::::::::

The ``snekmer search`` mode assumes that the user has pre-generated
family models using the ``snekmer model`` workflow, and thus operates
as an independent workflow. The location of the basis sets, scorers,
and models must be specified in the configuration file (see the search
params section in the provided
`example <https://github.com/PNNL-CompBio/Snekmer/blob/main/resources/config.yaml>`_).

For instance, say that the above output examples have already been
produced. The user would then like to search a set of unknown
sequences against the above families.

In a separate directory, the user should place files in an input
directory with the appropriate YAML file. The assumed input file
structure is as follows:

.. code-block:: console

    .
    ├── search.yaml
    ├── input/
    │   ├── unknown_1.fasta
    │   ├── unknown_2.fasta
    │   └── etc.
    ├── output/
    │   ├── ...
    │   └── ...

The user should then modify their configuration file to point towards
the appropriate basis set, scorer, and model directories.

Executing ``snekmer search --configfile search.yaml`` produces the
following output files and directories in addition to the files
described previously.

.. code-block:: console

    .
    └── output/
        ├── kmers/
        │   └── common.basis  # Common kmer basis set for queried families
        └── search/
            ├── A   # A probabilities and predictions for unknown sequences
            │   ├── unknown_1.csv
            │   ├── unknown_2.csv
            │   └── ...
            └── B   # B probabilities and predictions for unknown sequences
                ├── unknown_1.csv
                ├── unknown_2.csv
                └── ...  
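
Because search results are split into one directory per family and one CSV per
input file, it can be convenient to gather the file paths before loading them.
The following is a minimal sketch that relies only on the directory layout
shown above; the function name ``collect_search_results`` is hypothetical, and
the CSV column names are deliberately not assumed.

.. code-block:: python

    from pathlib import Path


    def collect_search_results(output_dir):
        """Map each family name to the sorted list of its per-input CSV files."""
        search_dir = Path(output_dir) / "search"
        results = {}
        # Each subdirectory of output/search/ corresponds to one family model.
        for family_dir in sorted(p for p in search_dir.iterdir() if p.is_dir()):
            results[family_dir.name] = sorted(family_dir.glob("*.csv"))
        return results

Calling ``collect_search_results("output")`` on the tree above would return a
dictionary with the keys ``"A"`` and ``"B"``, each holding the paths of the
``unknown_*.csv`` files for that family.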


Snekmer Learn Output Files
::::::::::::::::::::::::::

Snekmer's learn mode produces the following output files
and directories in addition to the files described previously.

.. code-block:: console

    .
    ├── output/
    │   ├── kmerize/
    │   │   ├── A.kmers  # kmer labels for A
    │   │   └── B.kmers  # kmer labels for B
    │   ├── vector/
    │   │   ├── A.npz    # sequences, sequence IDs, and kmer vectors for A
    │   │   └── B.npz    # sequences, sequence IDs, and kmer vectors for B
    │   ├── vector_frag/ 
    │   │   ├── A.npz    # Conditional output for vector when the fragmentation option is True.
    │   │   └── B.npz    # Conditional output for vector when the fragmentation option is True.
    │   ├── learn/
    │   │   ├── kmer-counts-A.csv    # Kmer Counts matrix for A seqs
    │   │   ├── kmer-counts-B.csv     # Kmer Counts matrix for B seqs
    │   │   └── kmer-counts-total.csv    # Kmer Counts matrix for merged (total) database.
    │   ├── eval_apply_sequences/
    │   │   ├── seq-annotation-scores-A.model     # Self-assessed sequence-annotation cosine similarity scores for A seqs
    │   │   └── seq-annotation-scores-B.model     # Self-assessed sequence-annotation cosine similarity scores for B seqs
    │   ├── eval_apply_frag/
    │   │   ├── seq-annotation-scores-A.model     # Conditional output for eval_apply when the fragmentation option is True.
    │   │   └── seq-annotation-scores-B.model     # Conditional output for eval_apply when the fragmentation option is True.
    │   ├── eval_conf/
    │   │   ├── global-confidence-scores.csv     # Global confidence score distribution
    │   │   ├── confidence_matrix.csv   # Confidence distribution matrix for each annotation
    │   │   ├── family_summary_stats.csv # Statistics of Apply results for all reversed sequences
    │   │   └── family_stats_checkpoint.csv # Checkpoint file containing statistics of Apply results for reversed sequences, used to update thresholds when adding new sequences to a family model
    │   ├── eval_apply_reversed/ 
    │   │   ├── seq-annotation-scores-A.csv.gz # Self-assessed sequence-annotation cosine similarity scores for reversed A sequences
    │   │   └── seq-annotation-scores-B.csv.gz # Self-assessed sequence-annotation cosine similarity scores for reversed B sequences
    │   ├── apply_inputs/
    │   │   ├── kmer-counts-total.csv 
    │   │   ├── family_summary_stats.csv
    │   │   └── global-confidence-scores.csv

Snekmer Apply Output Files
::::::::::::::::::::::::::

Snekmer's apply mode produces the following output files
and directories in addition to the files described previously.
Predictions are stored in the ``kmer-summary-x.csv`` files, where ``x`` stands
for the input name (e.g. ``C``). Each is a five-column CSV file with one row
(and one prediction) per sequence, along with the cosine similarity of the
sequence to its predicted family, the difference between the sequence's top two
scores, and the confidence predicted from that difference.

The optional (and potentially very large) ``Seq-Annotation-Scores-x.csv`` files
contain every cosine similarity score calculated, with one row per sequence and
one column per family.

.. code-block:: console

    .
    ├── output/
    │   ├── ...
    │   ├── apply/
    │   │   ├── Seq-Annotation-Scores-C.csv  # (optional) Sequence-annotation cosine similarity scores for C seqs
    │   │   ├── Seq-Annotation-Scores-D.csv  # (optional) Sequence-annotation cosine similarity scores for D seqs
    │   │   ├── kmer-summary-C.csv  # Results with annotation predictions and confidence for C seqs 
    │   │   └── kmer-summary-D.csv  # Results with annotation predictions and confidence for D seqs 
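
To make the quantities in the summary files concrete, the sketch below scores
one sequence's kmer count vector against a small matrix of family kmer counts
using cosine similarity, then reports the best-scoring family, its score, and
the gap to the runner-up. This is an illustration under stated assumptions,
not Snekmer's code: the function name ``predict_family`` and the toy inputs are
hypothetical, and the mapping from the score difference to a confidence value
is omitted.

.. code-block:: python

    import numpy as np


    def predict_family(seq_vector, family_matrix, family_names):
        """Return (best family, cosine score, top-two score difference)."""
        seq = np.asarray(seq_vector, dtype=float)
        fams = np.asarray(family_matrix, dtype=float)
        if fams.shape[0] < 2:
            raise ValueError("need at least two families to compare the top two scores")
        # Cosine similarity: dot product divided by the product of the norms.
        norms = np.linalg.norm(fams, axis=1) * np.linalg.norm(seq)
        # Rows with a zero norm get a score of 0 instead of dividing by zero.
        scores = np.divide(fams @ seq, norms, out=np.zeros(len(fams)), where=norms > 0)
        order = np.argsort(scores)[::-1]  # family indices, best score first
        top, runner_up = scores[order[0]], scores[order[1]]
        return family_names[order[0]], float(top), float(top - runner_up)


    # Toy example: the sequence shares its kmers with family "A" only.
    name, score, delta = predict_family(
        [1, 0, 1],
        [[2, 0, 2], [0, 3, 0]],
        ["A", "B"],
    )
    print(name, score, delta)

A large difference between the top two scores indicates that one family
clearly dominates, which is why the summary files report it alongside the
prediction.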

Snekmer Motif Output Files
::::::::::::::::::::::::::

Snekmer's motif mode produces the following output files and directories in addition to the files described previously.

.. code-block:: console

    .
    ├── output/
    │   ├── ...
    │   ├── motif/
    │   │   ├── kmers/
    │   │   │   ├── A.csv  # kmers retained for A after recursive feature elimination
    │   │   │   └── B.csv  # kmers retained for B after recursive feature elimination
    │   │   ├── preselection/
    │   │   │   ├── A.csv  # kmer weights learned for A after recursive feature elimination
    │   │   │   ├── B.csv  # kmer weights learned for B after recursive feature elimination
    │   │   │   ├── A.model  # last (A/not A) classification model trained during RFE
    │   │   │   └── B.model  # last (B/not B) classification model trained during RFE
    │   │   ├── sequences/
    │   │   │   ├── A.csv  # Sequence vectors for A using the kmer subset retained after recursive feature elimination
    │   │   │   └── B.csv  # Sequence vectors for B using the kmer subset retained after recursive feature elimination
    │   │   ├── scores/
    │   │   │   ├── A.csv  # kmer weight learned for A on each permute/rescore iteration
    │   │   │   └── B.csv  # kmer weight learned for B on each permute/rescore iteration
    │   │   ├── p_values/
    │   │   │   ├── A.csv  # Tabulated results for A
    │   │   │   └── B.csv  # Tabulated results for B