Snekmer: Reduced K-mer Encoding for Protein Sequences

Snekmer is a software package for protein sequence annotation and analysis using amino acid reduction (AAR) combined with k-mer representations. Given a set of annotated training sequences, Snekmer builds family-specific k-mer profiles and uses cosine similarity to predict annotations for new sequences with calibrated confidence scores. Snekmer also supports unsupervised clustering and supervised machine learning models.
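To make the idea of amino acid reduction plus k-mer counting concrete, here is a minimal sketch in plain Python. It is not Snekmer's implementation: the `REDUCTION` map, the reduced symbols, and the function names `reduce_sequence` and `kmer_counts` are all assumptions made for illustration, and Snekmer's own reduced alphabets may group residues differently.

```python
from collections import Counter

# Hypothetical reduction map (an illustrative assumption, not one of
# Snekmer's built-in alphabets): 20 amino acids collapse to 5 symbols.
REDUCTION = {
    **dict.fromkeys("AVLIMC", "H"),  # hydrophobic
    **dict.fromkeys("FWY", "R"),     # aromatic
    **dict.fromkeys("STNQGP", "P"),  # polar / small
    **dict.fromkeys("DE", "N"),      # negatively charged
    **dict.fromkeys("KRH", "C"),     # positively charged
}


def reduce_sequence(seq):
    """Map each residue to its reduced symbol; unknown residues become 'X'."""
    return "".join(REDUCTION.get(res, "X") for res in seq.upper())


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in the reduced form of a protein sequence."""
    reduced = reduce_sequence(seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


if __name__ == "__main__":
    print(reduce_sequence("MKVL"))
    print(kmer_counts("MKVL", k=2))
```

Reducing the alphabet before counting shrinks the space of possible k-mers (5^k instead of 20^k in this sketch), so sequences that differ only by similar residues share more k-mers.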

Annotation Pipeline (Learn/Apply)

The primary use case for Snekmer is sequence annotation via the Learn/Apply pipeline.

New to Snekmer? Start here:

  1. Install Snekmer: set up a Python virtual environment and install the package.

  2. Run snekmer easy with your training sequences, query sequences, and annotation file:

snekmer easy \
    --train  path/to/training_sequences/ \
    --query  path/to/query_sequences.fasta \
    --ann    path/to/annotations.ann \
    --output results/

easy runs the complete pipeline from a single command with no directory setup or config file required. See the easy tutorial to get started with the included demo data.

Snekmer workflow overview

easy: Guided front-end that runs Learn then Apply end-to-end. Handles workspace setup, annotation generation (from a file or from FASTA headers with --create-ann), and the handoff between pipeline stages automatically.

Learn (advanced): Builds a k-mer association matrix and confidence model from annotated training sequences. Produces three outputs used by Apply: a cumulative k-mer counts matrix, family-level score thresholds, and a global confidence distribution.

Apply (advanced): Scores query sequences against the outputs from Learn using cosine similarity. Produces a prediction table with family assignments and calibrated confidence scores.
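The cumulative k-mer counts matrix produced by Learn can be pictured as one row per family and one column per k-mer. The sketch below builds such a table from (family, sequence) pairs using only the standard library. It assumes the sequences are already in reduced form. The names `build_family_counts` and `as_matrix` are hypothetical, and the sketch does not reproduce Snekmer's score thresholds or confidence distribution.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_family_counts(records, k=2):
    """Sum k-mer counts per family.

    records: iterable of (family, sequence) pairs (hypothetical input shape).
    Returns a dict mapping family name to a Counter of k-mers.
    """
    counts = defaultdict(Counter)
    for family, seq in records:
        counts[family].update(kmers(seq, k))
    return dict(counts)


def as_matrix(family_counts):
    """Lay the counts out as rows (families) by columns (sorted k-mers)."""
    vocab = sorted({km for c in family_counts.values() for km in c})
    families = sorted(family_counts)
    matrix = [[family_counts[f][km] for km in vocab] for f in families]
    return families, vocab, matrix


if __name__ == "__main__":
    records = [("famA", "HCHH"), ("famA", "HCH"), ("famB", "PPN")]
    families, vocab, matrix = as_matrix(build_family_counts(records))
    for fam, row in zip(families, matrix):
        print(fam, dict(zip(vocab, row)))
```

Because the counts are simple sums, new training sequences can be folded into an existing table by adding their k-mer counts to the matching family row.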

Use learn and apply directly when incrementally updating an existing model or when fine-grained control over intermediate pipeline steps is needed.
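The cosine-similarity scoring that Apply performs can be illustrated with a short, self-contained sketch. It compares a query's k-mer counts against per-family count profiles and reports the best match. The function names and the toy profiles are assumptions for illustration, and the confidence calibration step that Apply performs is not modeled here.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse k-mer count vectors (dict-like)."""
    dot = sum(a[km] * b[km] for km in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def best_family(query, profiles):
    """Return (family, score) for the profile most similar to the query."""
    scores = {fam: cosine_similarity(query, prof) for fam, prof in profiles.items()}
    top = max(scores, key=scores.get)
    return top, scores[top]


if __name__ == "__main__":
    # Toy family profiles over reduced-alphabet 2-mers (made-up values).
    profiles = {
        "famA": Counter({"HC": 2, "CH": 2, "HH": 1}),
        "famB": Counter({"PP": 1, "PN": 1}),
    }
    query = Counter({"HC": 1, "CH": 1})
    print(best_family(query, profiles))
```

Cosine similarity depends only on the direction of the count vectors, so a short query and a long family profile can still score highly when their k-mer proportions agree.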

Additional Modes

Cluster: Unsupervised clustering of sequences based on k-mer profiles. Outputs a cluster assignment table (CSV) and optional figures (t-SNE, UMAP, PCA).

Model: Trains supervised (one-vs-rest) machine learning models from annotated sequences. Outputs model objects (.model) and K-fold cross-validation figures (AUC ROC, PR AUC).

Search: Scores unknown sequences against models produced by snekmer model. Outputs per-family annotation probability tables.