Snekmer: Reduced K-mer Encoding for Protein Sequences
Snekmer is a software package for protein sequence annotation and analysis using amino acid reduction (AAR) combined with k-mer representations. Given a set of annotated training sequences, Snekmer builds family-specific k-mer profiles and uses cosine similarity to predict annotations for new sequences with calibrated confidence scores. Snekmer also supports unsupervised clustering and supervised machine learning models.
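To make the idea of reduced k-mer encoding concrete, here is a minimal sketch in plain Python. It is not Snekmer's implementation: the two-letter hydrophobic/polar alphabet and the function names `reduce_sequence` and `kmer_counts` are assumptions chosen for illustration only, and Snekmer's own reduction alphabets may differ.

```python
from collections import Counter

# Hypothetical two-letter reduction (illustration only, not a Snekmer alphabet):
# hydrophobic residues map to "H", every other residue maps to "P".
HYDROPHOBIC = set("AVLIMFWC")


def reduce_sequence(seq):
    """Map each amino acid to its reduced-alphabet symbol."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in seq.upper())


def kmer_counts(seq, k):
    """Count overlapping k-mers in the reduced form of a sequence."""
    reduced = reduce_sequence(seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


counts = kmer_counts("MKVLAAGG", k=3)
print(reduce_sequence("MKVLAAGG"))
print(dict(counts))
```

Reducing the alphabet shrinks the space of possible k-mers (here from 20^k to 2^k), which is what lets sequences with similar residue chemistry share k-mers.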
Annotation Pipeline (Learn/Apply)
The primary use case for Snekmer is sequence annotation via the Learn/Apply pipeline.
New to Snekmer? Start here:
Install Snekmer: set up a Python virtual environment and install the package.
Run easy with your training sequences, query sequences, and annotation file:
snekmer easy \
--train path/to/training_sequences/ \
--query path/to/query_sequences.fasta \
--ann path/to/annotations.ann \
--output results/
easy runs the complete pipeline from a single command with no directory setup
or config file required. See the easy tutorial
to get started with the included demo data.
easy: Guided front-end that runs Learn then Apply end-to-end. Handles workspace
setup, annotation generation (from a file or from FASTA headers with --create-ann), and
the handoff between pipeline stages automatically.
Learn (advanced): Builds a k-mer association matrix and confidence model from annotated training sequences. Produces three outputs used by Apply: a cumulative k-mer counts matrix, family-level score thresholds, and a global confidence distribution.
Apply (advanced): Scores query sequences against the outputs from Learn using cosine similarity. Produces a prediction table with family assignments and calibrated confidence scores.
Use learn and apply directly when incrementally updating an existing model or when
fine-grained control over intermediate pipeline steps is needed.
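The cosine-similarity scoring that Apply performs can be sketched roughly as follows. This is an illustrative sketch, not Snekmer's code: the family names, the toy k-mer counts, and the helpers `cosine_similarity` and `best_family` are all hypothetical, and the family-level thresholds and confidence calibration produced by Learn are deliberately left out.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)


def best_family(query, profiles):
    """Return the family whose profile is most similar to the query."""
    scores = {fam: cosine_similarity(query, prof) for fam, prof in profiles.items()}
    top = max(scores, key=scores.get)
    return top, scores[top]


# Toy family profiles (hypothetical counts, for illustration only).
profiles = {
    "family_A": Counter({"HHH": 4, "HPH": 1}),
    "family_B": Counter({"PPP": 5, "HPP": 2}),
}
query = Counter({"HHH": 2, "HPH": 1, "HPP": 1})

family, score = best_family(query, profiles)
print(family, round(score, 3))
```

In this sketch the raw similarity is returned directly; in the pipeline described above, Apply additionally converts scores into calibrated confidence values using the outputs from Learn.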
Additional Modes
Cluster: Unsupervised clustering of sequences based on k-mer profiles. Outputs a cluster assignment table (CSV) and optional figures (t-SNE, UMAP, PCA).
Model: Trains supervised (one-vs-rest) machine learning models from annotated sequences. Outputs model objects (.model) and K-fold cross-validation figures (AUC ROC, PR AUC).
Search: Scores unknown sequences against models produced by snekmer model. Outputs
per-family annotation probability tables.