Quick Start
Once Snekmer is installed, you can annotate sequences with a single command; no configuration file or directory setup is required.
Try it with the included demo data
The repository includes demo data (5,000 training proteins across 200 TIGRFAM families and 3,000 query proteins), so you can run a complete example immediately after installing. From the root of the Snekmer repository:
snekmer easy \
--train resources/demo_sequences/learn_apply_inputs/learn \
--query resources/demo_sequences/learn_apply_inputs/apply/test_sequences_1.fasta \
--ann resources/demo_sequences/learn_apply_inputs/annotations/TIGRFAMs_annotation.ann \
--output my_first_results
Results will be written to my_first_results/apply/snekmer_results.csv.
See the tutorial for a full walkthrough of this demo, including output interpretation.
What you need
Three inputs:
Training sequences: a FASTA file or directory of FASTA files containing proteins with known family assignments.
Query sequences: a FASTA file or directory of FASTA files to annotate.
Annotations: one of:
--ann PATH: a tab-separated file mapping sequence IDs to family labels:
    id            family
    A0A2D0MWR0    TIGR04183
    A0A1Y4R5C6    TIGR00722
The id column must match the accession in your training FASTA headers. For UniProt-style headers (>db|ACCESSION|name ...), the accession is the field between the first pair of | characters.
Family labels can be any string: database accessions (e.g. TIGR04183), descriptive names (e.g. nitrogenase), or numbers. Labels are case-sensitive.
--create-ann: generate annotations automatically from training FASTA headers. Requires headers in the format >db|FAMILY_LABEL|seqid ..., where the field between the first | pair is used as the family label.
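The header rule above can be checked before running Snekmer. The following is a minimal sketch, not part of Snekmer itself: the helper names accession_from_header and write_ann are hypothetical, the header text after the second | is a placeholder, and the fallback for headers without a | pair is an assumption rather than documented Snekmer behavior.

```python
import csv
import io


def accession_from_header(header: str) -> str:
    """Return the field between the first pair of '|' characters.

    Assumption: headers without a '|' pair fall back to the first
    whitespace-delimited token. Snekmer's own handling may differ.
    """
    text = header.lstrip(">").strip()
    parts = text.split("|")
    if len(parts) >= 3:
        return parts[1]
    return text.split()[0] if text else ""


def write_ann(rows, handle) -> None:
    """Write (id, family) pairs as a tab-separated file with an 'id family' header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "family"])
    writer.writerows(rows)


# Placeholder headers built around the two accessions shown above.
headers = [
    ">tr|A0A2D0MWR0|placeholder_name description",
    ">tr|A0A1Y4R5C6|placeholder_name description",
]
families = {"A0A2D0MWR0": "TIGR04183", "A0A1Y4R5C6": "TIGR00722"}

# Pair each extracted accession with its family label.
rows = [(acc, families[acc]) for acc in map(accession_from_header, headers)]

buffer = io.StringIO()
write_ann(rows, buffer)
ann_text = buffer.getvalue()
print(ann_text)
```

To write a real file, pass a handle opened with open(path, "w", newline="") in place of the in-memory buffer. With --create-ann, the same field position is read as the family label instead of the accession.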
Run
snekmer easy \
--train path/to/training_sequences/ \
--query path/to/query_sequences.fasta \
--ann path/to/annotations.ann \
--output results/
Or, if your FASTA headers encode family labels (>db|FAMILY|seqid ...):
snekmer easy \
--train path/to/training_sequences/ \
--query path/to/query_sequences.fasta \
--create-ann \
--output results/
If a flag is omitted, Snekmer prompts for the missing input.
Results
The main output is a CSV file at results/apply/snekmer_results.csv:
Sequence            Prediction  Score  delta  Confidence
tr|A0A427BXE3|...   TIGR01023   0.198  0.08   1.000
tr|A0A401ZGP4|...   TIGR00757   0.316  0.02   0.922
...
Prediction: predicted family (highest cosine similarity to training profiles)
Score: cosine similarity to the predicted family
delta: gap between top and second-best score (larger = more certain)
Confidence: calibrated probability the prediction is correct (0–1)
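To make the Score and delta columns concrete, here is a toy illustration of the definitions above. It is a sketch under stated assumptions, not Snekmer's implementation: the three-element vectors and the names family_A and family_B are made up, and the Confidence calibration is not reproduced.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict(query, profiles):
    """Return (family, score, delta) for a query vector.

    score is the highest cosine similarity; delta is the gap to the
    second-best family. Requires at least two family profiles.
    """
    ranked = sorted(
        ((cosine(query, vec), fam) for fam, vec in profiles.items()),
        reverse=True,
    )
    (best, family), (second, _) = ranked[0], ranked[1]
    return family, best, best - second


# Hypothetical family profiles and query (stand-ins for k-mer vectors).
profiles = {"family_A": [1, 0, 0], "family_B": [0, 1, 0]}
query = [3, 4, 0]

family, score, delta = predict(query, profiles)
print(family, score, delta)
```

Here the query is closer to family_B than to family_A, so family_B is the prediction and delta is the difference between the two similarities.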
A Confidence of 0.95 or higher is a reliable starting threshold for high-quality annotations.
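Applying that threshold takes only a few lines of standard-library Python. This sketch assumes the results file is comma-delimited with a header row using the column names shown above; the function name high_confidence is hypothetical, and the sample rows are the two from the table, with their elided IDs left as-is.

```python
import csv
import io


def high_confidence(handle, threshold=0.95):
    """Return the rows whose Confidence is at or above the threshold."""
    reader = csv.DictReader(handle)
    return [row for row in reader if float(row["Confidence"]) >= threshold]


# In-memory stand-in for results/apply/snekmer_results.csv (assumed layout).
sample = io.StringIO(
    "Sequence,Prediction,Score,delta,Confidence\n"
    "tr|A0A427BXE3|...,TIGR01023,0.198,0.08,1.000\n"
    "tr|A0A401ZGP4|...,TIGR00757,0.316,0.02,0.922\n"
)

kept = high_confidence(sample)
for row in kept:
    print(row["Sequence"], row["Prediction"], row["Confidence"])
```

For a real run, replace the in-memory sample with open("results/apply/snekmer_results.csv", newline=""). Lowering the threshold keeps more annotations at the cost of more incorrect ones.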
Next steps
easy tutorial: full walkthrough with demo data, output interpretation, and post-hoc evaluation.
Configuration: tune k-mer length, alphabet, scoring thresholds, and other parameters.
Usage: full reference for all Snekmer modes.