Quick Start

Once Snekmer is installed, you can annotate sequences with a single command. No configuration file or directory setup is required.

Try it with the included demo data

The repository includes demo data (5,000 training proteins across 200 TIGRFAM families and 3,000 query proteins), so you can run a complete example immediately after installing. From the root of the Snekmer repository:

snekmer easy \
    --train  resources/demo_sequences/learn_apply_inputs/learn \
    --query  resources/demo_sequences/learn_apply_inputs/apply/test_sequences_1.fasta \
    --ann    resources/demo_sequences/learn_apply_inputs/annotations/TIGRFAMs_annotation.ann \
    --output my_first_results

Results are written to my_first_results/apply/snekmer_results.csv. See the tutorial for a full walkthrough of this demo, including how to interpret the output.

What you need

Three inputs:

Training sequences: a FASTA file or directory of FASTA files containing proteins with known family assignments.
Query sequences: a FASTA file or directory of FASTA files to annotate.
Annotations: one of:
- --ann PATH: a tab-separated file mapping sequence IDs to family labels:
```
id              family
A0A2D0MWR0      TIGR04183
A0A1Y4R5C6      TIGR00722
```
  The id column must match the accession in your training FASTA headers. For UniProt-style headers (>db|ACCESSION|name ...), the accession is the field between the first pair of | characters.
  
  Family labels can be any string: database accessions (e.g. TIGR04183), descriptive names (e.g. nitrogenase), or numbers. Labels are case-sensitive.
- --create-ann: generate annotations automatically from training FASTA headers. Requires headers in the format >db|FAMILY_LABEL|seqid ..., where the field between the first | pair is used as the family label.
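
Before running with --ann, it can help to check that every training sequence has a matching row in the annotation file. The following is a minimal sketch, not part of Snekmer: the function names are hypothetical, it assumes the annotation file has the id/family header row shown above, and it falls back to the first whitespace-delimited token for headers that are not UniProt-style. The accession UNLISTED01 is a made-up placeholder.

```python
import csv
import io


def fasta_accessions(fasta_text):
    """Yield one accession per FASTA header line.

    For UniProt-style headers (>db|ACCESSION|name ...) this is the field
    between the first pair of | characters. For other headers the first
    whitespace-delimited token is used (an assumption of this sketch).
    """
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split("|")
        if len(fields) >= 3:
            yield fields[1].strip()
        else:
            parts = line[1:].split()
            yield parts[0] if parts else ""


def annotated_ids(ann_text):
    """Map sequence id -> family label from a tab-separated annotation table."""
    reader = csv.DictReader(io.StringIO(ann_text), delimiter="\t")
    return {row["id"]: row["family"] for row in reader}


def missing_annotations(fasta_text, ann_text):
    """Return accessions found in the FASTA text but absent from the annotations."""
    labels = annotated_ids(ann_text)
    return [acc for acc in fasta_accessions(fasta_text) if acc not in labels]


# Illustrative input only; UNLISTED01 is a placeholder accession.
fasta = (
    ">tr|A0A2D0MWR0|example_1\nMKT\n"
    ">tr|A0A1Y4R5C6|example_2\nMAA\n"
    ">tr|UNLISTED01|example_3\nMCC\n"
)
ann = "id\tfamily\nA0A2D0MWR0\tTIGR04183\nA0A1Y4R5C6\tTIGR00722\n"

print(missing_annotations(fasta, ann))
```

For real files, pass the contents read with open(path).read() in place of the inline strings. An empty list means every training header has an annotation row.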

Run

snekmer easy \
    --train  path/to/training_sequences/ \
    --query  path/to/query_sequences.fasta \
    --ann    path/to/annotations.ann \
    --output results/

Or, if your FASTA headers encode family labels (>db|FAMILY|seqid ...):

snekmer easy \
    --train  path/to/training_sequences/ \
    --query  path/to/query_sequences.fasta \
    --create-ann \
    --output results/
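
To preview which family labels your headers would yield before relying on --create-ann, a short script can tally the field between the first pair of | characters and flag headers that do not fit the >db|FAMILY|seqid ... pattern. This is a rough sketch under that stated header format. It does not reproduce Snekmer's own parsing, and preview_header_labels is a hypothetical helper name.

```python
from collections import Counter


def preview_header_labels(fasta_text):
    """Count family labels taken from >db|FAMILY|seqid ... headers.

    Returns (counts, malformed), where malformed lists header lines that
    lack three |-separated fields or have an empty label field.
    """
    counts = Counter()
    malformed = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split("|")
        if len(fields) >= 3 and fields[1].strip():
            counts[fields[1].strip()] += 1
        else:
            malformed.append(line)
    return counts, malformed


# Illustrative input only; sequence ids are placeholders.
fasta = (
    ">db|TIGR04183|seq1 example\nMKT\n"
    ">db|TIGR04183|seq2\nMAA\n"
    ">db|nitrogenase|seq3\nMCC\n"
    ">seq4 header without pipes\nMDD\n"
)

counts, malformed = preview_header_labels(fasta)
print(dict(counts))
print(malformed)
```

Because labels are case-sensitive, the tally also makes accidental variants such as Nitrogenase and nitrogenase easy to spot.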

If any of these flags are omitted, Snekmer prompts for the missing inputs.

Results

The main output is a CSV file at results/apply/snekmer_results.csv:

Sequence              Prediction   Score   delta   Confidence
tr|A0A427BXE3|...     TIGR01023    0.198   0.08    1.000
tr|A0A401ZGP4|...     TIGR00757    0.316   0.02    0.922
...

Prediction: predicted family (highest cosine similarity to training profiles)
Score: cosine similarity to the predicted family
delta: gap between top and second-best score (larger = more certain)
Confidence: calibrated probability the prediction is correct (0–1)
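
To make the Score and delta columns concrete, here is a toy sketch of cosine-similarity ranking. It is an illustration of the idea only, not Snekmer's implementation: the three-element vectors stand in for k-mer profiles, the family names famA and famB are hypothetical, and the Confidence calibration is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(query, profiles):
    """Return (family, score, delta) for the best-matching family profile.

    score is the highest cosine similarity; delta is the gap between the
    highest and second-highest similarity.
    """
    ranked = sorted(
        ((cosine(query, vec), family) for family, vec in profiles.items()),
        reverse=True,
    )
    top_score, top_family = ranked[0]
    delta = top_score - ranked[1][0] if len(ranked) > 1 else top_score
    return top_family, top_score, delta


# Toy profiles; real profiles would have one entry per k-mer.
profiles = {"famA": [1, 0, 0], "famB": [0, 1, 0]}
family, score, delta = predict([3, 4, 0], profiles)
print(family, f"{score:.3f}", f"{delta:.3f}")  # famB 0.800 0.200
```

A small delta means the runner-up family scored almost as well as the winner, which is why larger values indicate a more certain call.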

Keeping only predictions with Confidence ≥ 0.95 is a reliable starting threshold for high-quality annotations.
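
Applying a confidence threshold takes a few lines with the standard library. This sketch assumes the results file is comma-delimited with the column names shown above; high_confidence is a hypothetical helper, and the inline rows simply repeat the example table (with its sequence names left truncated).

```python
import csv
import io


def high_confidence(csv_text, threshold=0.95):
    """Return result rows whose Confidence is at or above the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row["Confidence"]) >= threshold]


# Rows copied from the example table above.
example = (
    "Sequence,Prediction,Score,delta,Confidence\n"
    "tr|A0A427BXE3|...,TIGR01023,0.198,0.08,1.000\n"
    "tr|A0A401ZGP4|...,TIGR00757,0.316,0.02,0.922\n"
)

for row in high_confidence(example):
    print(row["Sequence"], row["Prediction"], row["Confidence"])
```

For a real run, read the file first, for example with open("results/apply/snekmer_results.csv").read(), and pass the text to high_confidence. Lowering the threshold argument keeps more predictions at the cost of more errors.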

Next steps

easy tutorial: full walkthrough with demo data, output interpretation, and post-hoc evaluation.
Configuration: tune k-mer length, alphabet, scoring thresholds, and other parameters.
Usage: full reference for all Snekmer modes.