Quick Start

Once Snekmer is installed, you can annotate sequences with a single command. No configuration file or directory setup is required.

Try it with the included demo data

The repository includes demo data (5,000 training proteins across 200 TIGRFAM families and 3,000 query proteins), so you can run a complete example immediately after installing. From the root of the Snekmer repository:

snekmer easy \
    --train  resources/demo_sequences/learn_apply_inputs/learn \
    --query  resources/demo_sequences/learn_apply_inputs/apply/test_sequences_1.fasta \
    --ann    resources/demo_sequences/learn_apply_inputs/annotations/TIGRFAMs_annotation.ann \
    --output my_first_results

Results are written to my_first_results/apply/snekmer_results.csv. See the tutorial for a full walkthrough of this demo, including how to interpret the output.


What you need

Three inputs:

  1. Training sequences: a FASTA file or directory of FASTA files containing proteins with known family assignments.

  2. Query sequences: a FASTA file or directory of FASTA files to annotate.

  3. Annotations: one of:

    • --ann PATH: a tab-separated file mapping sequence IDs to family labels:

      id              family
      A0A2D0MWR0      TIGR04183
      A0A1Y4R5C6      TIGR00722
      

      The id column must match the accession in your training FASTA headers. For UniProt-style headers (>db|ACCESSION|name ...), the accession is the field between the first pair of | characters.

      Family labels can be any string: database accessions (e.g. TIGR04183), descriptive names (e.g. nitrogenase), or numbers. Labels are case-sensitive.

    • --create-ann: generate annotations automatically from training FASTA headers. Requires headers in the format >db|FAMILY_LABEL|seqid ..., where the field between the first | pair is used as the family label.
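Before running, it can help to confirm that every training sequence has a matching row in the annotation file. The following is a minimal sketch, not part of Snekmer: the helper names (accession_from_header, read_ann, missing_annotations) are hypothetical, and it assumes UniProt-style headers and an annotation file with the tab-separated id and family columns described above. The accession X00000 is made up to show an unmatched sequence.

```python
import csv
import io


def accession_from_header(header):
    """Return the field between the first pair of '|' characters."""
    fields = header.lstrip(">").split("|")
    if len(fields) < 3:
        raise ValueError(f"not a db|ACCESSION|name header: {header!r}")
    return fields[1]


def read_ann(handle):
    """Read a tab-separated id/family table into a dict (assumed layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["id"]: row["family"] for row in reader}


def missing_annotations(fasta_lines, ann):
    """List accessions found in FASTA headers that have no annotation row."""
    missing = []
    for line in fasta_lines:
        if line.startswith(">"):
            accession = accession_from_header(line.strip())
            if accession not in ann:
                missing.append(accession)
    return missing


# Illustrative data only; X00000 is a made-up accession.
ann = read_ann(io.StringIO(
    "id\tfamily\n"
    "A0A2D0MWR0\tTIGR04183\n"
    "A0A1Y4R5C6\tTIGR00722\n"
))
fasta = [
    ">tr|A0A2D0MWR0|EXAMPLE_1 hypothetical description",
    "MKTAYIAKQR",
    ">tr|X00000|EXAMPLE_2 hypothetical description",
    "MSLNEQWERT",
]
unmatched = missing_annotations(fasta, ann)
print(unmatched)
```

For real files, the same functions can be given open file handles instead of the in-memory samples. Because labels and IDs are matched as exact strings, differences in case or stray whitespace show up here as unmatched accessions.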

Run

snekmer easy \
    --train  path/to/training_sequences/ \
    --query  path/to/query_sequences.fasta \
    --ann    path/to/annotations.ann \
    --output results/

Or, if your FASTA headers encode family labels (>db|FAMILY|seqid ...):

snekmer easy \
    --train  path/to/training_sequences/ \
    --query  path/to/query_sequences.fasta \
    --create-ann \
    --output results/

Snekmer will prompt for any missing inputs if flags are omitted.

Results

The main output is a CSV file at results/apply/snekmer_results.csv:

Sequence              Prediction   Score   delta   Confidence
tr|A0A427BXE3|...     TIGR01023    0.198   0.08    1.000
tr|A0A401ZGP4|...     TIGR00757    0.316   0.02    0.922
...

  • Prediction: predicted family (highest cosine similarity to training profiles)

  • Score: cosine similarity to the predicted family

  • delta: gap between top and second-best score (larger = more certain)

  • Confidence: calibrated probability the prediction is correct (0–1)
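To make the Prediction, Score, and delta columns concrete, here is a toy sketch of the arithmetic they describe. It is not Snekmer's implementation: the function names, the three-element vectors, and the family names famA and famB are all hypothetical. How Snekmer builds its profiles and calibrates Confidence is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(query, profiles):
    """Return (family, score, delta) for the best-matching profile.

    Needs at least two profiles so a second-best score exists.
    """
    ranked = sorted(
        ((cosine(query, vec), family) for family, vec in profiles.items()),
        reverse=True,
    )
    top_score, top_family = ranked[0]
    second_score, _ = ranked[1]
    return top_family, top_score, top_score - second_score


# Hypothetical toy profiles and query vector.
profiles = {"famA": [1, 0, 1], "famB": [0, 1, 1]}
family, score, delta = predict([1, 0, 0], profiles)
print(family, round(score, 3), round(delta, 3))
```

In this toy case the query overlaps only with famA, so delta equals the top score. When two families score almost the same, delta shrinks toward zero, which is why a small delta signals a less certain call.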

A confidence of ≥ 0.95 is a reliable starting threshold for high-quality annotations.
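One way to apply this threshold is to filter the results file after the run. The sketch below assumes the CSV is comma-delimited and that its header row uses the column names shown above exactly; verify this against your own output. The function name high_confidence is hypothetical, and the sample rows reuse the example values from this page.

```python
import csv
import io


def high_confidence(handle, threshold=0.95):
    """Return rows whose Confidence is at or above the threshold."""
    reader = csv.DictReader(handle)
    return [row for row in reader if float(row["Confidence"]) >= threshold]


# In-memory sample mirroring the example rows above (assumed CSV layout).
sample = io.StringIO(
    "Sequence,Prediction,Score,delta,Confidence\n"
    "tr|A0A427BXE3|...,TIGR01023,0.198,0.08,1.000\n"
    "tr|A0A401ZGP4|...,TIGR00757,0.316,0.02,0.922\n"
)
kept = high_confidence(sample)
for row in kept:
    print(row["Sequence"], row["Prediction"], row["Confidence"])
```

To run it on real output, pass a file handle instead, for example open("results/apply/snekmer_results.csv", newline=""). Lowering the threshold keeps more annotations at the cost of more likely errors.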

Next steps

  • easy tutorial: full walkthrough with demo data, output interpretation, and post-hoc evaluation.

  • Configuration: tune k-mer length, alphabet, scoring thresholds, and other parameters.

  • Usage: full reference for all Snekmer modes.