{ "cells": [ { "cell_type": "markdown", "id": "441b0477-d63c-4501-9a39-adbf07634e7e", "metadata": {}, "source": [ "# Demo: Snekmer Learn/Apply\n", "\n", "\n", "**Snekmer Learn/Apply** is a protein-function annotation framework that represents proteins as **kmer vectors** and uses **cosine similarity** to predict their functional families. \n", "It operates in two stages:\n", "\n", "**Learn** \n", "Builds a kmer-based model from a curated training set of annotated genomes.\n", "During learning, Snekmer counts k-mers across all training proteins, computes class-specific k-mer signatures, and generates summary statistics (e.g., confidence tables, decoy thresholds).\n", "Increasing the size and diversity of the training set generally improves predictive accuracy.\n", "\n", "**Apply** \n", "Takes the outputs from Learn (kmer signatures, confidence tables, and thresholds) and compares them against novel protein sequences or genomes.\n", "For each new protein, Snekmer computes the kmer vector, compares it to the learned signature dataset via cosine similarity, and assigns the most likely family along with a confidence value.\n", "\n", "\n", "**Demo data** \n", "In this notebook, we will demonstrate how to use **Snekmer Learn** and **Snekmer Apply** with a small training dataset of 10 annotated fasta files and 1 unannotated fasta file.\n", "\n", "The training set contains 10,000 annotated proteins drawn from 200 TIGRFAM families (50 per family; no unannotated sequences). 
The test set includes 3,000 proteins balanced across three groups: in-family (selected TIGRFAMs), other annotated families, and unannotated sequences (1,000 each).\n", "\n", "\n", "## Getting Started with Snekmer Learn\n", "\n", "### Setup\n", "\n", "First, install Snekmer using the instructions in the [user installation guide](https://snekmer.readthedocs.io/en/learn_decoy_implementation/getting_started/install.html).\n", "\n", "Before running Snekmer, verify that the input files are in an **_input_** directory located at the same level as the **_config.yaml_** file. The assumed file directory structure is illustrated below.\n", "\n", "    .\n", "    ├── input\n", "    │   ├── A.fasta\n", "    │   ├── B.fasta\n", "    │   ├── C.fasta\n", "    │   ├── D.fasta\n", "    │   └── etc.\n", "    ├── config.yaml\n", "    └── annotations\n", "        └── annotations.ann\n", "\n", "(Note: Snekmer automatically creates the **_output_** directory when creating output files, so there is no need to create this folder in advance.)\n", "\n", "To make Snekmer available in the Jupyter notebook, run the following in bash after [installing ipykernel](https://pypi.org/project/ipykernel/):\n", "```\n", "source ~/snekmer_env/bin/activate\n", "python -m ipykernel install --user --name=snekmer\n", "jupyter notebook\n", "```" ] }, { "cell_type": "markdown", "id": "b3e77267", "metadata": {}, "source": [ "### Annotation File \n", "\n", "The annotation file is a two-column TSV. The first column, **id**, refers to the **sequence id**. The second column, **family**, refers to the family, function, or group over which kmer counts are summed in the kmer-association matrix.\n", "\n", "| id | family |\n", "| ---------------- | --------- |\n", "| A0A2D0MWR0 | TIGR04183 |\n", "| A0A2D0MY79 | TIGR04131 |\n", "| **A0A1Y4R5C6** | TIGR00722 |\n", "\n", "\n", "Example of a matching **sequence id** in a fasta file:\n", "\n", "\\>tr|**A0A1Y4R5C6**|A0A1Y4R5C6_9FIRM Fumarate hydratase OS=Lachnoclostridium sp. 
An14 OX=1965562 GN=B5E84_18440 PE=3 SV=1\n", "MREIQASQITQAVRDMCIEANYSLSPDMRQRFETAADQEESPLGKMIFGQLKENLDIAQQ\n", "DQIPICQDTGMAVVFVNVGQEVHIDGDLTAAVNEGVRLGYEEGYLRKSVVRDPIERENTR\n", "DNTPAVLHTSLVPGDQVEITVAPKGFGSENMSRIFMLKPADGLEGVKQAILTAVRDAGPN\n", "ACPPMVVGVGIGGTFEKCALMAKHALTRDVNESSPIPYVRELEQEMLNRINGLGIGPGGL\n", "GGAITALAVNIETYPTHIAGLPVAVNICCHVNRHAKRIL" ] }, { "cell_type": "markdown", "id": "3e3730b1", "metadata": {}, "source": [ "### Notes on Using Snekmer\n", "\n", "Snekmer assumes that the user will primarily process input files using the command line. For more detailed instructions, refer to the [README](https://github.com/PNNL-CompBio/Snekmer).\n", "\n", "The basic process for running Snekmer Learn/Apply is as follows:\n", "\n", "1. Verify that your file directory structure is correct and that the top-level directory contains a **_config.yaml_** file.\n", "   - A **_config.yaml_** template has been included in the Snekmer codebase at **_resources/learn_apply/config.yaml_**.\n", "2. Modify the **_config.yaml_** with the desired parameters.\n", "3. Use the command line to navigate to the directory containing both the **_config.yaml_** file and **_input_** directory.\n", "4. Run `snekmer learn`, then copy the appropriate outputs to a separate directory and run `snekmer apply`.\n" ] }, { "cell_type": "markdown", "id": "cb714c7d", "metadata": {}, "source": [ "## Running Snekmer Learn Pipeline\n", "\n", "### Setup\n", "\n", "**Note:** This notebook assumes you are running it from the **`resources/tutorial/`** directory, where this notebook file is located.\n", "\n", "To mimic the command-line implementation of Snekmer Learn/Apply, we initialize a dictionary (rather than a YAML file) and gather all input files. Input files are detected here using `glob.glob`, the same way Snekmer detects input files." 
] }, { "cell_type": "code", "execution_count": null, "id": "c7544bc5", "metadata": {}, "outputs": [], "source": [ "# --- Standard library ---\n", "import os, sys, shutil, gzip, pickle\n", "from glob import glob\n", "from pathlib import Path\n", "# --- Third-party ---\n", "import numpy as np\n", "import pandas as pd\n", "from Bio import SeqIO\n", "import matplotlib.pyplot as plt\n", "# --- Snekmer core APIs---\n", "import snekmer as skm\n", "from snekmer.vectorize import KmerVec, reduce, FULL_ALPHABETS\n", "from snekmer.io import read_kmers\n", "# --- Snekmer scripts ---\n", "sys.path.append('../../snekmer/scripts')\n", "import learn_learn, learn_merge, learn_eval_apply_sequences\n", "import learn_eval_apply_reverse_seqs, learn_evaluate_sequences\n", "import learn_reverse_decoy_evaluations\n", "import apply as apply_script" ] }, { "cell_type": "markdown", "id": "fc39246f", "metadata": {}, "source": [ "### Build the Tutorial Working Directory (from `demo_sequences`)\n", "\n", "This step creates the `learn` and `apply` directories used throughout the notebook by **copying** the demo inputs and annotations from `../demo_sequences/learn_apply_inputs`." 
] }, { "cell_type": "code", "execution_count": 20, "id": "a1de391d", "metadata": {}, "outputs": [], "source": [ "# Build the learn and apply directories from demo_sequences\n", "for d in [\"learn/input\", \"learn/annotations\", \"apply/input\", \"apply/annotations\"]:\n", "    os.makedirs(d, exist_ok=True)\n", "# config\n", "shutil.copy2(\"../config.yaml\", \"learn/config.yaml\")\n", "shutil.copy2(\"../config.yaml\", \"apply/config.yaml\")\n", "# annotations\n", "ann = \"../demo_sequences/learn_apply_inputs/annotations/TIGRFAMs_annotation.ann\"\n", "shutil.copy2(ann, \"learn/annotations/TIGRFAMs_annotation.ann\")\n", "# learn inputs (training_sequences_1.fasta through training_sequences_10.fasta)\n", "for i in range(1, 11):\n", "    shutil.copy2(\n", "        f\"../demo_sequences/learn_apply_inputs/learn/training_sequences_{i}.fasta\",\n", "        f\"learn/input/training_sequences_{i}.fasta\",\n", "    )\n", "# apply inputs\n", "for src in glob(\"../demo_sequences/learn_apply_inputs/apply/test_sequences_*.fasta\"):\n", "    shutil.copy2(src, \"apply/input/\")" ] }, { "cell_type": "markdown", "id": "489c0a97", "metadata": {}, "source": [ "### Configuration File Input\n", "Please note, this is a stripped-down version of the normal configuration file." 
] }, { "cell_type": "code", "execution_count": 21, "id": "01a798b9", "metadata": {}, "outputs": [], "source": [ "config = {\n", "    ### Base Parameters\n", "    \"k\": 8, # K-mer size\n", "    \"alphabet\": 2, # Alphabet selection at https://snekmer.readthedocs.io/en/latest/background/overview.html#alphabets\n", "    ### INPUT HANDLING\n", "    \"input\": {\n", "        \"file_extensions\": [\"fasta\", \"fna\", \"faa\", \"fa\"]\n", "    },\n", "    ### Learn & Apply Parameters\n", "    \"learn_apply\": {\n", "        \"save_apply_associations\": False,\n", "        # Fragmentation settings\n", "        \"fragmentation\": False, # Please note, the notebook is not set up to use the fragmentation option.\n", "        \"version\": \"absolute\",\n", "        \"frag_length\": 50,\n", "        \"min_length\": 50,\n", "        \"location\": \"random\",\n", "        \"seed\": 999,\n", "        # Confidence weighting: meant for subsequent data additions; not supported in the notebook\n", "        \"conf_weight_modifier\": 20,\n", "        # Selection methods: \"top_hit\", \"greatest_distance\", \"combined_distance\"\n", "        \"selection\": \"top_hit\",\n", "        # Threshold column from family_summary_stats.csv or None\n", "        \"threshold\": \"Median\",\n", "        # Weights are only relevant if using combined_distance\n", "        \"weight_top\": 0.5,\n", "        \"weight_distance\": 0.5,\n", "        # Output naming\n", "        \"apply_output\": \"snekmer_results.csv\", # Not applicable in notebook, but this is where final results would be saved.\n", "    },\n", "}" ] }, { "cell_type": "markdown", "id": "265cd443", "metadata": {}, "source": [ "### Rule 0: Collect input files\n", "\n", "Before executing the workflow, Snekmer scans the input directory and gathers all files whose extensions match those defined in the configuration (e.g., `.fasta`, `.fa`, and their `.gz`-compressed equivalents). These files must contain protein sequences in FASTA format.\n", "\n", "In this notebook, the demo data directory is referenced explicitly at `learn/input/`. 
However, when using the full Snekmer CLI, the tool expects your files to be organized following the directory structure described in the Setup section above." ] }, { "cell_type": "code", "execution_count": 22, "id": "252902b6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "unzipped files:\t ['learn/input/training_sequences_3.fasta', 'learn/input/training_sequences_1.fasta', 'learn/input/training_sequences_5.fasta', 'learn/input/training_sequences_10.fasta', 'learn/input/training_sequences_7.fasta', 'learn/input/training_sequences_2.fasta', 'learn/input/training_sequences_6.fasta', 'learn/input/training_sequences_4.fasta', 'learn/input/training_sequences_8.fasta', 'learn/input/training_sequences_9.fasta']\n" ] } ], "source": [ "# Collect all fasta files. Paths are relative to resources/tutorial/\n", "input_dir = \"learn/input/\"\n", "unzipped = []\n", "for ext in config[\"input\"][\"file_extensions\"]:\n", " unzipped.extend(glob(os.path.join(input_dir, f\"*.{ext}\")))\n", "print(\"unzipped files:\\t\", unzipped)\n", "\n", "# Define output directory (and create if missing)\n", "output_dir = \"learn/output\"\n", "if not os.path.exists(output_dir):\n", " os.makedirs(output_dir)" ] }, { "cell_type": "markdown", "id": "9c42bae1", "metadata": {}, "source": [ "### Vectorize Helper Function\n", "This helper function replicates the k-mer vectorization step normally performed inside the Snakemake workflow (kmerize, vectorize). Given a FASTA file, it constructs (or loads) a k-mer basis, reduces amino-acid sequences to the chosen alphabet, generates a binary presence/absence matrix of kmers for every protein, and writes the results to the same .npz and .kmers formats used by the full Learn pipeline. This allows the notebook to produce vectorized inputs that behave exactly like those generated by the CLI workflow." 
] }, { "cell_type": "code", "execution_count": null, "id": "3bca66d3", "metadata": {}, "outputs": [], "source": [ "def run_vectorize_like_snakemake(\n", "    fasta_path: str,\n", "    alphabet,\n", "    k: int,\n", "    output_npz: str,\n", "    output_kmerobj: str,\n", "    basis_path: str = None,\n", "    min_filter: int = 0,\n", "):\n", "    \"\"\"\n", "    Notebook helper that replicates kmerize.smk::vectorize.\n", "    \"\"\"\n", "    kmer = KmerVec(alphabet=alphabet, k=k)\n", "    if basis_path is not None and os.path.exists(basis_path):\n", "        kmerbasis = np.array(read_kmers(basis_path), dtype=object)\n", "        nprot = sum(1 for line in open(fasta_path) if line.startswith(\">\"))\n", "    else:\n", "        kmer_counts = {}\n", "        nprot = 0\n", "        for f in SeqIO.parse(fasta_path, \"fasta\"):\n", "            nprot += 1\n", "            these = kmer.reduce_vectorize(f.seq)\n", "            for key in these:\n", "                kmer_counts[key] = kmer_counts.get(key, 0) + 1\n", "\n", "        if kmer_counts:\n", "            all_kmers = np.array(list(kmer_counts.keys()), dtype=object)\n", "            all_counts = np.array(list(kmer_counts.values()))\n", "            kmerbasis = all_kmers[all_counts > min_filter]\n", "        else:\n", "            kmerbasis = np.empty(0, dtype=object)\n", "    kmer.set_kmer_set(list(kmerbasis))\n", "    vecs = np.zeros((nprot, len(kmerbasis)), dtype=int)\n", "    seqs, ids, lengths = [], [], []\n", "    n = 0\n", "\n", "    fasta = SeqIO.parse(fasta_path, \"fasta\")\n", "    for f in fasta:\n", "        addvec = kmer.reduce_vectorize(f.seq)\n", "        if len(kmerbasis) > 0 and len(addvec) > 0:\n", "            vecs[n][np.isin(kmerbasis, addvec)] = 1\n", "        seqs.append(\n", "            reduce(\n", "                f.seq,\n", "                alphabet=alphabet,\n", "                mapping=FULL_ALPHABETS,\n", "            )\n", "        )\n", "        ids.append(f.id)\n", "        lengths.append(len(f.seq))\n", "        n += 1\n", "\n", "    np.savez_compressed(\n", "        output_npz,\n", "        kmerlist=kmerbasis,\n", "        ids=ids,\n", "        seqs=seqs,\n", "        vecs=vecs,\n", "        lengths=lengths,\n", "    )\n", "\n", "    with open(output_kmerobj, \"wb\") as f:\n", "        pickle.dump(kmer, f)" ] }, { "cell_type": "markdown", "id": "3e019930", 
"metadata": {}, "source": [ "### Rule 1: Kmerize / Vectorize\n", "\n", "In this step, each input FASTA file is converted into a kmer–based feature matrix using the k value and reduced amino-acid alphabet defined in the config. For every protein sequence, Snekmer computes its reduced sequence and a binary presence/absence vector indicating which kmers appear.\n", "\n", "In the notebook, this is performed using the helper function above (`run_vectorize_like_snakemake`) that replicates the behavior of the Snakemake `kmerize.smk::vectorize` rule, producing output files identical in structure to those generated by the CLI.\n", "\n", "The result is one `.npz` file per input FASTA containing the sequence IDs, reduced sequences, k-mer basis, and the corresponding binary kmer vectors used by the Learn stage." ] }, { "cell_type": "code", "execution_count": 24, "id": "86378b89", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vectorized learn/input/training_sequences_3.fasta to learn/output/vector/training_sequences_3.npz\n", "Vectorized learn/input/training_sequences_1.fasta to learn/output/vector/training_sequences_1.npz\n", "Vectorized learn/input/training_sequences_5.fasta to learn/output/vector/training_sequences_5.npz\n", "Vectorized learn/input/training_sequences_10.fasta to learn/output/vector/training_sequences_10.npz\n", "Vectorized learn/input/training_sequences_7.fasta to learn/output/vector/training_sequences_7.npz\n", "Vectorized learn/input/training_sequences_2.fasta to learn/output/vector/training_sequences_2.npz\n", "Vectorized learn/input/training_sequences_6.fasta to learn/output/vector/training_sequences_6.npz\n", "Vectorized learn/input/training_sequences_4.fasta to learn/output/vector/training_sequences_4.npz\n", "Vectorized learn/input/training_sequences_8.fasta to learn/output/vector/training_sequences_8.npz\n", "Vectorized learn/input/training_sequences_9.fasta to learn/output/vector/training_sequences_9.npz\n" ] } 
], "source": [ "vector_dir = os.path.join(output_dir, \"vector\")\n", "kmer_dir = os.path.join(output_dir, \"kmerize\")\n", "os.makedirs(vector_dir, exist_ok=True)\n", "os.makedirs(kmer_dir, exist_ok=True)\n", "\n", "for fa in unzipped:\n", " base = os.path.basename(skm.utils.split_file_ext(fa)[0])\n", " out_npz = os.path.join(vector_dir, f\"{base}.npz\")\n", " out_kmer = os.path.join(kmer_dir, f\"{base}.kmers\")\n", "\n", " #call helper function\n", " run_vectorize_like_snakemake(\n", " fasta_path=fa,\n", " alphabet=config[\"alphabet\"],\n", " k=config[\"k\"],\n", " output_npz=out_npz,\n", " output_kmerobj=out_kmer,\n", " basis_path=None,\n", " min_filter=0,\n", " )\n", "\n", " print(f\"Vectorized {fa} to {out_npz}\")\n" ] }, { "cell_type": "markdown", "id": "c169d7c6", "metadata": {}, "source": [ "### Rule 2: Learn\n", "\n", "In this step, Snekmer converts each vectorized FASTA file into a **kmer count matrix**, where each row represents a protein family and each column represents kmer frequency.\n", "\n", "Each input FASTA produces its own per-family kmer count matrix; these matrices are then merged in the next rule to produce the **cumulative kmer counts** used as the Learn model. Conceptually, this step is where Snekmer “learns” the characteristic kmer profiles of each annotated family, forming the core reference used for annotation during Apply." 
] }, { "cell_type": "code", "execution_count": 25, "id": "ffa8c2e1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counts Data Generated for: learn/output/vector/training_sequences_3.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_1.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_5.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_10.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_7.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_2.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_6.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_4.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_8.npz\n", "Counts Data Generated for: learn/output/vector/training_sequences_9.npz\n" ] } ], "source": [ "# This cell runs the logic from `learn_learn.py`\n", "learn_output_dir = os.path.join(output_dir, \"learn\")\n", "os.makedirs(learn_output_dir, exist_ok=True)\n", "annot_files = glob(os.path.join(\"learn/annotations\", \"TIGRFAMs_annotation.ann\"))\n", "\n", "for fa in unzipped:\n", " base_name = os.path.basename(skm.utils.split_file_ext(fa)[0])\n", " input_data_path = os.path.join(output_dir, \"vector\", f\"{base_name}.npz\")\n", "\n", " library = learn_learn.Library(out_dir=output_dir)\n", " library.execute_all(annot_files, input_data_path)\n", " print(f\"Counts Data Generated for: {input_data_path}\")" ] }, { "cell_type": "markdown", "id": "228c7834", "metadata": {}, "source": [ "### Rule 3: Merge\n", "\n", "In this step, all previously generated per-file kmer count matrices are merged into a single cumulative counts table. When running the pipeline through the CLI, you may also optionally provide an existing counts file to merge in. 
This enables **additive integration of kmer counts**, allowing large projects to scale by incrementally combining new training data without rerunning earlier steps (not demonstrated in this notebook).\n", "\n", "The result is a unified kmer count matrix representing the full training set." ] }, { "cell_type": "code", "execution_count": 26, "id": "9d099124", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataframes merged: 0 out of 10\n", "Dataframes merged: 1 out of 10\n", "Dataframes merged: 2 out of 10\n", "Dataframes merged: 3 out of 10\n", "Dataframes merged: 4 out of 10\n", "Dataframes merged: 5 out of 10\n", "Dataframes merged: 6 out of 10\n", "Dataframes merged: 7 out of 10\n", "Dataframes merged: 8 out of 10\n", "Dataframes merged: 9 out of 10\n", "\n", "Checking for base file to merge with.\n", "\n", "No base directory detected\n", "\n", "\n", "Database Merged. Not merged with base file.\n", "\n", "Merge complete. Output saved to learn/output/learn/kmer_counts_total.csv\n" ] } ], "source": [ "# This cell runs the logic from `learn_merge.py`\n", "input_counts = glob(os.path.join(output_dir, \"learn\", \"kmer_counts_*.csv\"))\n", "\n", "# Check for a base file (to add to existing counts).\n", "# In this tutorial, we don't use one, so we pass an empty string\n", "base_counts_path = \"\"\n", "output_totals_path = os.path.join(output_dir, \"learn\", \"kmer_counts_total.csv\")\n", "\n", "learn_merge.run_merge(\n", " counts_files=input_counts,\n", " base_counts_path=base_counts_path, \n", " output_path=output_totals_path\n", ")\n", "print(f\"Merge complete. Output saved to {output_totals_path}\")" ] }, { "cell_type": "markdown", "id": "279d5576", "metadata": {}, "source": [ "### Rule 4: Eval_Apply\n", "\n", "In this step, Snekmer computes the cosine similarity between the merged kmer count database (the learned reference profiles) and the kmer vector for each sequence in the input FASTA files. 
This effectively measures how similar each sequence is to every annotated family based on its kmer composition.\n", "\n", "In the Learn pipeline, this serves as a **self-evaluation step**: we compare each sequence’s predicted family (based on similarity) to its known annotation. The resulting similarity scores are then used in the next rule to compute **family-level confidence metrics**." ] }, { "cell_type": "code", "execution_count": 27, "id": "2a5e2e6f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File completed: seq_annotation_scores_training_sequences_3.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_1.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_5.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_10.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_7.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_2.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_6.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_4.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_8.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_9.csv.gz\n" ] } ], "source": [ "# This cell runs the logic from `learn_eval_apply_sequences.py`\n", "eval_apply_dir = os.path.join(output_dir, \"eval_apply\")\n", "os.makedirs(eval_apply_dir, exist_ok=True)\n", "compare_associations_path = os.path.join(output_dir, \"learn\", \"kmer_counts_total.csv\")\n", "annot_files = glob(\"learn/annotations/*.ann\")\n", "\n", "for fa in unzipped:\n", " base_name = os.path.basename(skm.utils.split_file_ext(fa)[0])\n", " input_data_path = os.path.join(output_dir, \"vector\", f\"{base_name}.npz\")\n", " out_name = f\"seq_annotation_scores_{base_name}.csv.gz\"\n", " output_apply_path = os.path.join(eval_apply_dir, out_name)\n", "\n", " evaluator = learn_eval_apply_sequences.EvaluateSequences(\n", " 
compare_associations=compare_associations_path,\n", " annotation_files=annot_files,\n", " input_data=input_data_path,\n", " output_path=output_apply_path,\n", " )\n", " \n", " evaluator.execute_all(config=config)\n", " print(f\"File completed: {out_name}\")" ] }, { "cell_type": "markdown", "id": "65ffad99-9f03-4245-8b88-35eca8818b85", "metadata": {}, "source": [ "### Rule 5: Eval_Apply_Reverse_Sequences\n", "\n", "This rule repeats the Eval_Apply procedure, but using **reversed** versions of every sequence. Reversing removes the biological signal while preserving length and composition, producing a baseline (null) distribution of cosine similarities. These “reverse-sequence” similarity scores serve as a **background model**, allowing Snekmer to compute family-specific thresholds that **distinguish true signal from noise** in the next step." ] }, { "cell_type": "code", "execution_count": 28, "id": "5c9b90d7-5f16-4cba-882d-164a5125b442", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File completed: seq_annotation_scores_training_sequences_3.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_1.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_5.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_10.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_7.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_2.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_6.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_4.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_8.csv.gz\n", "File completed: seq_annotation_scores_training_sequences_9.csv.gz\n" ] } ], "source": [ "# This cell runs the logic from `learn_eval_apply_reverse_seqs.py`\n", "eval_apply_rev_dir = os.path.join(output_dir, \"eval_apply_reversed\")\n", "os.makedirs(eval_apply_rev_dir, exist_ok=True)\n", "\n", "compare_associations_path = 
os.path.join(output_dir, \"learn\", \"kmer_counts_total.csv\")\n", "annot_files = glob(\"learn/annotations/*.ann\")\n", "\n", "for fa in unzipped:\n", " base_name = os.path.basename(skm.utils.split_file_ext(fa)[0])\n", " input_data_path = os.path.join(output_dir, \"vector\", f\"{base_name}.npz\")\n", " \n", " out_name = f\"seq_annotation_scores_{base_name}.csv.gz\"\n", " output_apply_path = os.path.join(eval_apply_rev_dir, out_name)\n", "\n", " # Initialize the class from the imported script\n", " reverse_evaluator = learn_eval_apply_reverse_seqs.CompareReverseSeqs(\n", " compare_associations=compare_associations_path,\n", " annotation_files=annot_files,\n", " input_data=input_data_path,\n", " output_path=output_apply_path,\n", " )\n", " \n", " reverse_evaluator.execute_all()\n", " print(f\"File completed: {out_name}\")" ] }, { "cell_type": "markdown", "id": "c666673e", "metadata": {}, "source": [ "### Rule 6: ReverseDecoy_Evaluations (Family Statistics)\n", "\n", "In this step, we compute summary statistics for each family using the cosine similarity scores generated from **reversed sequences** (Rule 5). Because reversed sequences remove biological signal, their similarity scores form a decoy or null distribution.\n", "\n", "Snekmer aggregates these decoy scores per family to produce family-level background statistics (e.g., medians, quantiles, maxima). These statistics define what “noise-level similarity” looks like for each family and are later used to establish confidence thresholds." 
] }, { "cell_type": "code", "execution_count": null, "id": "70fd3155", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Family summary stats created at: learn/output/eval_conf/family_summary_stats.csv\n" ] } ], "source": [ "# This cell runs the logic from `learn_reverse_decoy_evaluations.py`\n", "eval_conf_dir = os.path.join(output_dir, \"eval_conf\")\n", "os.makedirs(eval_conf_dir, exist_ok=True)\n", "\n", "eval_apply_reversed_data = glob(os.path.join(output_dir, \"eval_apply_reversed\", \"seq_annotation_scores_*.csv.gz\"))\n", "\n", "decompressed_files = []\n", "for gz_file in eval_apply_reversed_data:\n", "    csv_file = gz_file.removesuffix('.gz')\n", "    with gzip.open(gz_file, 'rb') as f_in:\n", "        with open(csv_file, 'wb') as f_out:\n", "            shutil.copyfileobj(f_in, f_out)\n", "    decompressed_files.append(csv_file)\n", "\n", "# Check for base checkpoint file (optional)\n", "base_checkpoint_files = glob(\"learn/base/thresholds/*.csv\")\n", "base_checkpoint_path = base_checkpoint_files[0] if base_checkpoint_files else None\n", "\n", "family_stats_output_path = os.path.join(eval_conf_dir, \"family_summary_stats.csv\")\n", "checkpoint_output_path = os.path.join(eval_conf_dir, \"family_stats_checkpoint.csv\")\n", "\n", "learn_reverse_decoy_evaluations.execute_all(\n", "    base_family_checkpoint=base_checkpoint_path,\n", "    eval_apply_data=decompressed_files,\n", "    family_stats_output=family_stats_output_path,\n", "    checkpoint_output=checkpoint_output_path\n", ")\n", "\n", "for f in decompressed_files:\n", "    os.remove(f)\n", "print(f\"Family summary stats created at: {family_stats_output_path}\")" ] }, { "cell_type": "markdown", "id": "6db0029b-8081-4a45-ab95-1d0e45f49bc9", "metadata": {}, "source": [ "### Rule 7: Eval_Conf (Global Confidence)\n", "\n", "In this step, we evaluate how well the **forward-sequence** cosine similarity scores (from Rule 4) align with the true annotations. 
Using the decoy-derived family statistics from the previous rule, Snekmer converts each sequence’s **delta**, the difference between its top two cosine similarity scores, into a calibrated **global confidence score**.\n", "\n", "**The confidence score reflects the probability that the top-ranked family prediction is correct, based on empirical rates of true positives vs. false positives observed across the training families.**" ] }, { "cell_type": "code", "execution_count": 30, "id": "2b5031d9-8f18-4a15-90e1-683336b9dc78", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Base confidence file not found or multiple files present. Only one file is allowed in baseConfidence.\n", "Global confidence scores created at: learn/output/eval_conf/global_confidence_scores.csv\n" ] } ], "source": [ "# This cell runs the logic from `learn_evaluate_sequences.py`\n", "eval_apply_data = glob(os.path.join(output_dir, \"eval_apply\", \"seq_annotation_scores_*.csv.gz\"))\n", "eval_conf_dir = os.path.join(output_dir, \"eval_conf\")\n", "\n", "decompressed_files = []\n", "for gz_file in eval_apply_data:\n", "    csv_file = gz_file.removesuffix('.gz')\n", "    with gzip.open(gz_file, 'rb') as f_in:\n", "        with open(csv_file, 'wb') as f_out:\n", "            shutil.copyfileobj(f_in, f_out)\n", "    decompressed_files.append(csv_file)\n", "\n", "family_stats_path = os.path.join(eval_conf_dir, \"family_summary_stats.csv\")\n", "output_glob_path = os.path.join(eval_conf_dir, \"global_confidence_scores.csv\")\n", "\n", "# These are not used in this tutorial, but included for completeness\n", "base_conf_files = glob(\"learn/base/confidence/*.csv\")\n", "modifier = config.get(\"learn_apply\", {}).get(\"conf_weight_modifier\", 20)\n", "\n", "# Initialize the Evaluator class from the imported script\n", "evaluator = learn_evaluate_sequences.Evaluator(\n", "    input_data=decompressed_files,\n", "    output_glob_path=output_glob_path,\n", "    reverse_decoy_stats=family_stats_path,\n", "    
modifier=modifier,\n", " confidence_data=base_conf_files,\n", " config=config\n", ")\n", "\n", "evaluator.execute_all()\n", "\n", "for f in decompressed_files:\n", " os.remove(f)\n", "print(f\"Global confidence scores created at: {output_glob_path}\")" ] }, { "cell_type": "markdown", "id": "e118a51f", "metadata": {}, "source": [ "### Learn Workflow Output Files (Required for Apply)\n", "\n", "Before running **Apply**, three data files from the **Learn** workflow must be staged in the `apply/` directory:\n", "\n", "---\n", "\n", "#### 1. `counts/kmer_counts_total.csv`\n", "\n", "A consolidated **kmer association matrix**. \n", "Each row is a protein **family**, and columns are **kmers**. \n", "Values are aggregated k-mer frequencies across all sequences of that family.\n", "\n", "**Example (truncated):**\n", "\n", "| \\_\\_index\\_level\\_0\\_\\_ | Sequence count | Kmer Count | CPPPCACC | PPPCACCP | PPCACCPP | … |\n", "|-------------------------|----------------|------------|----------|----------|----------|---|\n", "| Totals | 10000 | 3513388 | 602 | 641 | 707 | … |\n", "| TIGR01024 | 50 | 5962 | 1 | 1 | 1 | … |\n", "| TIGR00089 | 50 | 22869 | 1 | 2 | 2 | … |\n", "| TIGR01171 | 50 | 13316 | 0 | 1 | 0 | … |\n", "\n", "Used during **Apply** to compute cosine similarity between a new sequence’s k-mer profile and known family profiles.\n", "\n", "---\n", "\n", "#### 2. 
`confidence/global_confidence_scores.csv`\n", "\n", "A calibration table mapping **Difference** (top–second score **delta**) to **confidence** \n", "(the estimated probability the prediction is correct at that difference).\n", "\n", "**Example (header):**\n", "\n", "| Difference | confidence | weight | total_sum | inter_sum | cur_sum |\n", "|------------|------------|--------|----------|----------|---------|\n", "| 0.00 | 0.383 | 10000 | 75 | 75 | 75 |\n", "| 0.01 | 0.655 | 10000 | 71 | 71 | 71 |\n", "| 0.02 | 0.922 | 10000 | 64 | 64 | 64 |\n", "| 0.03 | 0.957 | 10000 | 87 | 87 | 87 |\n", "| 0.04 | 0.983 | 10000 | 152 | 152 | 152 |\n", "| 0.05 | 1.000 | 10000 | 203 | 203 | 203 |\n", "\n", "**Column notes:**\n", "- **weight** — total number of known predictions contributing to the curve (global sample size).\n", "- **total_sum** — raw per-bin count of observations (**T+F**) at that `Difference` value from the data (no smoothing).\n", "- **inter_sum** — linearly interpolated version of `total_sum` to fill gaps and smooth sparsely populated bins.\n", "- **cur_sum** — per-bin **current-run** count of observations (**T+F**). When merging with a prior file, this becomes the prior `cur_sum` **plus** the current run’s per-bin counts.\n", "> Note, These are only required when adding additional data to a pre-existing kmer-counts-matrix. (Not shown in tutorial)\n", "\n", "Used during **Apply** to assign a calibrated confidence to each predicted family.\n", "\n", "---\n", "\n", "#### 3. `stats/family_summary_stats.csv`\n", "\n", "Per-family **reverse-sequence (decoy) statistics**. 
\n", "These describe decoy behavior and provide reference thresholds to help **Apply** filter low-confidence hits.\n", "\n", "**Example (header):**\n", "\n", "| family | Mean | Std_Dev | Min | 10th_Percentile | 25th_Percentile | Median | 75th_Percentile | 90th_Percentile | Max | 1_Std_Dev_Above | 1_Std_Dev_Below | 2_Std_Dev_Above | 2_Std_Dev_Below |\n", "|-----------|------|---------|------|------------------|------------------|--------|------------------|------------------|------|------------------|------------------|------------------|------------------|\n", "| TIGR01024 | 0.072| 0.031 | 0.003| 0.035 | 0.049 | 0.069 | 0.092 | 0.115 | 0.243| 0.104 | 0.041 | 0.135 | 0.010 |\n", "| TIGR00089 | 0.138| 0.050 | 0.019| 0.080 | 0.101 | 0.131 | 0.169 | 0.208 | 0.479| 0.188 | 0.088 | 0.238 | 0.038 |\n", "| TIGR01171 | 0.083| 0.033 | 0.007| 0.043 | 0.059 | 0.080 | 0.103 | 0.126 | 0.265| 0.115 | 0.050 | 0.148 | 0.018 |\n", "| TIGR01003 | 0.094| 0.034 | 0.013| 0.052 | 0.070 | 0.092 | 0.116 | 0.138 | 0.278| 0.128 | 0.060 | 0.162 | 0.027 |\n", "| TIGR01135 | 0.122| 0.043 | 0.018| 0.069 | 0.092 | 0.120 | 0.147 | 0.178 | 0.382| 0.165 | 0.079 | 0.207 | 0.037 |\n", " \n", "Tip: You can choose operating thresholds like `1_Std_Dev_Above` (above mean), or a percentile (e.g., 90th) depending on precision/recall needs." ] }, { "cell_type": "markdown", "id": "e12b48a0", "metadata": {}, "source": [ "### Intermediate Steps\n", "\n", "Users will have to extract key outputs and copy them into a new directory to run the Apply pipeline. 
When running from the command line, all of the needed inputs are copied automatically to `apply_inputs`, which can then be moved to the Apply directory.\n" ] }, { "cell_type": "code", "execution_count": 31, "id": "a370faba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'apply/stats/family_summary_stats.csv'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Where the Learn workflow wrote its outputs\n", "learn_output_dir = os.path.join(\"learn\", \"output\")\n", "eval_conf_dir = os.path.join(learn_output_dir, \"eval_conf\")\n", "# Where Apply expects its inputs (tutorial staging dir)\n", "apply_base = os.path.join(\"apply\")\n", "os.makedirs(os.path.join(apply_base, \"counts\"), exist_ok=True)\n", "os.makedirs(os.path.join(apply_base, \"confidence\"), exist_ok=True)\n", "os.makedirs(os.path.join(apply_base, \"stats\"), exist_ok=True)\n", "# Source files (from Learn)\n", "src_kmer_counts = os.path.join(learn_output_dir, \"learn\", \"kmer_counts_total.csv\")\n", "src_global_conf = os.path.join(eval_conf_dir, \"global_confidence_scores.csv\")\n", "src_family_stats = os.path.join(eval_conf_dir, \"family_summary_stats.csv\")\n", "# Destination files (for Apply)\n", "dst_kmer_counts = os.path.join(apply_base, \"counts\", \"kmer_counts_total.csv\")\n", "dst_global_conf = os.path.join(apply_base, \"confidence\", \"global_confidence_scores.csv\")\n", "dst_family_stats = os.path.join(apply_base, \"stats\", \"family_summary_stats.csv\")\n", "# Copy\n", "shutil.copyfile(src_kmer_counts, dst_kmer_counts)\n", "shutil.copyfile(src_global_conf, dst_global_conf)\n", "shutil.copyfile(src_family_stats, dst_family_stats)" ] }, { "cell_type": "markdown", "id": "44c0aced", "metadata": {}, "source": [ "## Getting Started with Snekmer Apply\n", "\n", "### Setup\n", "\n", "\n", "Before running Snekmer Apply, ensure that all sequence files to annotate are placed in an **_input_** directory located at the same level as the **_config.yaml_** file. 
In addition, the Apply pipeline requires three key outputs from the Learn pipeline (`kmer counts`, `global confidence scores`, and `family-level reversed-sequence statistics`), each stored in its own subdirectory. The expected directory structure is shown below:\n", "\n", "    .\n", "    ├── input\n", "    │   ├── W.fasta\n", "    │   ├── X.fasta\n", "    │   ├── Y.fasta\n", "    │   ├── Z.fasta\n", "    │   └── etc.\n", "    ├── config.yaml\n", "    ├── counts\n", "    │   └── kmer_counts_total.csv\n", "    ├── confidence\n", "    │   └── global_confidence_scores.csv\n", "    └── stats\n", "        └── family_summary_stats.csv\n", "\n", "Note: Snekmer automatically creates the **_output_** directory when creating output files, so there is no need to create this folder in advance.\n", "\n" ] }, { "cell_type": "markdown", "id": "fb45f440", "metadata": {}, "source": [ "## Running Snekmer Apply Pipeline" ] }, { "cell_type": "markdown", "id": "844b5fc0", "metadata": {}, "source": [ "### Rule 1: Preprocess (Vectorize)\n", "\n", "In this step, we kmerize and vectorize the new, unknown sequences using the same alphabet and kmer settings that were used during Learn. This produces `.npz` files containing the reduced sequences and kmer presence/absence vectors, ensuring the Apply pipeline uses representations that are directly comparable to the training set." ] }, { "cell_type": "code", "execution_count": 32, "id": "d71dd4ed", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "unzipped files:\t ['apply/input/test_sequences_1.fasta']\n", "output directory:\t apply/output\n", "Vectorized apply/input/test_sequences_1.fasta to apply/output/vector/test_sequences_1.npz\n" ] } ], "source": [ "# 1. 
Collect all fasta-like files for the APPLY step\n", "input_dir = \"apply/input/\"\n", "\n", "unzipped_apply = []\n", "for ext in config[\"input\"][\"file_extensions\"]:\n", "    unzipped_apply.extend(glob(os.path.join(input_dir, f\"*.{ext}\")))\n", "print(\"unzipped files:\\t\", unzipped_apply)\n", "\n", "# 2. Define APPLY output directory (and create if missing)\n", "output_dir = \"apply/output\"\n", "os.makedirs(output_dir, exist_ok=True)\n", "print(\"output directory:\\t\", output_dir)\n", "\n", "# 3. Kmerize/Vectorize the APPLY input files (Snakemake-style)\n", "vector_dir = os.path.join(output_dir, \"vector\")\n", "kmer_dir = os.path.join(output_dir, \"kmerize\")\n", "os.makedirs(vector_dir, exist_ok=True)\n", "os.makedirs(kmer_dir, exist_ok=True)\n", "\n", "for fa in unzipped_apply:\n", "    base = os.path.basename(skm.utils.split_file_ext(fa)[0])\n", "    out_npz = os.path.join(vector_dir, f\"{base}.npz\")\n", "    out_kmer = os.path.join(kmer_dir, f\"{base}.kmers\")\n", "\n", "    run_vectorize_like_snakemake(\n", "        fasta_path=fa,\n", "        alphabet=config[\"alphabet\"],\n", "        k=config[\"k\"],\n", "        output_npz=out_npz,\n", "        output_kmerobj=out_kmer,\n", "        basis_path=None,\n", "        min_filter=0,\n", "    )\n", "    print(f\"Vectorized {fa} to {out_npz}\")" ] }, { "cell_type": "markdown", "id": "66c88ae1", "metadata": {}, "source": [ "### Rule 2: Apply\n", "\n", "In this step, we compare each new sequence to the trained model by computing cosine similarity between:\n", "\n", "- the family-level kmer count vectors stored in `kmer_counts_total.csv`, and\n", "\n", "- the kmer count vector for each input sequence produced in the previous rule.\n", "\n", "This produces a predicted annotation for each sequence based on which family yields the highest cosine similarity. 
In other words, we evaluate how closely each new sequence’s kmer composition matches the learned kmer profiles of all known families.\n", "\n", "**Optional Output** (not shown):\n", "If you set `save_apply_associations: True` under `learnapp` in the config file, Snekmer will also generate a full cosine similarity matrix showing the similarity of each sequence to every family. These matrices can be extremely large and may consume substantial storage space.\n" ] }, { "cell_type": "code", "execution_count": 33, "id": "b5d6c72d", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--- Results for test_sequences_1.npz ---\n", " Sequence Prediction Score delta Confidence\n", "0 tr|A0A2S8EUS7|A0A2S8EUS7_9RHOB TIGR01783 0.199116 0.00 0.383333\n", "1 tr|A0A401ZGP4|A0A401ZGP4_9CHLR TIGR00757 0.315537 0.02 0.921569\n", "2 tr|A0A427BXE3|A0A427BXE3_9GAMM TIGR01023 0.198284 0.08 1.000000\n", "3 tr|J2LWC9|J2LWC9_9BURK TIGR00797 0.210697 0.00 0.383333\n", "4 tr|A0A2Z5TEJ7|A0A2Z5TEJ7_9GAMM TIGR00229 0.225655 0.00 0.383333\n", "... ... ... ... ... 
...\n", "2995 tr|I0V6L5|I0V6L5_9PSEU TIGR00456 0.172414 0.01 0.654545\n", "2996 tr|A0A4Q8RSA0|A0A4Q8RSA0_9BACT TIGR04183 0.393345 0.01 0.654545\n", "2997 tr|A0A2T0N360|A0A2T0N360_9ACTN TIGR01733 0.206815 0.01 0.654545\n", "2998 tr|A0A516GAF1|A0A516GAF1_9MICO TIGR01409 0.188977 0.02 0.921569\n", "2999 tr|A0A6I4VTV3|A0A6I4VTV3_9BACL TIGR00738 0.138139 0.03 0.956522\n", "\n", "[3000 rows x 5 columns]\n" ] } ], "source": [ "# This cell runs the logic from `apply.py`\n", "apply_output_dir = os.path.join(output_dir, \"apply\")\n", "os.makedirs(apply_output_dir, exist_ok=True)\n", "\n", "# Get the 3 input files from the /apply/ directory\n", "confidence_associations_path = glob(\"apply/confidence/global_confidence_scores.csv\")[0]\n", "compare_associations_path = glob(\"apply/counts/kmer_counts_total.csv\")[0]\n", "decoy_stats_path = glob(\"apply/stats/family_summary_stats.csv\")[0]\n", "\n", "# Get params from config (learn_apply section)\n", "la_cfg = config.get(\"learn_apply\", {})\n", "selection_type = la_cfg.get(\"selection\", \"top_hit\")\n", "threshold_type = la_cfg.get(\"threshold\", \"Median\")\n", "\n", "vector_dir = os.path.join(output_dir, \"vector\")\n", "for fa in unzipped_apply:\n", " base_name = os.path.basename(skm.utils.split_file_ext(fa)[0])\n", " base_npz = f\"{base_name}.npz\"\n", " input_data_path = os.path.join(vector_dir, base_npz)\n", "\n", " out_name_summary = f\"kmer_summary_{base_name}.csv\"\n", " kmer_summary_out_path = os.path.join(apply_output_dir, out_name_summary)\n", " out_name_seq_ann = f\"seq_annotation_scores_{base_name}.csv\"\n", " seq_ann_out_path = os.path.join(apply_output_dir, out_name_seq_ann)\n", "\n", " # Use the run_apply function from the imported script\n", " apply_script.run_apply(\n", " compare_associations=compare_associations_path,\n", " data=input_data_path,\n", " confidence_associations=confidence_associations_path,\n", " decoy_stats=decoy_stats_path,\n", " seq_ann_out=seq_ann_out_path,\n", " 
kmer_summary_out=kmer_summary_out_path,\n", " selection_type=selection_type,\n", " threshold_type=threshold_type,\n", " config=config,\n", " )\n", "\n", " print(f\"\\n--- Results for {base_npz} ---\")\n", " print(pd.read_csv(kmer_summary_out_path))" ] }, { "cell_type": "markdown", "id": "006e1df4", "metadata": {}, "source": [ "### Filtering High-Confidence Predictions\n", "\n", "After running the Apply step, Snekmer produces a summary table containing the predicted family, cosine similarity scores, deltas, and confidence values for each sequence. In this section, we load the full Apply results and filter them to keep only high-confidence predictions (confidence ≥ 0.95).\n", "\n", "This provides a quick way to inspect the most reliable annotations and optionally save or further analyze them.\n", "\n", "Note: All sequences will receive a prediction. In small datasets with no overlapping kmers, this will result in a `Score` of 0. These \"predictions\" should be removed as there is no reliable signal for their assignment." ] }, { "cell_type": "code", "execution_count": 34, "id": "44873297", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "High-confidence (>= 0.95) rows for test_sequences_1.npz:\n", " Sequence Prediction Score delta Confidence\n", "2 tr|A0A427BXE3|A0A427BXE3_9GAMM TIGR01023 0.198284 0.08 1.000000\n", "5 tr|A0A6G7WF57|A0A6G7WF57_9LACT TIGR03534 0.297106 0.08 1.000000\n", "10 tr|C5RAN2|C5RAN2_WEIPA TIGR01017 0.290899 0.11 1.000000\n", "11 tr|A0A2P1NPL9|A0A2P1NPL9_9BURK TIGR00593 0.477110 0.14 1.000000\n", "14 tr|A0A1Y0FXC9|A0A1Y0FXC9_9GAMM TIGR00350 0.166781 0.03 0.956522\n", "... ... ... ... ... 
...\n", "2990 tr|A0A3N0DZ49|A0A3N0DZ49_9ACTN TIGR00055 0.346216 0.15 1.000000\n", "2991 tr|S2XWT3|S2XWT3_9ACTN TIGR00350 0.279564 0.05 1.000000\n", "2992 tr|A0A1I6MJA1|A0A1I6MJA1_9BACT TIGR00377 0.250285 0.11 1.000000\n", "2994 tr|A0A5B7WSU0|A0A5B7WSU0_9MICC TIGR01060 0.542388 0.32 1.000000\n", "2999 tr|A0A6I4VTV3|A0A6I4VTV3_9BACL TIGR00738 0.138139 0.03 0.956522\n", "\n", "[995 rows x 5 columns]\n" ] } ], "source": [ "results_df = pd.read_csv(\"apply/output/apply/kmer_summary_test_sequences_1.csv\")\n", "print(f\"High-confidence (>= 0.95) rows for {base_npz}:\")\n", "print(results_df[results_df[\"Confidence\"] >= 0.95])" ] }, { "cell_type": "markdown", "id": "b1e1f311", "metadata": {}, "source": [ "## Apply Pipeline Complete\n", "\n", "Output is located in `output/apply/kmer_summary_{input file name}.csv`, relative to the Apply working directory (`apply/` in this tutorial).\n", "\n", "Each output file contains five columns:\n", "\n", "- Sequence - The identifier of the sequence being annotated.\n", "\n", "- Prediction - The predicted TIGRFAM family (or other annotation label).\n", "\n", "- Score - The cosine similarity between the sequence and the top-matching family kmer profile.\n", "\n", "- delta - The margin between the highest and second-highest cosine similarity scores (a measure of separation).\n", "\n", "- Confidence - The estimated reliability of the prediction, looked up from the global confidence table (`global_confidence_scores.csv`) built during the Learn pipeline. Confidence tends to be higher for families with more training sequences.\n" ] }, { "cell_type": "markdown", "id": "43ce5560", "metadata": {}, "source": [ "### Post-hoc Evaluation of Apply Predictions\n", "\n", "After generating predictions with Snekmer Apply, we can perform a post-hoc assessment by comparing the predicted annotation for each sequence against the known TIGRFAM annotations. 
This step provides a deeper view into model performance by computing accuracy, error types, per-family precision, and per-family recall.\n", "\n", "Because this demo uses a small training dataset (200 families, 5,000 training sequences), the accuracy and confidence values shown here represent a lower bound of expected performance. In real analyses with thousands of genomes in the training set, Snekmer’s accuracy and confidence calibration improve substantially.\n", "\n", "The following code evaluates high-confidence predictions (≥ 0.95), matches them to ground-truth annotations, and summarizes correctness, precision, and recall.\n", "\n", "Please note that Snekmer Learn/Apply was not intended to be run on training sets this small, so it will have a high false positive rate here. Confidence scores are usually enough to filter unreliable results, but with small datasets we may need the secondary filters of **score** and **delta**.\n" ] }, { "cell_type": "code", "execution_count": 58, "id": "da2cfa17", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total: 3000 kept: 995\n", "TP:773 FP:87 Filtered(Known):1140 Predicted(Unknown):135 Filtered(Unknown):865\n", "precision: 0.8988 recall: 0.4041\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAhwAAAEiCAYAAACyZgs8AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjcsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvTLEjVAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAOGxJREFUeJzt3Qm8TPX/x/GPfc2+q4jKUpaiUJZCFBGpJIWS+vlpkWRps1VKC61KfkhUiFJEtrJll+xbESpb9iVb83+8v4/Hmf/M3dx73eNeM6/n43Fcc+bMOd85d+6cz/l8t3SBQCBgAAAAPkrv584BAAAIOAAAwHlBhgMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAOIUDfddJNbolHM975161ZLly6djRgxIsWOUbJkSWvXrl2K7S+S6Fz37t07tYuBNIaAAxHtgw8+cF9+1apVs7TolVdesa+//jrZr1+7dq37YtcFNa348ccf3Tn3lkyZMlmpUqWsTZs29ttvv9mF5KeffnLn98CBA5ZWKGgKPb8ZM2a04sWLu+Dnjz/+sLQoLZ5HnH8ZU+GYwHkzevRodye6ePFi27x5s11++eVpLuC46667rFmzZskOOPr06ePu5vU+Q02bNs1S0xNPPGHXXXednTp1ypYvX25DhgyxyZMn26pVq6xYsWLntSwlSpSw48ePu+AnqRdKnV9dzPPkyRP23IYNGyx9+tS7Z+vbt69ddtll9s8//9jChQtdIDJv3jxbvXq1Zc2a1dKShM4jogcZDkSsLVu2uC+6t956ywoWLOiCj2iSOXNmt6SWWrVq2f33328PPvigvfvuu/bGG2/Yvn377JNPPon3NUePHvWlLMoE6CKcIUOGFNtnlixZkhzApKTbbrvNnd+HH37Yhg4dal27drVff/3Vvvnmm1QrE5AQAg5ELAUYefPmtcaNG7ssQlwBh1e3r4uh7sBLly7tLiS6M1+yZEnYtro7y5kzp0tbKyOh/yuQ0Rf9mTNnYl04n376abvkkkvc/sqUKeOOETo5s46r7XQB9tLjXpuA33//3f773/+612XLls3y589vd999d1jVie5otU5uvvnm4D5UpRFfG47du3db+/btrXDhwu4CXKlSpVgBQFLOSVLUrVs3GAiKUuw6jrI09913n/td1axZM7j9qFGjrEqVKu7958uXz+69917bvn17rP16ZdR2119/vc2dOzfWNvG14Vi/fr3dc8897veo1+t8P/fcc8HyPfPMM+7/yiR459f7HcTVhkNVRvqdqLzZs2e36tWru6xOXFVOY8eOtZdfftkuvvhi97uoV6+ey8KdS4AnCjpivkd9/lUmHadq1aqxghJloZSBuOKKK9w2+rzpdzF9+vSztgnSOYiZXQt1tvOoY+hYynzob0q/g2effTbZ5wFpF1UqiFgKMO688053l9+qVSsbPHiwu2DqwhnTZ599ZocPH7ZHH33UfRkOGDDAvVYXkNC7WAUWDRs2dG1CdEGeMWOGvfnmm+6C17FjR7eNgoqmTZvaDz/84C7ulStXtu+//9596SpYGThwoNvu008/dXenukg+8sgjbp32IyqnsjO6yOqCpC9nlV9f+LpA62JWu3ZtV23xzjvvuC/ocuXKudd6P2NSlYJer4vaY4895r78x40b5y4Yqlt/8sknk3VOEsu7EOpiFkoXaF3oVL3kBWS6EL/wwgsuGNA52rNnj8uS6D3//PPPwbT8//73P1e+G264wTp37uzKpnOvi6uCvYSsXLnSXaT1XnT+ddFUGb/99lt3fL3XjRs32ueff+5+ZwUKFHCvU3ASl127drlyHDt2zP1e9D4VzKk8X375pTVv3jxs+1dffdVVyShgPXjwoDu/rVu3tkWLFllyeBdwBW6eNWvW2I033ujaePTo0cNy5MjhAh0FzOPHjw+WSUFB//79g5/HQ4cO2dKlS11V2C233GLnIqHzqPLdfvvtVrFiRVd
FpMBWn8/58+ef0zGRRgWACLR06VJduQLTp093j//999/AxRdfHHjyySfDttuyZYvbLn/+/IF9+/YF10+cONGt//bbb4Pr2rZt69b17ds3bB/XXHNNoEqVKsHHX3/9tdvupZdeCtvurrvuCqRLly6wefPm4LocOXK4/cZ07NixWOsWLFjg9jty5MjgunHjxrl1P/zwQ6zt69Sp4xbPoEGD3LajRo0Krjt58mSgRo0agZw5cwYOHTqU5HMSF5VF2w0bNiywZ8+ewJ9//hmYPHlyoGTJku79L1myxG3Xq1cvt12rVq3CXr9169ZAhgwZAi+//HLY+lWrVgUyZswYXK+yFypUKFC5cuXAiRMngtsNGTLE7Tf0vXvvafjw4cF1tWvXDlx00UWB33//Pew4+qx4Xn/9dfc6vT6mEiVKhP3uOnfu7LadO3ducN3hw4cDl112mXvvZ86cCTs/5cqVCyv322+/7dbrfSZE70HbzZgxw53f7du3B7788stAwYIFA1myZHGPPfXq1QtUqFAh8M8//4S9vxtuuCFwxRVXBNdVqlQp0Lhx4wSPG/Pz5NE50LkIpfLp93u28zhw4EC3Xu8DkY8qFURsdkPVBqpqEN2ht2zZ0r744otY1R+i50LvDL30dFy9Kv7zn/+EPda2odt99913rq2A7nJDqYpF38VTpkw5a/mV3g9Nd//999+uwavu7HXXmRwqV5EiRVy2x6O7e5XzyJEjNnv27GSfk7g89NBD7i5WDURVreVVHymln9D5nDBhgv37778uu7F3797gorIrE6LMkegOXFVEen1oWxVlbHLnzp1g2ZQxmTNnjivjpZdeGvacPivJPb/KDoRWC6mKQNkTZR+UmQqlti2h5U7q+a1fv747v8rkqMpE2QtVlSgjJmovM2vWLHcelanyzqM+S8rSbdq0KdirRZ8rZRu07nzyMlUTJ050v3NENgIORBwFFAosFGyovYBStFpUDaK098yZM2O9JuZFx7vQ7t+/P2y96rdjptS1beh2an+hi+xFF10Utp1X1aHnz0bVHy+++GKwDYjS0Dquqj6Ufk8OHVcX7Jg9K+IrV2LPSXxUftXP66Kn6os///zTHnjggVjbqWonlC56CsxUVr3n0GXdunUuyAgtr7YL5XXDTYh3Ub/66qstpag8an8Qk1/n9/3333fnV9U1jRo1csGEPisefeZ1HlU1FfM89urVy23jnUtVZ+izdeWVV1qFChVc9Z9+Z35TUKsqH1Xl6AZBVYiq8iH4iEy04UDE0QXur7/+ckGHlriyHw0aNAhbF1/vhdBGngltl9Ief/xxGz58uGuXUKNGDXfHrjtvfSGfry/jxJ6T+OjCpbvwpGRzRO9P71WZoLjKoKxBJDjX86tsipctUpsMZVbU+FbddXWOvM+J2ogooxEXr5u42sao/YoyDepOrV4vam/x4YcfumBA9DuJq2xxZQwTS797ZZqUtVLj2qlTp9qYMWNcA2OV43z9veH8IOBAxFFAUahQIXcHGJPS9V999ZX7Io15oUvJMR/UmFRp7NAsh3oLeM+fLX2vu9a2bdu6BqkejbcQc+CkpKT/dVzdtepCFJrliKtcqUkNZ3VhU+ZDd9zx8cqrjIjXA8arglJmSz1w4uNlQDRmRUKSen51sY/pfJxfXZjV6FNZvffee881EPXeozI+iQn81NBW1TxaVMWmIESNSb2AQxmYuKp7EpOxS+g86rOoHjpa1IVdjYfVU0hBSGLKjQsHVSqIKKqKUFChlu+q1465qHeGAgE/xypQelt3ffriD6U7Rn3xavwEj+rd4xp9UReQmHeT6qUR825Sr5fEjOCocu3cudPdQXpOnz7t9qs74jp16lhaoF4Nev/qphnzHOix2iCI7u5VPaDg8eTJk8Ft1PX1bOdDr9MFddiwYbZt27ZYx0ju+dUAcwsWLAiuU7sVddtVD5jy5cubn9QDSVmPQYMGueBUQbfWffTRRy7jF1c7Fo93Tj36PCj7ceL
EibBAUMFT6Ot++eWXRPUoie88qp1JTOrVJaHHRmQgw4GIokBCAYW6IsZF4yJ4g4Cp/tgPTZo0cXeauktTY0HdaSs9rHS1qki8rq+icSaUDdGdndp96K5ebU0UMKnbrKpSdKHSRUzbxexSqi9nXZxfe+0117ZDdfi629fFJiY1XtTFR40qly1b5i6CyqTogqGLVMw2J6lF5+ell16ynj17uvOn6gKVTVkLZaf0PlRNoDt3badusXrP+n1qG1VFna0Nh6g7saohrr32WrdPnXsdT6n9FStWBH8/ot+lqrN0TP1+vQtoKGUV1PVTAaUa4ipjoEayKpO6oJ6PUUnV9kLdjBV0qTGtsnx6j6re6tChgzsvasekz9OOHTtcwCD6jCk40ftVudUgV58NBegeNbDV51TVM+rurfYfCvauuuoq1402IfGdR7UdUZWKGhUrA6R9ajoCNXwNbXyLCJHa3WSAlNSkSZNA1qxZA0ePHo13m3bt2gUyZcoU2Lt3b7C7pLrtxRSza5+6/6kba0xe985Q6g751FNPBYoVK+aOpS6IOkZol0tZv369656ZLVs2tw+vm+X+/fsDDz74YKBAgQKuy2rDhg3dtjG7YsrHH38cKFWqlOtKGtpFNq5ujLt27QruN3PmzK7LZGhXUUnKOYmL1+1TXXYT4p23+LpEjh8/PlCzZk13zrWULVs20KlTp8CGDRvCtvvggw9c11N1Ca1atWpgzpw5sd57XN1iZfXq1YHmzZsH8uTJ4z43ZcqUCbzwwgth2/Tr1y9QvHjxQPr06cO6dsb1u/j1119d92dvf9dff31g0qRJiTo/8ZUxvm6xXvfiUOp6W7p0abecPn06WKY2bdoEihQp4j6Lei+3336760rrURdulVXl1mdR51rdj9X1OJS6VOuzps+OuiN///33ieoWG995nDlzZuCOO+5wfyfap36qm/TGjRsTPAe4MKXTP6kd9AAAgMhGGw4AAOA7Ag4AAOA7Ag4AAOA7Ag4AAOA7Ag4AAOA7Ag4AAOA7Bv5KBA0FrYmnNPhQcmeSBAAg0mhkDQ22qIELzza4HQFHIijY0KydAAAgtu3bt7sRYhNCwJEI3pDPOqG5cuVKzEsAAIh4hw4dcjfkiZkagYAjEbxqFAUbBBwAAIRLTHMDGo0CAADfEXAAAADfUaUCpHIL79OnT9uZM2f4PQBIszJlymQZMmS4cAOO/v3724QJE2z9+vWWLVs2u+GGG+y1116zMmXKBLf5559/7Omnn7YvvvjCTpw4YQ0bNrQPPvjAChcuHNxm27Zt1rFjR/vhhx8sZ86c1rZtW7fvjBn//+39+OOP1qVLF1uzZo1r4PL8889bu3btzvt7BjwnT560v/76y44dO8ZJAZDm22ioF4qusRdkwDF79mzr1KmTXXfdde4u79lnn7UGDRrY2rVrLUeOHG6bp556yiZPnmzjxo2z3Llz22OPPWZ33nmnzZ8/3z2vO8PGjRtbkSJF7KeffnJf4G3atHHR2CuvvOK22bJli9vmP//5j40ePdpmzpxpDz/8sBUtWtQFMEBqjO2iz6XuGNR/PXPmzIzxAiDNZmL37NljO3bssCuuuCL5mY5AGrJ79+6AijR79mz3+MCBA4FMmTIFxo0bF9xm3bp1bpsFCxa4x999910gffr0gZ07dwa3GTx4cCBXrlyBEydOuMfdunULXHXVVWHHatmyZaBhw4aJKtfBgwfdMfUTSAnHjx8PrF27NnD06FFOKIA079ixY+47S99dyb0+pqlGowcPHnQ/8+XL534uW7bMTp06ZfXr1w9uU7ZsWbv00kttwYIF7rF+VqhQIayKRVkL9Q1W9Ym3Teg+vG28fQCp5Wwj8wFAWpASo2xnTEsp5s6dO9uNN95oV199tVu3c+dOl2rOkydP2LYKLvSct01osOE97z2X0DYKSo4fP+7aj4RSWxEtHm0HRLrKlSsH25Zs2LDBBfKiNlVqW1W6dGm3TulVtY9644037Oabb07
lUgO4UKSZgENtOVavXm3z5s1L7aK4Bqd9+vTx/TjdunXz/RhImzQqX7169VywG9q4ufr093w53sJbHjvrNpMmTQqOqHvrrbcGH3vr1FjMWzdlyhS76667bMWKFfHe+ZxtmGOkvAbdJvtyWqcNaJyo7UqWLGlZsmRxn2sFrvpe13IudF24/fbbbevWrW6aiZYtW9rcuXMTfM2gQYPs3nvvdW37kqpr167us967d+84n7/77rtdB4QaNWq4jgcK1HWz7NHrDhw44MqQkBEjRtjXX3/tlrRC7/3aa6+1++67z5f9p4l8rhqC6otMvUxCv6T0YdGHVr+8ULt27Qp+kPRTj2M+7z2X0DYaNTRmdkN69uzpqne8RV+2AP7fTTfdZPv27bP9+/dzWhBmzJgxLhBVUKqOACtXroyVzdaSHGpgfbZgQ3Sx9zLcKWnx4sXuc69gIxJ169bNBUx+ddNP1YBDqVkFG1999ZXNmjXLLrvssrDnq1Sp4nqbqFeJR6ledYP1fuH6uWrVKtu9e3dwm+nTp7tgonz58sFtQvfhbRPfh0YRujeMOcOZA7FNnDjRihcvHmxvBcRUokQJVx23ceNGdxFr0aKFazunKnP1Jvz++++tZs2a7nv++uuvdzecHm2v3hB6TkMieJTlCK1iVzs87aNSpUpWsWJF97ns27dvMBOi7IOCH7UF7NGjhzuO1t1zzz3BYFllUbl0vVBbP/XEiM9HH32UpLt/vQ+Vo0mTJm7/devWdQFLTCqvemsOGzYsmCl68cUX3TVK18WXXnopuO3mzZtdOfV+9V68DMmQIUPskUcecf9XT09lHqdNm+Ye65xoOdu+CxUq5KpOvddFVMChVNuoUaPss88+cylmRaRa1K5C1A22ffv2Ln2lD6MakT744IPuRFWvXt1to260+kU+8MAD9ssvv7gPscbY0L4VOIi6w/72228uetOYHxrHY+zYsa7LLYDEOXLkiPti1qK7V+/LEYiLbgT1fatgwAsORo4c6S6GaiOni/F3333nvtd1DdCFXOu9YRC0funSpS7IiIsu3M2aNXNV4PruV2BRq1YtdzFVJsTLtOii/Prrr7uhFpSh0Dq1RdJ1Qp544gkXiKhcn3zySayb01Aaz6latWpJ+oUvWrTIVZ+sXbvWXdAVtMQ8T7fccou9/PLL9tBDDwXXK7Ovc7ZkyRJX/j/++MOtb926tavWUeZI50nXyN9//90FITNmzAi7oQ59HNpxIr59x3eDHhFtOAYPHhxMz4YaPnx4cFCugQMHupb8io5DB/7yqD+wqmM08JdOlD5UGvjLi+ZEUZw+xAow3n77bVdtM3ToUMbgAJJA9doK6IGE6I5eVdXZs2d3QakyFdKoUaNg4/2pU6e6O/XatWsHX6fveWWvdbFTBsKbKPPRRx+Ns22fLpjKoCjI8F4fX8ZNWQBVj48fP949VlW97vRFx1MDaFHWrmnTpvG+N2U/QjsgxNd+KXS92kPlz58/LCPvUU9KHU/l8wIzj5dJKVCggJUqVcqN26Nzsnz58uA4VDq3yvComun+++9363RzrUBDgZgGzdSNgoIdBVUJ7Vvv3WuCoO0jLuBQlcrZZM2a1d5//323JJS6U6ScEAU1P//8c7LKCQBIHGUWvB5PoUJHqNR3v+7qldk4H90xdbx3333XZcTP5XgKojT6tadgwYL2999/h22zd+/e4MXbu4aF3iCfPn06+FiZGN1Iq0lBzIAjodfFV15lMZR93LRpk9WpU8e9bwVZCnRCG6cntG+9v7jaNkZMo1EAQPRQplp34aENSlXd4V00VVVw+PBhd8FU24S4aCoMXVi9RqRqiOq1j1AmwBvXSVT1omy5N42AfnrjNOl4XvWg2nN888038ZZb7SbUjjD0fais3nH1erUjUTCVGHnz5nXVHcpwhGbl46OmB+pFoloAUZZI2R8vU6T3oioSL5uhNiO9evWKNQ5VQtatWxcr+Im4brEAEtd91W+aa8j7Mk5oHdK
exHZfTW2XX365y26oukQXf1VxXHPNNW6dql4UfOjCqsDhtttui/dirQ4HqjZQcKIqlX79+rkGmmqX0aFDB5eRUPuJ7t27u0yC2l94GQGtu+qqq1w1u6rw1RZQmQldpOOjruCqVvQu4OrarmNpPBrtV4vaYlStWjXR5+Kiiy5yVUzNmze3Z555xgUMCdH0HGqX+N5777njqXmABsP0yqNqKa98CnxUXaT1iaEAT1VMamDrh3QabtSXPUcQDfylBqyKmL16xZTAOBzRyxuHQynV0FRnJGEcDkQatYdQZkXtR7z5viLJ1KlTXUcOLTGpqkVtPdQmMrRKJinXR6pUAABIBLVDUdWMLryR6ODBgzZgwADf9h+Zt1YAAPggsdUTF2oPIz+R4QAAAL4j4AAAAL4j4AAAAL4j4AAAAL6j0SiQhhx7voQv+83+0u+J2k4jEmbOnDnY7U0DHRUtWtTNMeHN3ly2bFk3bHLoJFpIGzRipR8SGpsivunpReNRaOhuzYWl3h2ai0RTuWs+E83n8eGHH/o25kMoppy/1rcp55OCgANAGM1VpAGRPG+99ZYblMibtlqDDr3zzjtukiwgMUObxzU/iQKOV199NVkBh4biTqnxa6JhyvmaNWu6Higaxjw1UaUCINH0haUvL00QBSSGRvrU0OIxKXDVCKEKTryROTVbuCZu09DcoTO6etkTjQ6q5zRBJ1PO900TU84nBQEHgDD//e9/w6ahD3X8+HE3tLMuBkBcdCetIEKLhh6Pj6pTNOKuqlc0Db0okOjUqZPLOmiyTa3XXCUeTZSm6d41vDdTzk9PE1POJwVVKgASrFLRZE66cCxcuNA9rl69ugtKgMRUqSjDkRhHjx51F8Vdu3aFDSUeOlma5jzx5kJhyvm1aWLK+aQg4ABwVl4bDsAv3rReCmxD5+pIaIp7ppzPmOpTzicFVSoAgPNOE32pik49n7xgQrOuqiGp588//7QdO3bE+XqmnK+fJqacTwoCDgDAeZcvXz5r06aN63rtNRpV24zNmzfb1Vdf7doJ3Xnnna7dRlzUgPS6665zU85rH6rqU3sQ0ZTzypRoynkdIzFTzntCp5xX1VCjRo2SPeX8Tz/95KacP5vRo0e7qigFBSrP2aac//3335M85byylKmN6ekTgenpkdKYnh5IG6J5yvmkYHp6AADOAVPOnz80GgUARDWmnD8/aMMBAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8Ry8VIA3RtO9+0EBGiaFJnjJnzhwcJlkDKmmgIQ1gpKHNNVaBfmqgpIMHD7q+/Zpsy29du3Z13RfjG1797rvvti5durjyaxtNZDVo0CD33KeffupmHZ08ebIbUCo1aH6QJk2auPOYUtOqAxcaPvkAEpy8TRo0aBDngHjaNjkBh+Z5SKkLr2YW3bdvnws24grg3nvvPfvxxx/dtN2ppXDhwm5wqZEjR9pDDz2UauUAUhNVKgASNHbsWGvfvn2s9T179nSjNGoaew3/LLt377aOHTva7bff7oamVmbBU7JkSTcctWa41DTkp06dsh49erjHGkL6nnvusf3797tt//rrL7dfDU2tIZ3jm09DPvroo+BMmaH69Oljw4YNszlz5gSDDc1cqv21atXKlU/DVf/222/B12habwVbeq5169YuiyOadVPzeojKqeBBTpw4Yfnz53c/z7ZvrVdZgWhFwAEgjKae18Vey5QpU+I9O/3793fVHKpe+e6779y6p556ys1dMWnSJPv5559t6dKlNm7cuOBrNC/GokWL3NwRurhrKGllKDQHRmiAoiogBSKaUvuTTz5xc0HER9kLzacRSlU9X3zxhXtOU3OHWrJkib3yyiu2atUqFyC89tprbr3eqwKU+fPnu+dUNgVE3sBQM2bMsH///dd++eUXF4gowzNv3jyrUqWKZcmSJcF9i7ZbuXKlex0QjahSAZBglYoyHIlx7Ngxd7Heu3eve5wpUyaXAdmwYUNwm3b
t2lm6dOnc/7/++mt34R4/frx7rFlDlQURBRhvvPFGMLvQtGnTeI+r7IeqLEIpu6CLvgKf+++/P+w5Vb14GQ/9X1OciwKKli1bWp48edxjZWrUNkQUPOh5nRdNsKXjKZhRm5bQUSrj27eoCilv3rwuU6KZUoFoQ8ABIEVoVkqZOHGia3R68cUXx9pGGZHQ7XVBjqt9SExekBKX7Nmzu4mlQpUtW9YGDhzoAgVlJZR18XgNYiVDhgyuPcnZjqn9qArJq+JRwKEARAHH4MGDE71vlTNbtmxnfb9AJKJKBUCyKHjQBVSZCVEVhNo2KEPi0d18fO0vmjVr5oICZUZEP9esWeP+r4u6qje89hzffPNNvOVQT5rQLIqnXLlyLlOiQGH48OFnfT86prI5XpWH2lt4wVCxYsUsd+7c9uGHH7rtNHW5sidbt261a6+91hLbU0VBzCWXXJKo7YFIQ4YDSEMS2301LVD1QIsWLdxFWVkGteNQr5C+ffu6agZ1r1UQogt3XNkONSBVY0u1v/CyCVqnaou3337bVb8oo6Aqlbp168Zbjrvuusu1I1EgEJMyHbNmzXLlOXPmTII9Y2677TZbvXq1qwpJnz69C2RCgyftX0FGqVKl3GO1DbnmmmvctomdJrx58+aJ3h6INOkCXh4U8dIdj+5uVN+cknWv3bp146xHqYsuushdBHXnHKnjMsQVZPhB7USUWVH1hgKctKpWrVo2ZMgQl3kBLjTKZm7ZssW1UQqtOkzK9ZFQG8AFX7Wjqhl9GaZVqk5RI1SCDUSzyLy1AhBVQnuKpEVqZBrXWCFANCHDAQAAfEfAAaQCNZ3yFgBI61Liu4oqFSAVqAvo8ePH3SRjamilMRsiTcyxMQBcuMHGnj17XG8yDeiXXAQcQCrQYFRz5851w3mre2UkdpVUQAUgMqRLl871PDuXmyMCDiAVL8iaR0TzcOiuIaHRNC9EzzzzTGoXAUAK0XfUuWZiCTiAVKbBr7REmtC++gAQeXlcAACQ5qRqwDFnzhxr0qSJG21R6WTNHhnKm1kydLn11lvDttm3b5+1bt3aNbzTLI/t27d3Iw+G0pTQGuVPd1yax2DAgAHn5f0BAIA0EHAcPXrUTfX8/vvvx7uNAgxN3uQtn3/+edjzCjY04dP06dPdPAcKYh555JGwYVc110OJEiVs2bJl9vrrr1vv3r3dEMMAAOD8SNU2HJosSUtC1KBOrfjjsm7dOjch0pIlS6xq1apunaa7btSokb3xxhsuczJ69Gg3m6VmntRkUpoYasWKFfbWW2+FBSYAACCK23D8+OOPVqhQIStTpoybi+Dvv/8OPqfJmlSN4gUb3oyO6mK4aNGi4Da1a9d2wYanYcOGbjrr/fv3x3lMNeBTZiR0AQAAERpwqDpl5MiRNnPmTHvttdds9uzZLiOiaaZl586dLhgJpZk38+XL557zttE8BqG8x942MfXv39/NfuctavcBAAAitFvsvffeG/y/BkiqWLGilS5d2mU9/JysqWfPntalS5fgY2U4CDoAAIjQDEdMpUqVsgIFCtjmzZvdY7Xt2L17d9g2p0+fdj1XvHYf+qmpoUN5j+NrG6J2I+r1EroAAIAoCTh27Njh2nAULVrUPa5Ro4abi0K9TzyzZs1yw0ZXq1YtuI16rpw6dSq4jXq0qE1I3rx5U+FdAAAQfVI14NB4GeoxokW2bNni/r9t2zb3nIZGXrhwoW3dutW147jjjjvs8ssvd40+pVy5cq6dR4cOHdwQ0fPnz7fHHnvMVcWoh4rcd999rsGoxudQ99kxY8bY22+/HVZlAgAAIjjgWLp0qV1zzTVuEQUB+v+LL77oxmzXgF1Nmza1K6+80gUMVapUcRNeqcrDo26vZcuWdW061B22Zs2aYWNsqNHntGnTXDCj1z/99NNu/3SJBQAgShqN3nTTTW7a2/h8//33Z92HeqR89tlnCW6jxqYKVAAAQOq4oNpwAACACxMBBwAA8B0BBwAA8B0
BBwAA8B0BBwAA8B0BBwAASJsBx/Lly23VqlXBxxMnTrRmzZrZs88+66aCBwAAOOeA49FHH7WNGze6///2229uZM/s2bPbuHHjrFu3bsnZJQAAiGDJCjgUbFSuXNn9X0FG7dq13eBbI0aMsPHjx6d0GQEAQDQGHBodVBOkyYwZM9yQ4qIp3Pfu3ZuyJQQAANEZcFStWtVeeukl+/TTT2327NnWuHFjt17zlRQuXDilywgAAKIx4Bg0aJBrOKqZWZ977jk3g6t8+eWXdsMNN6R0GQEAQDRO3qbJ0EJ7qXhef/11N8srAABAiozDceDAARs6dKj17NnT9u3b59atXbvWdu/endxdAgCACJWsDMfKlSutXr16lidPHtu6dat16NDBTRM/YcIE27Ztm40cOTLlSwoAAKIrw9GlSxd78MEHbdOmTZY1a9bgevVWmTNnTkqWDwAARGvAsWTJEjf4V0zFixe3nTt3pkS5AABAtAccWbJksUOHDsU5IFjBggVTolwAACDaA46mTZta37597dSpU+5xunTpXNuN7t27W4sWLVK6jAAAIBoDjjfffNOOHDlihQoVsuPHj1udOnXcWBwXXXSRvfzyyylfSgAAEH29VHLnzm3Tp0+3+fPn2y+//OKCj2uvvdbq16+f8iUEAADRGXB4brzxRrcAAACkeJXKE088Ye+8806s9e+995517tw5ObsEAAARLFkBh6agjyuzoXlUNJ8KAADAOQccf//9t2vHEVOuXLmYnh4AAKRMwKEeKVOnTo21fsqUKVaqVKnk7BIAAESwjMkd2lxT0+/Zs8fq1q3r1s2cOdN1l9XU9QAAAOcccDz00EN24sQJN+ZGv3793LqSJUva4MGDrU2bNsnZJQAAiGDJ7hbbsWNHtyjLkS1bNsuZM2fKlgwAAESMcxqHQ5g7BQAA+NJodNeuXfbAAw9YsWLFLGPGjJYhQ4awBQAA4JwzHO3atXOTtb3wwgtWtGhRN3kbAABAigYc8+bNs7lz51rlypWT83IAABBlklWlcskll1ggEEj50gAAgIiUrIBDY2306NHDtm7dmvIlAgAAESdZVSotW7a0Y8eOWenSpS179uyWKVOmsOf37duXUuUDAADRGnAwmigAAPA94Gjbtm1yXgYAAKJUstpwyK+//mrPP/+8tWrVynbv3h2cvG3NmjUpWT4AABCtAcfs2bOtQoUKtmjRIpswYYIdOXLErf/ll1+sV69eKV1GAAAQjQGHeqi89NJLNn36dMucOXNwvWaOXbhwYUqWDwAARGvAsWrVKmvevHms9YUKFbK9e/emRLkAAEC0Bxx58uSxv/76K9b6n3/+2YoXL57o/cyZM8eaNGni5mTR8Ohff/112PMaXOzFF190w6drRtr69evbpk2bYnXBbd26teXKlcuVq3379sEqHs/KlSutVq1aljVrVjdo2YABA5L8ngEAwHkOOO69917r3r277dy50wUK//77r82fP9+6du1qbdq0SfR+jh49apUqVbL3338/zucVGLzzzjv24YcfuvYiOXLksIYNG9o///wT3EbBhhqqqnpn0qRJLoh55JFHgs8fOnTIGjRoYCVKlLBly5bZ66+/br1797YhQ4Yk560DAIBkSBdIxhjlJ0+etE6dOtmIESPszJkzbsZY/bzvvvvcuuTMGKvA5auvvrJmzZq5xyqWMh9PP/20C2Tk4MGDVrhwYXcMBT3r1q2z8uXL25IlS6xq1apum6lTp1qjRo1sx44d7vWDBw+25557zgVHXnsTtUFRNmX9+vWJKpuClty5c7vjK5OSUrp165Zi+wLSGjKJQOQ7lITrY7IyHLpwf/zxx65rrLIKo0aNchfvTz/9NMWmp9+yZYsLElSN4tGbqlatmi1YsMA91k9Vo3jBhmj79OnTu4yIt03t2rXDGrcqS7Jhwwbbv39/ipQVAAD4MPCX59JLL3WLHxRsiDIaofTYe04/1VA1lLIt+fLlC9vmsssui7UP77m8efP
GOvaJEyfcEhrBAQCA8xxwPPTQQwk+P2zYMLuQ9e/f3/r06ZPaxQCQCmbNmsV5R8SqW7duqh07WVUqqooIXTTSqP5INQjYgQMHUqRgRYoUcT937doVtl6Pvef00xvl1HP69GnXcyV0m7j2EXqMmHr27Onqo7xl+/btKfKeAACIVsnKcKhxZ0zqqdKxY0c3g2xKUDWIAoKZM2da5cqVg1Ubapuh40iNGjVcgKPeJ1WqVHHrFPioLGrr4W2jRqOnTp0KzmqrHi1lypSJszpFsmTJ4hYAAJDKc6nE2lH69NalSxcbOHBgol+j8TJWrFjhFq+hqP6/bds212ulc+fObkTTb775xg02pi636nni9WQpV66c3XrrrdahQwdbvHix65r72GOPuR4s2k7Uc0YNRjU+h7rPjhkzxt5++21XVgAAcAE0Go1JvVZUpZFYS5cutZtvvjn42AsCNButur6q26jG6tC4Gspk1KxZ03V71QBentGjR7sgo169ei7oadGihRu7I7Rny7Rp01w3XmVBChQo4AYTCx2rAwAApMGAI2Z2QGNmaOTRyZMnJ2nq+ptuusm9Nj7KcvTt29ct8VGPlM8++yzB41SsWNHmzp2b6HIBAIA0EHBoCPNQyiwULFjQ3nzzzbP2YAEAANEnWQHHDz/8kPIlAQAAESvFGo0CAACkaIbjmmuuce0rEmP58uXJOQQAAIj2gENdUT/44AM3cZrGuZCFCxe6bqcaI0NTyQMAAJxTwLFnzx574oknrF+/fmHre/Xq5UblvNCHNgcAAGmgDce4cePcIFwx3X///TZ+/PiUKBcAAIj2gENVJhrVMyatCx2UCwAAINlVKhpyXG011CD0+uuvd+s0x4mqUl544QXOLAAAOPeAo0ePHlaqVCk3J8moUaOC85oMHz7c7rnnnuTsMio9vPvN1C4C4KMBnF0A5z6XigILggsAAODrwF+aTG3o0KH27LPP2r59+9w6VbH88ccfyd0lAACIUMnKcKxcudLq16/vZmLdunWrPfzww24StQkTJrip5UeOHJnyJQUAANGV4dBsse3atbNNmzaF9Upp1KiRzZkzJyXLBwAAojXgWLJkiT366KOx1hcvXtx27tyZEuUCAADRHnBkyZLFDh06FGv9xo0b3TT1AAAA5xxwNG3a1Pr27WunTp1yjzWRm9pudO/e3Vq0aJGcXQIAgAiWrIDjzTfftCNHjlihQoXs+PHjVqdOHStdurTlzJnTXn755ZQvJQAAiL5eKuqdMn36dJs3b57rsaLgo0qVKlavXr2UL2EEq1vnmdQuAuCbHZxbAMnNcCxYsMAmTZoUfFyzZk3LkSOHm6q+VatW9sgjj9iJEyeSsksAABAFkhRwqN3GmjVrgo9XrVplHTp0sFtuucUNd/7tt99a//79/SgnAACIloBjxYoVYdUmX3zxhZu87eOPP3Zjc7zzzjs2duxYP8oJAACiJeDYv3+/FS5cOPh49uzZdttttwUfX3fddbZ9+/aULSEAAIiugEPBxpYtW9z/T5486eZOqV69evD5w4cPW6ZMmVK+lAAAIHoCDg1drrYac+fOtZ49e1r27NmtVq1awefVY0XdYwEAAJLdLbZfv3525513unE3NObGJ598YpkzZw4+P2zYMGvQoEFSdgkAAKJAkgKOAgUKuMnZDh486AKODBkyhD0/btw4tx4AACBFBv6Ki6aoBwAASJGhzQEAAJKCgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAER3wNG7d29Lly5d2FK2bNng8//884916tTJ8ufPbzlz5rQWLVrYrl27wvaxbds2a9y4sWXPnt0KFSpkzzzzjJ0+fToV3g0AANEro6VxV111lc2YMSP4OGPG/y/yU089ZZMnT7Zx48ZZ7ty57bHHHrM777zT5s+f754/c+aMCzaKFCliP/3
0k/3111/Wpk0by5Qpk73yyiup8n4AAIhGaT7gUIChgCGmgwcP2v/+9z/77LPPrG7dum7d8OHDrVy5crZw4UKrXr26TZs2zdauXesClsKFC1vlypWtX79+1r17d5c9yZw5cyq8IwAAok+arlKRTZs2WbFixaxUqVLWunVrV0Uiy5Yts1OnTln9+vWD26q65dJLL7UFCxa4x/pZoUIFF2x4GjZsaIcOHbI1a9akwrsBACA6pekMR7Vq1WzEiBFWpkwZVx3Sp08fq1Wrlq1evdp27tzpMhR58uQJe42CCz0n+hkabHjPe8/F58SJE27xKEABAAARGnDcdtttwf9XrFjRBSAlSpSwsWPHWrZs2Xw7bv/+/V1wAwAAoqRKJZSyGVdeeaVt3rzZtes4efKkHThwIGwb9VLx2nzoZ8xeK97juNqFeHr27OnaiHjL9u3bfXk/AABEiwsq4Dhy5Ij9+uuvVrRoUatSpYrrbTJz5szg8xs2bHBtPGrUqOEe6+eqVats9+7dwW2mT59uuXLlsvLly8d7nCxZsrhtQhcAABChVSpdu3a1Jk2auGqUP//803r16mUZMmSwVq1auW6w7du3ty5duli+fPlcUPD444+7IEM9VKRBgwYusHjggQdswIABrt3G888/78buUFABAADOjzQdcOzYscMFF3///bcVLFjQatas6bq86v8ycOBAS58+vRvwS4081QPlgw8+CL5ewcmkSZOsY8eOLhDJkSOHtW3b1vr27ZuK7woAgOiTLhAIBFK7EGmdeqkoo6L2HClZvXLx8B4pti8grdnx4Kt2IZo1a1ZqFwHwjTduVWpcHy+oNhwAAODCRMABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8R8ABAAB8l9H/QwDAhePVqcdTuwiAb+rWtVRDhgMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPiOgAMAAPguqgKO999/30qWLGlZs2a1atWq2eLFi1O7SAAARIWoCTjGjBljXbp0sV69etny5cutUqVK1rBhQ9u9e3dqFw0AgIgXNQHHW2+9ZR06dLAHH3zQypcvbx9++KFlz57dhg0bltpFAwAg4kXF5G0nT560ZcuWWc+ePYPr0qdPb/Xr17cFCxbE2v7EiRNu8Rw8eND9PHToUIqW69/j/38MINKk9N/L+XL6xLHULgJwwfxdevsLBAJn3TYqAo69e/famTNnrHDhwmHr9Xj9+vWxtu/fv7/16dMn1vpLLrnE13ICkSR3p0GpXQQAMeR+x3xx+PBhy507d4LbREXAkVTKhKi9h+fff/+1ffv2Wf78+S1dunSpWjYkPwpXwLh9+3bLlSsXpxFIA/i7vPAps6Fgo1ixYmfdNioCjgIFCliGDBls165dYev1uEiRIrG2z5Ili1tC5cmTx/dywn8KNgg4gLSFv8sL29kyG1HVaDRz5sxWpUoVmzlzZljWQo9r1KiRqmUDACAaREWGQ1RF0rZtW6tatapdf/31NmjQIDt69KjrtQIAAPwVNQFHy5Ytbc+ePfbiiy/azp07rXLlyjZ16tRYDUkRmVRFpjFYYlaVAUg9/F1Gl3SBxPRlAQAAOAdR0YYDAACkLgIOAADgOwIOAADgOwIOAECK0Gzc6gEIxIWAA2lOu3btrFmzZmHrvvzyS8uaNau9+eabqVYuIFLddNNN1rlz51jrR4wYwaCHSDFR0y0WF66hQ4dap06d3Ay/jJsCABcmMhxI0wYMGGCPP/64ffHFF8FgQ3djTzz
xhHXr1s3y5cvnhqfv3bt32Ou2bdtmd9xxh+XMmdMNm3zPPfcEh7bX7L8a6n7p0qXBUWe1n+rVqwdfP2rUqOBkfVu3bnVz6EyYMMFuvvlmy549u1WqVCnOmYaBSM88vvHGG1a0aFE3t5RuBE6dOpXgzYKmhfBGeeZvN7oRcCDN6t69u/Xr188mTZpkzZs3D3vuk08+sRw5ctiiRYtcUNK3b1+bPn16MIBQsKEJ92bPnu3W//bbb27wN2/cfw389uOPP7rHq1atcgHFzz//bEeOHHHr9Lo6deqEHfO5556zrl272ooVK+zKK6+0Vq1a2enTp8/T2QBS3w8//GC//vqr+6m/QVW5aImL/i579Ohh06ZNs3r16gXX87cbxTTwF5CWtG3bNpA5c2YNSBeYOXNmrOfr1KkTqFmzZti66667LtC9e3f3/2nTpgUyZMgQ2LZtW/D5NWvWuP0tXrzYPe7SpUugcePG7v+DBg0KtGzZMlCpUqXAlClT3LrLL788MGTIEPf/LVu2uNcOHTo01v7WrVvnyzkAzif9TT355JOx1g8fPjyQO3fu4N9liRIlAqdPnw4+f/fdd7u/HY+eHzhwYKBbt26BokWLBlavXh3rOPztRi8yHEiTKlas6Fq8azhyL+sQ8/lQSvHu3r3b/X/dunWuOsSrEpHy5cu71K6eE2Uv5s2bZ2fOnHHZDKV6tSjr8eeff9rmzZvd4/iOqeOJd0wgGlx11VWuOjKuvzuPGnZ//PHH7u9L28fE3270IuBAmlS8eHF38f/jjz/s1ltvtcOHD4c9nylTprDHqhJRVUpi1a5d2+1z+fLlNmfOnLCAQwFIsWLF7Iorroj3mDqeJOWYQFqldk5q2xTTgQMHwqYeT8zfXa1atVwgP3bs2DiPxd9u9CLgQJpVokQJd/HXZHtxBR3xKVeunG3fvt0tnrVr17ovT2U6RNkO3Wm999577guwbNmyLghROw61GYnZfgOIZGXKlHHBd0xap/ZKSaHZuKdMmWKvvPKKa2CaFPztRjYCDqRpqhZR1kFp24YNG9qhQ4fO+pr69etbhQoVrHXr1u4Lc/HixdamTRsXRFStWjW4nTIao0ePDgYX6qmiL7wxY8YQcCCqdOzY0TZu3Oh6f61cudI2bNhgb731ln3++ef29NNPJ3l/N9xwg3333XfWp0+fJA0Ext9uZCPgQJp38cUXu6Bj7969iQo6lOadOHGi5c2b12Ut9CVWqlQpF0iEUqCh1G9oWw39P+Y6INLp70NVi+vXr3d/L9WqVXNVIuPGjXPZxeSoWbOmTZ482Z5//nl79913E/Ua/nYjG9PTAwAA35HhAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAAviPgAAAA5rf/A0RP2Cyh8BBjAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# ################################\n", "conf_cutoff = 0.95 # confidence (0-1); None=skip\n", "score_cutoff = None # cosine similarity score; None=skip\n", "delta_cutoff = None # top1-top2; None=skip\n", "# ################################\n", "\n", "apply_csv = \"apply/output/apply/kmer_summary_test_sequences_1.csv\"\n", "ann_path = \"../../resources/demo_sequences/learn_apply_inputs/annotations/TIGRFAMs_annotation.ann\"\n", "\n", "# --- load & prep ---\n", "df = pd.read_csv(apply_csv)\n", "df.columns = [c.strip().capitalize() for c in df.columns]\n", "# take the accession from the second '|'-delimited field of the header; fall back to the full ID\n", "df[\"Accession\"] = df[\"Sequence\"].str.split(\"|\").str[1].fillna(df[\"Sequence\"])\n", "ann = pd.read_csv(ann_path, sep=\"\\t\").rename(columns={\"id\":\"Accession\",\"family\":\"Truefamily\"})\n", "df = df.merge(ann, on=\"Accession\", how=\"left\")\n", "# known: has a true family in the annotation file; pred: Apply produced a prediction\n", "known, pred = df[\"Truefamily\"].notna(), df[\"Prediction\"].notna()\n", "\n", "# kept (passes every active cutoff) vs filtered (fails at least one)\n", "kept = pd.Series(True, index=df.index)\n", "if conf_cutoff is not None: kept &= df[\"Confidence\"] >= conf_cutoff\n", "if score_cutoff is not None: kept &= df[\"Score\"] >= score_cutoff\n", "if delta_cutoff is not None: kept &= df[\"Delta\"] >= delta_cutoff\n", "corr = known & pred & (df[\"Prediction\"] == df[\"Truefamily\"])\n", "\n", "# counts\n", "TP = int((kept & corr).sum()) # kept, known, correct family\n", "FP = int((kept & known & pred & ~corr).sum()) # kept, known, wrong family\n", "FK = int((~kept & known).sum()) # Filtered (Known)\n", "UP = int((~known & kept & pred).sum()) # Predicted (Unknown)\n", "FU = int((~known & ~kept).sum()) # Filtered (Unknown)\n", "\n", "# metrics (on knowns); recall here is TP / (TP + filtered knowns), so FP is not in its denominator\n", "prec = TP / (TP + FP) if (TP + FP) else np.nan\n", "rec = TP / (TP + FK) if (TP + FK) else np.nan\n", "print(f\"total: {len(df)} kept: {int(kept.sum())}\")\n", "print(f\"TP:{TP} FP:{FP} Filtered(Known):{FK} Predicted(Unknown):{UP} Filtered(Unknown):{FU}\")\n", "print(\"precision:\", np.round(prec,4) if np.isfinite(prec) else prec,\n", " \" 
recall:\", np.round(rec,4) if np.isfinite(rec) else rec)\n", "\n", "# --- plot (stacked bars) ---\n", "colors = {\"TP\":\"#1b9e77\",\"FP\":\"#d95f02\",\"Filtered (Known)\":\"#757575\",\n", " \"Predicted (Unknown)\":\"#4575b4\",\"Filtered (Unknown)\":\"#bdbdbd\"}\n", "fig, ax = plt.subplots(figsize=(5.5, 3))\n", "\n", "# one stacked bar for sequences with a known family\n", "bottom = 0\n", "for k, v in {\"TP\":TP,\"FP\":FP,\"Filtered (Known)\":FK}.items():\n", "    ax.bar(\"Known\", v, bottom=bottom, color=colors[k], label=k)\n", "    bottom += v\n", "\n", "# one stacked bar for sequences without a known family\n", "bottom = 0\n", "for k, v in {\"Predicted (Unknown)\":UP,\"Filtered (Unknown)\":FU}.items():\n", "    ax.bar(\"Unknown\", v, bottom=bottom, color=colors[k], label=k)\n", "    bottom += v\n", "\n", "ax.set_title(\"Annotation Prediction Results\")\n", "ax.set_ylabel(\"Sequences\")\n", "\n", "# de-duplicate legend entries by label\n", "handles, labels = ax.get_legend_handles_labels()\n", "unique = dict(zip(labels, handles))\n", "ax.legend(unique.values(), unique.keys(), ncol=2, fontsize=8)\n", "plt.tight_layout()\n", "plt.show()\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 5 }