Bioinformatics Pattern Matching Engine
A deterministic, sanitizer-clean, dependency-free sequence-alignment engine that an in-house team can drop into a pipeline tomorrow morning and run in front of regulators by tomorrow afternoon. BPME ships as source code under a perpetual, named-licensee proprietary licence. No runtime to license per-seat. No cloud component. No telemetry. No surprises in the audit.
The same engine is exposed through three coordinated interfaces (a C++20 native API, a stable C ABI, and a pure-ctypes Python binding), so it fits any stack, from a Rust CLI to a Jupyter notebook to a regulated clinical pipeline.
Why teams choose BPME over what they already have
| | What teams hit | What BPME delivers |
|---|---|---|
| Reproducibility | Different hosts produce subtly different output, which is hard to audit | Bit-identical index files across machines and runs. Every index file embeds a SHA-256 manifest of its input. A single bpme verify run confirms that an index was built from the FASTA you expect. |
| Scale on similar genomes | Aligning against many near-identical references makes RAM grow linearly with the number of genomes | Dual-mode index. The pangenome-aware storage layer scales with the similarity of the input, not the number of genomes. |
| Integration friction | Python bindings break across versions; FFI is fragile | One C ABI, opaque handles, status codes, thread-local error strings. Bindings work from any language with a C FFI. Bundled Python wrapper has zero third-party dependencies. |
| Licensing exposure | The open-source incumbent is GPL or restricted-use | Proprietary source licence with a clean grant. Ship the engine inside your product without infecting your stack. |
| Trust | Hard to defend an open-source binary in a regulated audit | Source-available. Sanitizer-clean. Fuzz-harnessed loader. Deterministic builds. Versioned, magic-numbered on-disk format. |
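
The reproducibility row describes an input hash that is recorded at build time and checked later. As a minimal sketch of that idea only (the function names below are hypothetical, and this is not BPME's manifest format or the bpme verify implementation), the check amounts to re-hashing the FASTA bytes and comparing against the recorded digest:

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(fasta_stream, recorded_hex):
    """True if the FASTA bytes hash to the digest recorded at build time."""
    return sha256_of_stream(fasta_stream) == recorded_hex.lower()


# Hypothetical demo input; a real pipeline would open the FASTA file in "rb" mode.
fasta = b">chr_demo\nACGTACGTNN\n"
recorded = sha256_of_stream(io.BytesIO(fasta))  # digest stored when the index is built

ok = matches_manifest(io.BytesIO(fasta), recorded)
tampered = matches_manifest(io.BytesIO(fasta.replace(b"NN", b"NA")), recorded)
```

Because the digest covers the raw bytes, even a single changed base or a different line ending in the FASTA produces a mismatch.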
Performance, measured
All numbers come from the bundled benchmark suite, single-threaded on commodity hardware. Every number is reproducible by the buyer's own engineers on day one.
- ~2,000× faster exact pattern search than the C++ Standard Library's substring search on the same input. Sub-microsecond per 30-mer query, independent of reference size.
- Sub-microsecond locate at standard sampling settings — the per-hit cost a downstream variant caller or coverage tool actually feels.
- Up to 2.5× lower RAM use for pangenome-style references (multiple highly similar genomes) than the classical mode, with index builds roughly 2.7× faster on the same workload.
- Lockstep batched search for high-throughput pipelines: process thousands of queries in interleaved fashion against a single read-only index. Scales near-linearly across the bundled thread pool.
- Memory-mapped indexes: queries run directly out of the OS page cache. There is no RAM ceiling on the reference. Multi-gigabyte indexes are first-class.
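
The text does not name the index structure behind these numbers. One common way to get exact-match query cost that depends on pattern length rather than reference size is FM-index backward search, so the sketch below assumes that technique purely for illustration. The construction is deliberately naive (a full, unsampled suffix array and dense occurrence tables), and none of the names correspond to BPME's API.

```python
def build_fm_index(text):
    """Naive FM-index build: full suffix array, BWT, and occurrence tables."""
    text = text + "$"  # sentinel that sorts before A, C, G, T, N
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the sentinel
    alphabet = sorted(set(text))

    # first_col[c] = number of characters in text strictly smaller than c
    first_col, total = {}, 0
    for c in alphabet:
        first_col[c] = total
        total += text.count(c)

    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first_col, occ


def locate(index, pattern):
    """Backward search: one interval update per pattern character."""
    sa, first_col, occ = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in first_col:
            return []
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("ACGTACGTAC")
hits = locate(index, "ACG")  # [0, 4]
```

The search loop runs once per pattern character regardless of how long the reference is. A production index would typically sample the suffix array and occurrence tables to save memory, which is presumably what "sampling settings" refers to above.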
Technical specifications
| Property | Value |
|---|---|
| Language | C++20 |
| Build system | CMake 3.20+ |
| Runtime dependencies | libpthread only |
| Platforms | Linux x86-64 (reference), macOS, Windows (POSIX paths via CMake) |
| Alphabet | DNA (A, C, G, T, N) with IUPAC ambiguity codes and the standard sentinel |
| Index size | Indexes are memory-mapped; tested at multi-gigabyte scale |
| Determinism | Byte-identical artefacts across hosts; no PRNG in the build path |
| File format | Versioned, magic-numbered, endian-explicit, content-hashed |
| Threading | Bundled thread pool; lock-free upgrades on the roadmap |
| Audit features | SHA-256 input hash embedded in every index; bpme verify CLI for round-trip integrity check |
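
To make the "versioned, magic-numbered, endian-explicit, content-hashed" row concrete, here is a minimal sketch of how such a header can be written and then checked through a read-only memory map. The magic bytes, field order, and field widths are assumptions invented for this example and are not BPME's on-disk layout.

```python
import hashlib
import mmap
import os
import struct
import tempfile

MAGIC = b"DEMOIDX\0"                 # hypothetical 8-byte magic, not BPME's
VERSION = 1                          # hypothetical format version
HEADER = struct.Struct("<8sIQ32s")   # "<" = little-endian, no implicit padding


def write_index(path, payload):
    """Write magic, version, payload length, and SHA-256 of the payload, then the payload."""
    header = HEADER.pack(MAGIC, VERSION, len(payload), hashlib.sha256(payload).digest())
    with open(path, "wb") as f:
        f.write(header + payload)


def open_and_check(path):
    """Map the file read-only and validate every header field before trusting it."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        if len(view) < HEADER.size:
            raise ValueError("truncated header")
        magic, version, length, digest = HEADER.unpack_from(view, 0)
        if magic != MAGIC:
            raise ValueError("bad magic number")
        if version != VERSION:
            raise ValueError("unsupported format version")
        payload = view[HEADER.size:HEADER.size + length]
        if len(payload) != length or hashlib.sha256(payload).digest() != digest:
            raise ValueError("content hash mismatch")
        return version, length


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.idx")
    write_index(path, b"ACGT" * 4)
    header_ok = open_and_check(path)

    # Flip one payload byte to show that the content hash catches corruption.
    with open(path, "r+b") as f:
        f.seek(HEADER.size)
        f.write(b"T")
    try:
        open_and_check(path)
        corruption_detected = False
    except ValueError:
        corruption_detected = True
```

Fixing the byte order in the format string is what keeps the header identical across hosts, and checking the magic and version first lets a loader reject foreign or outdated files before reading anything else.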