Performance Model

Why BlazeRules is fast — columnar SIMD, dictionary IDs, derived-column reuse — and the honest ceilings: JSON scan floor, scalar operators, in-process windows.

This page explains where BlazeRules' speed comes from, where the real ceilings are, and how to get the most out of it. The goal is an accurate mental model, not benchmark marketing.

📘

Benchmarks are measured, not promised

Throughput figures referenced in this project are measured on an Apple M1 reference machine and are characterizations, not guarantees. Your numbers depend on CPU, build flags, rule mix, batch size, and input format. Measure on your own hardware before sizing.
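One way to act on "measure on your own hardware" is a small timing harness. The sketch below is not part of BlazeRules: `measure_rows_per_second` and `stand_in_evaluate` are hypothetical names, and the stand-in workload should be replaced by a call into your own engine with a representative rule mix and batch.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_rows_per_second(
    evaluate: Callable[[Sequence], object],
    batch: Sequence,
    repeats: int = 5,
    warmup: int = 1,
) -> float:
    """Return the median rows/second over `repeats` timed calls of evaluate(batch)."""
    if not batch:
        raise ValueError("batch must contain at least one row")
    for _ in range(warmup):
        evaluate(batch)  # untimed warm-up call
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        evaluate(batch)
        elapsed = time.perf_counter() - start
        # Guard against a zero reading on very coarse timers.
        rates.append(len(batch) / max(elapsed, 1e-9))
    return statistics.median(rates)


def stand_in_evaluate(rows):
    # Hypothetical workload; swap in a call to your engine here.
    return sum(1 for r in rows if r % 7 == 0)


rate = measure_rows_per_second(stand_in_evaluate, list(range(50_000)))
print(f"{rate:,.0f} rows/s")
```

The median is used instead of the mean so that one slow run (for example, from a scheduler hiccup) does not skew the figure.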

Why it is fast

  • Columnar, morsel-parallel SIMD. Each operator scans a whole column with vectorized kernels split into morsels, instead of interpreting one record at a time.
  • Dictionary IDs. Categorical and entity columns are dictionary-encoded to int32, so set and equality tests become integer comparisons.
  • Compile-once immutable plans. Rules are compiled once into an immutable plan; evaluation does no parsing or planning on the hot path.
  • Shared-predicate reuse. Predicates shared across rules are evaluated once — a common scan is not repeated per rule.
  • Derived columns computed once per batch. Window aggregates, model_score, and vector_distance are computed up front and injected as columns, then read by ordinary operators. Zero cost when no rule uses them.
  • Bit-packed kernels and a dense window store. NEON bit-packing and a dense per-entity window store keep the vectorized path tight.
  • JSON projection pushdown. Only rule-referenced fields are materialized from JSON.
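To make two of the ideas above concrete, here is a minimal pure-Python sketch of dictionary encoding and shared-predicate reuse. It illustrates the general technique only and is not BlazeRules' implementation; every function and variable name in it is hypothetical.

```python
from array import array


def dictionary_encode(values):
    """Map each distinct string to a small integer ID; return (ids, dictionary)."""
    dictionary = {}
    ids = array("i")  # compact signed-int storage for the encoded column
    for v in values:
        # A new value receives the next unused ID; a known value reuses its ID.
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return ids, dictionary


def in_set_mask(ids, dictionary, allowed):
    """Evaluate `column IN allowed` by comparing integer IDs, not strings."""
    allowed_ids = {dictionary[v] for v in allowed if v in dictionary}
    return [i in allowed_ids for i in ids]


countries = ["US", "DE", "US", "FR", "DE", "BR"]
ids, dictionary = dictionary_encode(countries)

# Shared-predicate reuse: identical predicates are evaluated once per batch.
predicate_cache = {}


def cached_in_set(allowed):
    key = frozenset(allowed)
    if key not in predicate_cache:
        predicate_cache[key] = in_set_mask(ids, dictionary, key)
    return predicate_cache[key]


rule_a = cached_in_set({"US", "DE"})
rule_b = cached_in_set({"DE", "US"})  # same predicate, so the cached mask is reused
print(list(ids), rule_a, rule_a is rule_b)
```

The strings are hashed once during encoding; after that, every rule that tests the column works on integers, and rules that share a predicate share one result mask.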

The honest ceilings

  • JSON byte-scan floor. Even with projection pushdown, the JSON parser must scan the bytes it skips. Large unused JSON fields cost throughput; trim them or feed Arrow.
  • Scalar operator families. The vectorized families are numeric, range, set, null/empty, bitfield, and closed-enum array bitset. The cross-field, string, regex (RE2), IP/CIDR, temporal, geo, and lookup families are correct but scalar: each row is evaluated individually rather than by a vectorized kernel, so they cost more per row. Heavy use of these families lowers throughput relative to a numeric/set-only rule mix.
  • In-process windows. The window store is in-process and not durable or distributed. It is not a substitute for a stateful stream processor's exactly-once, fault-tolerant state. Keep entity affinity so each entity's history stays on one stream.
  • Single-process ONNX inference. model_score inference runs inline as a derived column; a heavy model is the most likely component to dominate per-batch time.
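The entity-affinity advice in the in-process windows item can be sketched as a stable hash partitioner placed upstream of the engine. This is an assumption about how you might route records, not a BlazeRules API; `partition_for`, `route`, and the `account` field are hypothetical.

```python
import zlib


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Stable partition index for an entity.

    zlib.crc32 gives the same value in every process, unlike the built-in
    hash(), which is randomized per process for strings.
    """
    return zlib.crc32(entity_id.encode("utf-8")) % num_partitions


def route(records, key, num_partitions):
    """Group records so that all records of one entity land in one partition."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[partition_for(rec[key], num_partitions)].append(rec)
    return partitions


events = [
    {"account": "a-17", "amount": 40},
    {"account": "b-02", "amount": 12},
    {"account": "a-17", "amount": 95},
]
streams = route(events, key="account", num_partitions=4)
```

Because the mapping depends only on the entity ID and the partition count, an entity's window history accumulates in a single process as long as the partition count stays fixed.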

Arrow versus JSON

Arrow input skips JSON parsing entirely and is the faster path when your upstream data is already typed. Use evaluate_batch(arrow_batch) for typed pipelines and reserve evaluate_ndjson(bytes) for raw JSON streams.

Practical guidance

The project's performance guidance comes down to the following:

  • Use Release builds.
  • Batch records; never call the engine per record.
  • Prefer Arrow when upstream data is already typed.
  • Use evaluate_ndjson(bytes_blob) for JSON streams.
  • Use evaluate_ndjson_padded(...) or evaluate_ndjson_file(...) when input is already simdjson-padded or memory-mapped.
  • Keep streaming batches sized for latency, commonly 2K–64K rows; use larger batches for throughput benchmarks.
  • Use OutputDetail.DECISIONS unless downstream code needs per-rule masks (OutputDetail.BITMASKS).
  • Keep partition/entity affinity for window-heavy streaming workloads.
  • Avoid huge unused JSON fields when chasing JSON throughput — skipped bytes are still bytes the parser must scan.
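The batching advice above can be sketched as a generator that groups NDJSON lines into blobs of a bounded row count. The helper `ndjson_batches` is hypothetical, and the default of 8192 rows is only one point inside the 2K–64K range mentioned above. The exact signature of evaluate_ndjson is not shown on this page, so the sketch stops at producing the blob rather than calling the engine.

```python
from typing import Iterable, Iterator, Tuple


def ndjson_batches(
    lines: Iterable[bytes], max_rows: int = 8192
) -> Iterator[Tuple[bytes, int]]:
    """Yield (blob, row_count) pairs, each blob holding up to max_rows NDJSON rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    buf = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        buf.append(line)
        if len(buf) == max_rows:
            yield b"\n".join(buf) + b"\n", len(buf)
            buf = []
    if buf:
        # Flush the final, possibly smaller, batch.
        yield b"\n".join(buf) + b"\n", len(buf)


lines = [b'{"a":%d}' % i for i in range(10)]
for blob, rows in ndjson_batches(lines, max_rows=4):
    # Hand `blob` to the engine here, one call per batch rather than per record.
    print(rows, len(blob))
```

A smaller `max_rows` lowers per-batch latency; a larger one amortizes per-call overhead, which is why throughput benchmarks tend to use bigger batches.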

A note on AVX-512

🚧

AVX-512 is opt-in for a reason

AVX-512 kernels are compiled into full builds but are still gated at runtime by CPU/OS feature checks. Some server CPUs reduce clock frequency under wide vectors and can end up slower overall, so measure against AVX2 on your target before forcing AVX-512.

Where to go next