Backtesting

Replay historical Parquet or Arrow batches through a candidate ruleset and compare it against the one you run today.

Because BlazeRules runs the same engine live and offline, you can replay your production rules over historical data without a second implementation. Backtesting means running a candidate ruleset over historical records, then comparing its decisions and scores against the current ruleset before you promote the change.

📘

One engine for stream and backtest

There is no separate "backtest mode". You load a ruleset and call the same evaluate_batch / evaluate_ndjson methods you use live. The only difference is the source of records: historical Parquet or Arrow instead of a live queue.

The idea

A typical change-safety workflow looks like this:

  1. Read historical records as Arrow RecordBatch objects.
  2. Evaluate each batch through two engines — one loaded with the current production rules, one with your candidate rules.
  3. Compare the per-record decisions, scores, and match counts. Differences are exactly the records your change would have routed differently.

The result fields you compare are the standard ones documented in Observability: decisions, decision_codes, scores, risk_bands, winning_rule_ids, and match_counts. The grouped-index helpers make a diff cheap:

import blazerules
import blazerules_io  # full builds only; any source of Arrow RecordBatch objects works

# One engine per ruleset, so both see exactly the same records.
current = blazerules.RuleEngine()
current.load_rules("rules.yaml")

candidate = blazerules.RuleEngine()
candidate.load_rules("rules-candidate.yaml")

# For each historical batch:
for batch in blazerules_io.read_record_batches("history.parquet", batch_size=16384):
    cur = current.evaluate_batch(batch)
    cand = candidate.evaluate_batch(batch)

    # Records the candidate would route differently:
    cur_groups = cur.grouped_decision_indices()
    cand_groups = cand.grouped_decision_indices()
    # Compare cur_groups vs cand_groups, or diff cur.decisions vs cand.decisions.
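
One way to carry out that comparison is a positional diff of the two decision lists. The sketch below assumes that `cur.decisions` and `cand.decisions` can be iterated as one value per record, in input order; that shape is an assumption, `diff_decisions` is a hypothetical helper rather than part of BlazeRules, and the decision labels are made up for illustration.

```python
from collections import Counter


def diff_decisions(current, candidate):
    """Return changed row indices and a count of (old, new) transitions.

    Hypothetical helper: both arguments are assumed to be equal-length
    sequences of per-record decisions in the same row order.
    """
    current = list(current)
    candidate = list(candidate)
    if len(current) != len(candidate):
        raise ValueError("decision lists must cover the same records")

    # Rows where the two rulesets disagree.
    changed = [i for i, (a, b) in enumerate(zip(current, candidate)) if a != b]
    # How often each (current decision, candidate decision) pair occurs.
    transitions = Counter((current[i], candidate[i]) for i in changed)
    return changed, transitions


# Stand-ins for list(cur.decisions) and list(cand.decisions).
cur_decisions = ["allow", "review", "allow", "block"]
cand_decisions = ["allow", "block", "review", "block"]

changed, transitions = diff_decisions(cur_decisions, cand_decisions)
print(changed)            # [1, 2]
print(dict(transitions))
```

The transition counts are often more useful than the raw index list: they show at a glance whether a candidate mostly escalates or mostly relaxes decisions.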

Reading historical data

Use whatever already produces typed Arrow batches. In full builds, blazerules_io.read_record_batches(path, batch_size=...) reads files into batches; in custom lean builds, read Parquet with pyarrow and pass the batches in.

import blazerules
import blazerules_io

engine = blazerules.RuleEngine()
engine.load_rules("rules-candidate.yaml")

for batch in blazerules_io.read_record_batches("history.parquet", batch_size=16384):
    result = engine.evaluate_batch(batch)
    # accumulate result.decisions / result.scores for comparison
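
The accumulation step can be sketched as follows. `iter_parquet_batches` shows the lean-build path using pyarrow's `ParquetFile.iter_batches`, and `accumulate_results` is a hypothetical helper. It assumes `result.decisions` and `result.scores` are iterable with one entry per record, which this page does not state explicitly.

```python
def iter_parquet_batches(path, batch_size=16384):
    """Yield Arrow RecordBatch objects from one Parquet file via pyarrow."""
    import pyarrow.parquet as pq  # imported lazily; only needed on this path

    yield from pq.ParquetFile(path).iter_batches(batch_size=batch_size)


def accumulate_results(batches, evaluate):
    """Evaluate every batch and collect per-record outputs in replay order.

    `evaluate` is any callable returning an object with `decisions` and
    `scores` attributes (assumed iterable, one entry per record), for
    example `engine.evaluate_batch`.
    """
    decisions, scores = [], []
    for batch in batches:
        result = evaluate(batch)
        decisions.extend(result.decisions)
        scores.extend(result.scores)
    return decisions, scores
```

A call such as `accumulate_results(iter_parquet_batches("history.parquet"), engine.evaluate_batch)` would then give two flat lists whose positions line up with the order in which records were replayed, so the same call against a second engine yields directly comparable output.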

Native backtest API

For Parquet history, use the built-in A/B comparison API.

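# `engine` is a blazerules.RuleEngine instance, as in the earlier examples.
# rules_a is the current ruleset, rules_b the candidate.
report = engine.backtest(
    parquet_path=["history/day-1.parquet", "history/day-2.parquet"],
    rules_a="rules.yaml",
    rules_b="rules-candidate.yaml",
    label_column="fraud_label",
)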

print(report.total_records)
print(report.fire_rate_a, report.fire_rate_b)
print(report.new_positives, report.lost_positives)
print(report.agreement_rate)
print(report.precision_a, report.recall_a)
print(report.precision_b, report.recall_b)

The C++ overloads are backtest(const BacktestConfig&) and backtest(parquet_paths, rules_a, rules_b, label_column). BacktestConfig contains parquet_paths, rules_file_a, rules_file_b, label_column, and batch_size.
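
This page does not define the report fields. As a reading aid only, the sketch below computes quantities with the same names under conventional definitions: a record "fires" when a ruleset flags it, new and lost positives are records flagged by only B or only A, and precision and recall are measured against the label column. These definitions are assumptions and may differ from what the engine reports; `compare_rulesets` is a hypothetical helper.

```python
def compare_rulesets(fired_a, fired_b, labels):
    """Compute A/B comparison metrics from per-record booleans.

    fired_a / fired_b: whether ruleset A / B flagged each record.
    labels: ground truth (True = positive), e.g. read from a label column.
    All definitions here are assumed conventions, not the engine's.
    """
    n = len(labels)
    if n == 0 or not (len(fired_a) == len(fired_b) == n):
        raise ValueError("inputs must be non-empty and of equal length")

    def precision_recall(fired):
        true_pos = sum(1 for f, y in zip(fired, labels) if f and y)
        flagged = sum(1 for f in fired if f)
        positives = sum(1 for y in labels if y)
        precision = true_pos / flagged if flagged else 0.0
        recall = true_pos / positives if positives else 0.0
        return precision, recall

    precision_a, recall_a = precision_recall(fired_a)
    precision_b, recall_b = precision_recall(fired_b)
    pairs = list(zip(fired_a, fired_b))
    return {
        "total_records": n,
        "fire_rate_a": sum(1 for a, _ in pairs if a) / n,
        "fire_rate_b": sum(1 for _, b in pairs if b) / n,
        "new_positives": sum(1 for a, b in pairs if b and not a),
        "lost_positives": sum(1 for a, b in pairs if a and not b),
        "agreement_rate": sum(1 for a, b in pairs if a == b) / n,
        "precision_a": precision_a,
        "recall_a": recall_a,
        "precision_b": precision_b,
        "recall_b": recall_b,
    }


# Five made-up records: A and B each flag two, but only B's are both true positives.
metrics = compare_rulesets(
    fired_a=[True, False, True, False, False],
    fired_b=[True, True, False, False, False],
    labels=[True, True, False, False, False],
)
print(metrics)
```

In this toy data both rulesets fire at the same rate, yet they agree on only three of five records, which is why fire rate alone is a weak signal when judging a candidate.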

📘

Shadow rules

A rule can carry a shadow field, which lets you evaluate a rule's effect without letting it drive the final decision. This is a natural fit for backtesting a single new rule inside an otherwise-unchanged ruleset. Use match_counts, winning_rule_ids, and grouped decision indices to see how often the shadow rule fired and which rows would have changed under a promoted candidate.

🚧

Replay window rules in chronological order

Window rules read prior-batch history, inject derived window columns, evaluate the current batch, then commit that batch for future batches. A backtest of any window-based rule is therefore only correct if you feed batches in chronological order and preserve entity affinity, exactly as the data arrived live. By default, repeated rows for the same entity within one batch do not see earlier rows from that batch. See the engine's window semantics for the full ordering contract.
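
To see why ordering matters, the toy model below mimics the read, evaluate, commit cycle with a single window feature: the number of events seen for an entity in earlier batches. It is an illustration under simplified assumptions, not the engine's implementation; `ToyCountWindow`, `replay`, and the entity IDs are all hypothetical.

```python
class ToyCountWindow:
    """Toy window feature: events seen per entity in prior batches only."""

    def __init__(self):
        self._history = {}

    def process(self, entity_ids):
        # 1. Read: derive the window column from committed history only,
        #    so rows in this batch do not see each other.
        prior_counts = [self._history.get(e, 0) for e in entity_ids]
        # 2. Evaluate: a real engine would run its rules on prior_counts here.
        # 3. Commit: make this batch visible to later batches.
        for e in entity_ids:
            self._history[e] = self._history.get(e, 0) + 1
        return prior_counts


def replay(batches):
    """Feed batches through a fresh window in the order given."""
    window = ToyCountWindow()
    return [window.process(batch) for batch in batches]


day_1 = ["acct-1", "acct-1", "acct-2"]
day_2 = ["acct-1"]

in_order = replay([day_1, day_2])
out_of_order = replay([day_2, day_1])

print(in_order)      # [[0, 0, 0], [2]]
print(out_of_order)  # [[0], [1, 1, 0]]
```

Both `acct-1` rows in `day_1` see a count of 0 when replayed in order, because neither is committed until the batch finishes. Swapping the two days changes every derived value for `acct-1`, so any rule thresholding on that column would fire on different records than it did live.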
