Data Model and Schema

BlazeRules column types, how schema is inferred from the first batch, optional field hints, nested and dotted fields, and dictionary encoding.

This page describes how BlazeRules sees your data: the column types it works with, how it infers a schema when you do not provide one, the optional fields: hints, and how nested and categorical data are handled.

BlazeRules is columnar: a batch is a set of typed columns, and every operator runs against a column. Whether the input is JSON or Arrow, the engine evaluates over the same typed columnar model.

Column types

BlazeRules works with the following column types. In a rule file's fields: block they are written in lowercase (float32, entity_key, and so on).

TypeUse for
float32Single-precision numeric values (amounts, rates, coordinates).
float64Double-precision numeric values (e.g. latitude/longitude).
int3232-bit integers; also the encoding target for categoricals and entities.
int6464-bit integers (large IDs, BINs, bitfields).
categoricalA closed or open set of string labels, dictionary-encoded internally.
entity_keyAn identity column (e.g. card token) used for windows and grouping.
timestamp_msEpoch milliseconds for temporal operators and windows.
stringFree text for string and regex operators.
booleanTrue/false values.

In Python the enum members are uppercase: ColumnType.FLOAT32, ColumnType.ENTITY_KEY, ColumnType.TIMESTAMP_MS, ColumnType.BOOLEAN, and so on. The YAML spelling is lowercase.

Schema inference

You do not have to declare a schema. The lifecycle is:

  1. Unbound. load_rules(...) parses the rules; no schema is bound yet.
  2. Sample. The first evaluated batch samples the rule-referenced fields only.
  3. Infer + bind. Supported types are inferred from those samples and the schema is bound.
  4. Compile + activate. The plan compiles against the bound schema and the rules go live.

If you want explicit control instead of inference, construct the engine with a schema built from field definitions:

import blazerules

schema = [
    blazerules.Field("card_token", blazerules.ColumnType.ENTITY_KEY, nullable=False),
    blazerules.Field("amount", blazerules.ColumnType.FLOAT32, nullable=False),
    blazerules.Field("device_type", blazerules.ColumnType.CATEGORICAL),
]
config = blazerules.EngineConfig()
engine = blazerules.RuleEngine(schema, config)

Field(name, type, nullable=True) returns a FieldSpec. ENTITY_KEY fields automatically set is_entity_field.

Optional field hints

The top-level fields: block in a rule file is a set of hints, not a mandatory user schema. Hints are most useful for entity keys, timestamps, nullability, and closed categorical value sets. This is the fields: block from the canonical sample rule file:

fields:
  card_token: {type: entity_key, nullable: false}
  amount: {type: float32, nullable: false}
  account_age_days: {type: int32}
  hour_of_day: {type: float32}
  country_code:
    type: categorical
    values: [US, GB, IN, DE, BR, CN, RU]
  device_type:
    type: categorical
    values: [ios, android, web, emulator]
  event_ts_ms: {type: timestamp_ms}
  optional_note: {type: string, nullable: true}
  merchant.risk.score: {type: float32}

Note the patterns: an entity_key for grouping/windows, categorical fields with an explicit closed values: list, a nullable: true field, a timestamp_ms column, and a dotted name (merchant.risk.score) for a nested field.

Nested and dotted fields

Nested records are addressed with dotted field names. The same dotted path works for nested JSON objects and for Arrow struct fields:

{"merchant":{"risk":{"score":91}}}
conditions:
  field: merchant.risk.score
  op: gt
  value: 50

For arrays of objects, array_any applies a sub-condition with same-element semantics — a record matches only when a single array element satisfies all of the inner conditions at once.

Dictionary encoding

categorical and entity_key columns are dictionary-encoded into int32 IDs. This is why set and equality operators on these columns run as fast integer comparisons rather than string comparisons, and it is part of why the categorical and set operator families are vectorized.

Gotcha: ambiguous fields infer as INT32

Pin ambiguous fields with hints

When a field is unbound and its sampled values are ambiguous, inference can land on INT32 — which will reject or mis-handle values you intended as another type. If a field is an identity, a category, a timestamp, or nullable, declare it in the fields: block. The hints above exist precisely to remove this ambiguity, and they make rule loading deterministic across environments.

Where to go next