Slotting Architecture · 13 min read

Transforming Legacy ERP Sales Logs for Velocity Targeting

Older ERPs — mainframe AS/400 order systems, first-generation SAP exports, home-grown distribution suites — do not emit clean transactional feeds. They drop nightly fixed-width flat files with composite item keys, unit-of-measure buried in a free-text description column, return lines flagged only by a prefix on the item code, and two or three character encodings mixed inside one file. Feeding that straight into a velocity baseline produces phantom hyper-movers and silently misallocated pick faces. This page shows the exact transform that turns one legacy dump into canonical, base-unit velocity facts. It extends the Sales History Data Mapping layer for the ugly older-ERP cases the general mapper assumes away, and sits inside the wider Velocity Data Ingestion & WMS Sync Pipelines architecture.

Prerequisites

Before running the transform, confirm you have:

A copy of the raw legacy export, not a re-saved spreadsheet. Opening the dump in Excel and re-exporting silently re-encodes it and strips leading zeros from numeric item codes — always work from the original byte stream.
The fixed-width column map for that specific report layout (start/end offsets per field). Legacy layouts are report-specific; get the copybook or record layout from the ERP admin rather than eyeballing column boundaries.
The current item master pack factors (EA/CS/PLT → base eaches) so free-text UOM resolves against live values, not hard-coded guesses. This is the same conversion the Schema Validation for Inventory Feeds contract expects downstream.
Python 3.11+ with pydantic>=2.5 (optional, for stricter fact validation) and pytest>=8.0 for the verification block. The transform itself uses only the standard library.
Write access to a quarantine directory on the ingestion host so rejected lines are auditable rather than dropped.

Configuration block

Externalize every layout-specific value; the transform reads this config so a new report layout is a config change, not a code change.

# legacy_erp_transform.yaml
encodings: ["utf-8", "latin-1"]   # tried in order, strict mode; first that fully decodes wins
date_format: "%Y%m%d"             # legacy fixed-width dumps rarely carry date separators
default_uom: "EA"                 # applied only when the free-text note carries no UOM token
return_prefix: "R"                # legacy return lines prefix the item code, e.g. R000442
composite_key: "{wh}-{item}"      # canonical SKU joins warehouse + item to disambiguate reuse
fixed_widths:                     # 0-indexed [start, end) slices into each flat-file record
  warehouse: [0, 4]
  item:      [4, 16]
  qty:       [16, 24]
  uom_note:  [24, 44]
  date:      [44, 52]
uom_overrides:                    # item-master pack factors -> base eaches
  CS: 12
  PLT: 240
  BX: 6

The equivalent Python dict, loaded once at startup:

CFG = {
    "encodings": ["utf-8", "latin-1"],
    "date_format": "%Y%m%d",
    "default_uom": "EA",
    "return_prefix": "R",
    "composite_key": "{wh}-{item}",
    "fixed_widths": {
        "warehouse": (0, 4),
        "item": (4, 16),
        "qty": (16, 24),
        "uom_note": (24, 44),
        "date": (44, 52),
    },
    "uom_overrides": {"CS": 12, "PLT": 240, "BX": 6},
}

Implementation

A single focused transform: decode with encoding fallback, slice the fixed-width record, resolve the composite SKU key, parse UOM out of free text, and emit a signed base-unit VelocityFact. Anything it cannot resolve is quarantined (logged and skipped), never guessed.

import logging
import re
from dataclasses import dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Iterator

logger = logging.getLogger("legacy_erp")

UOM_TOKEN_RE = re.compile(r"\b(EA|CS|CASE|PLT|PALLET|BX|BOX)\b", re.IGNORECASE)
UOM_ALIASES = {"CASE": "CS", "PALLET": "PLT", "BOX": "BX", "EACH": "EA"}


@dataclass(frozen=True)
class VelocityFact:
    sku: str          # canonical composite key, e.g. "WH01-000442"
    trans_date: date
    units_base: int   # signed base units (each); returns are negative
    source_line: int  # preserved for quarantine audit


def _decode(path: Path, encodings: tuple[str, ...]) -> Iterator[tuple[int, str]]:
    """Yield (line_no, text) using the first encoding that decodes the whole file."""
    for enc in encodings:
        try:
            with open(path, encoding=enc, errors="strict", newline="") as fh:
                for n, line in enumerate(fh, start=1):
                    yield n, line.rstrip("\r\n")
            return
        except UnicodeDecodeError:
            logger.warning("decode fallback: %s failed on %s", enc, path.name)
    raise ValueError(f"undecodable file: {path.name}")


def transform_legacy_log(path: Path, cfg: dict) -> Iterator[VelocityFact]:
    """Parse one legacy fixed-width ERP sales dump into canonical VelocityFact rows."""
    w = cfg["fixed_widths"]
    uom_map = {"EA": 1, **cfg["uom_overrides"]}
    for lineno, raw in _decode(path, tuple(cfg["encodings"])):
        if not raw.strip():
            continue
        try:
            wh = raw[slice(*w["warehouse"])].strip()
            item = raw[slice(*w["item"])].strip()
            qty = int(raw[slice(*w["qty"])].strip() or 0)
            note = raw[slice(*w["uom_note"])].strip()
            d = datetime.strptime(raw[slice(*w["date"])].strip(), cfg["date_format"]).date()
        except (ValueError, KeyError) as exc:
            logger.warning("QUARANTINE line %d (%s): parse error %s", lineno, path.name, exc)
            continue
        token = UOM_TOKEN_RE.search(note)
        uom = UOM_ALIASES.get(token.group(1).upper(), token.group(1).upper()) if token else cfg["default_uom"]
        if uom not in uom_map:
            logger.warning("QUARANTINE line %d: unknown UOM %r", lineno, uom)
            continue
        is_return = qty < 0 or item.startswith(cfg["return_prefix"])
        units_base = (-1 if is_return else 1) * abs(qty) * uom_map[uom]
        clean_item = item.lstrip(cfg["return_prefix"]) if is_return else item
        sku = cfg["composite_key"].format(wh=wh.upper(), item=clean_item.upper())
        yield VelocityFact(sku=sku, trans_date=d, units_base=units_base, source_line=lineno)

Step-by-step walkthrough

Encoding fallback (encodings). _decode opens the file in strict mode under each encoding in order. Strict mode raises on the first bad byte, so a file that is 90% UTF-8 with a stray latin-1 accented character falls through cleanly to latin-1 rather than corrupting silently. The first encoding that decodes the entire file wins.
Column slicing (fixed_widths). Each field is a [start, end) slice into the raw record. Legacy dumps have no delimiters, so the offsets are the contract — a wrong boundary shifts every field right and misaligns dates with quantities. All slices come from config, never literals.
Quantity and date parsing. qty and trans_date are the two fields most likely to be malformed in old data (blank pads, 00000000 dates). Any ValueError here routes the whole line to quarantine with its source_line preserved, rather than emitting a fact with a defaulted date.
Free-text UOM resolution (default_uom, uom_overrides). UOM_TOKEN_RE scans the description column for a unit token; UOM_ALIASES folds CASE→CS, PALLET→PLT. The token maps through uom_map (item-master pack factors plus the implicit EA: 1) to base eaches. An unrecognized UOM is quarantined — never assumed to be an each.
Return detection (return_prefix). A negative quantity or an item code carrying the return prefix marks the line as a return; units_base is signed negative so downstream netting subtracts it, and the prefix is stripped before building the key so returns and sales collapse to the same SKU.
Composite key (composite_key). Warehouse and item are joined into one canonical SKU so the same item number reused across facilities does not merge into a single phantom mover. This key is exactly the identifier the SKU Velocity Taxonomy Design layer expects when it assigns tiers.

Verification

Feed the transform a hand-built record with a known answer and assert the emitted fact. The sample line is a 20-case shipment of item 000442 at warehouse WH01, described as SHIPPED AS CASE, dated 2026-06-15:

import logging
from datetime import date
from pathlib import Path


def test_case_line_converts_to_base_units(tmp_path: Path) -> None:
    record = "WH01" + "000442".ljust(12) + "20".rjust(8) + "SHIPPED AS CASE".ljust(20) + "20260615"
    assert len(record) == 52  # slices must tile the record exactly
    dump = tmp_path / "sales_20260615.dat"
    dump.write_text(record + "\n", encoding="utf-8")

    facts = list(transform_legacy_log(dump, CFG))

    assert len(facts) == 1
    fact = facts[0]
    assert fact.sku == "WH01-000442"
    assert fact.trans_date == date(2026, 6, 15)
    assert fact.units_base == 240      # 20 cases * pack factor 12
    logging.getLogger("legacy_erp").info("verified fact: %s", fact)


# Expected output:
# VelocityFact(sku='WH01-000442', trans_date=datetime.date(2026, 6, 15),
#              units_base=240, source_line=1)

A return line such as R000442 with quantity 2 and note RETURN CASE yields units_base == -24 (2 cases × 12, negated) against the same WH01-000442 key, so netting cancels it against the sale.

Common pitfalls

Re-saving the dump before parsing. Opening the flat file in a spreadsheet re-encodes it and drops leading zeros from numeric item codes, so 000442 becomes 442 and silently forks into a second SKU. Always transform the original bytes.
Trusting a header UOM instead of the line note. Legacy reports sometimes print one UOM in the report header but override it per line in the free-text column. Parse UOM at the line level, as UOM_TOKEN_RE does, or every mixed-pack order distorts.
Column drift after an ERP patch. A vendor patch that widens the description field by two characters shifts every downstream slice. Keep the assert len(record) == expected_width guard in the verification suite and run it against a fresh sample after any ERP upgrade.
Netting returns before keys are normalized. If the return prefix is not stripped before building the composite key, R000442 and 000442 become different SKUs and the return never cancels the sale — the SKU reads as a permanent fast mover.

FAQ

Why fixed-width slicing instead of a CSV parser?

Because most legacy mainframe and AS/400 exports genuinely have no delimiter — the record layout is byte offsets, and any comma or pipe inside a description field would break a delimited parser. When a given report is actually delimited, feed it through the general mapper in Sales History Data Mapping instead; this page exists for the flat-file case that mapper assumes away.

How do I handle a file that mixes UTF-8 and latin-1 in one record?

Strict-mode fallback decodes the file under one encoding for the whole read; it will not silently mix. If a single record truly contains bytes valid in neither utf-8 nor latin-1, quarantine that line and inspect it — a genuinely mixed record almost always signals a corrupted transfer (an FTP run in text mode instead of binary), and the fix belongs at extraction, not in the parser.

Where do the emitted VelocityFact rows go next?

Straight into the rolling-window scoring in Async Batch Processing for Velocity, which averages and decay-weights the base-unit series into a velocity coefficient. This transform’s only job is to make every legacy line comparable to every modern one before that scoring runs.

Sales History Data Mapping — the parent normalization layer this transform extends for older ERPs.
Schema Validation for Inventory Feeds — the structural contract the emitted facts must satisfy downstream.
Async Batch Processing for Velocity — the scoring layer that consumes these base-unit facts.
Velocity Data Ingestion & WMS Sync Pipelines — the parent architecture this transform feeds.