Signals

active

Metadata governance for the open enterprise data stack.

Signals is a metadata governance platform for the open enterprise data stack. Enterprise warehouses accumulate thousands of tables and columns with inconsistent naming, no sensitivity labels, and no formal governance metadata. Manual classification is expensive, error-prone, and stale almost as soon as it is finished. Flat confidence scores from ML classifiers conflate “definitely a Payment Card Number” with “definitely some kind of Payment Information, but unsure which sub-type”, a distinction that matters when a policy engine has to act.

Approach

The classification pipeline combines five evidence sources (embedding similarity, gradient-boosted prediction, regex pattern detection, column name matching, and short-text SVM) through Dempster–Shafer belief functions rather than simple score averaging. Each source contributes a mass function over a restricted frame of discernment derived from the taxonomy hierarchy. Dempster’s rule yields a belief interval [Bel, Pl] at every node, separating what the evidence commits to (belief) from what it merely fails to contradict (plausibility). The interval width quantifies epistemic uncertainty; the Dempster conflict between sources flags disagreement that flat scores suppress.
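As a rough illustration of the fusion step, the sketch below applies Dempster’s rule to two mass functions and reads off a belief interval. It is not the Signals implementation. The three-element frame, the category names, and the mass values are assumptions made up for the example, and only two of the five sources are shown.

```python
from itertools import product

# Hypothetical frame of discernment (names are illustrative, not from the taxonomy).
CARD = frozenset({"PaymentCardNumber"})
IBAN = frozenset({"BankAccountNumber"})
PAYMENT = CARD | IBAN                      # parent node: "some Payment Information"
FRAME = PAYMENT | {"Other"}                # total ignorance


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections, renormalize."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb            # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}, conflict


def belief(m, hypothesis):
    """Mass committed to subsets of the hypothesis."""
    return sum(w for s, w in m.items() if s <= hypothesis)


def plausibility(m, hypothesis):
    """Mass not contradicting the hypothesis (focal sets that intersect it)."""
    return sum(w for s, w in m.items() if s & hypothesis)


# Assumed outputs of two sources: a regex detector and a column-name matcher.
m_regex = {CARD: 0.6, FRAME: 0.4}
m_name = {IBAN: 0.2, PAYMENT: 0.5, FRAME: 0.3}

fused, conflict = combine(m_regex, m_name)
for label, node in (("card", CARD), ("payment", PAYMENT)):
    print(label, round(belief(fused, node), 3), round(plausibility(fused, node), 3))
print("conflict", round(conflict, 3))
```

With these assumed inputs the leaf node gets a wide interval (roughly 0.55 to 0.91) while the parent node gets a narrow one (roughly 0.86 to 1.0), and the conflict mass of 0.12 records that the two sources partly disagree. A single averaged score would hide both facts.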

Architecture

Three operational layers, all replacing proprietary or Hadoop-coupled components with open alternatives:

Query stack (HMS-free). Impala and Kudu without the Hive Metastore, HDFS, or HBase. A PostgreSQL catalog registry manages table metadata. Kudu provides upsert-heavy hot-tier storage; Iceberg via a Polaris REST catalog provides warm-tier analytics; Impala provides transparent SQL across both tiers.
Metadata governance. Apache Atlas runs against a PostgreSQL + AGE graph backend, replacing the JanusGraph/HBase/Solr trio. A catalog bridge registers Impala-managed Kudu tables as Atlas entities; classification results, confidence scores, and evidence strings are written back to Atlas for policy consumption.
Classification pipeline. The sigint package extracts twelve discrete features per column, runs the five evidence sources, combines them via Dempster–Shafer fusion, and quantifies per-feature marginal contribution through SAGE Shapley values.
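
To give a feel for per-feature marginal contribution, the sketch below computes exact Shapley values over a toy value function. It is not the SAGE estimator and not the sigint code. SAGE is usually described as using the reduction in expected loss for a feature subset as the value function and approximating the sum by sampling. Here the value function, the three feature names, and the numbers are all hypothetical, and the sum is enumerated exactly because the feature set is tiny.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names; the real pipeline extracts twelve features per column.
FEATURES = ["regex_hit", "name_tokens", "value_length"]


def toy_value(subset):
    """Assumed stand-in for 'loss reduction when only these features are available'."""
    v = 0.0
    if "regex_hit" in subset:
        v += 0.5
    if "name_tokens" in subset:
        v += 0.2
    if {"regex_hit", "name_tokens"} <= subset:
        v += 0.1                            # interaction between the two features
    return v


def shapley_values(features, value):
    """Exact Shapley values: weighted average of marginal gains over all subsets."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


phi = shapley_values(FEATURES, toy_value)
for name, contribution in phi.items():
    print(name, round(contribution, 3))
```

The contributions sum to the value of the full feature set, and the interaction term is split evenly between the two features that produce it. Exact enumeration grows exponentially with the number of features, which is why sampling-based estimators are the usual choice beyond a handful.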

Ontology

The SIGDG ontology (Signals Data Governance) is grounded in BFO 2020. Information entities are generically dependent continuants; sensitivity levels are qualities that inhere in them; data subject roles are roles, which let the same column carry a different sensitivity depending on whose data it is. The ontology defines forty-two categories across six top-level kinds (Identity, Personal, Business, System, Transaction, Transformation), with thirty leaf nodes, and four ordered sensitivity levels (Public → Internal → Confidential → Restricted).
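
The ordering of sensitivity levels and the role-dependent sensitivity can be sketched as follows. This is only an illustration: the four level names come from the text above, while the category name, the role names, the lookup table, and the "take the strictest level" rule are assumptions rather than part of SIGDG.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Ordered levels, so comparisons follow Public < Internal < Confidential < Restricted."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical (category, data subject role) -> sensitivity table.
ROLE_SENSITIVITY = {
    ("Email Address", "employee"): Sensitivity.INTERNAL,
    ("Email Address", "customer"): Sensitivity.CONFIDENTIAL,
}


def effective_sensitivity(category, roles, default=Sensitivity.INTERNAL):
    """Assumed policy: a column takes the strictest level among its data subject roles."""
    levels = [ROLE_SENSITIVITY.get((category, role), default) for role in roles]
    return max(levels, default=default)


print(effective_sensitivity("Email Address", ["employee"]).name)
print(effective_sensitivity("Email Address", ["employee", "customer"]).name)
```

Modelling the levels as an ordered enumeration keeps policy checks such as "Confidential or stricter" to a single comparison.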

Position in the architecture

Signals is the platform; the other data-governance projects compose around it. Atelier is the agentic workbench that fronts the same pipeline interactively, adding an LLM-driven sixth source for convergence on high-uncertainty columns. Aegir contributes a learned cross-table signal on top of the per-column annotations. SDG corpora publishes the shared ontology, SKOS vocabulary, and reference relational footprint that all three consume.