# Smart Building Digital Twin Blueprint

Blueprint Series — Next-Generation Smart Building & Smart Space Solutions Powered by 4D Live Digital Twins
Edge Vision + Spatial Computing with Intel SceneScape

A physical space is more than its floor plan. The built environment — walls, doors, rooms — provides structure. Cameras and sensors observe it. People move through it; objects are carried, placed, and left behind. Each of these elements generates data, but in isolation, none is aware of the others. What if all of it could be unified into a single, continuously updated digital model of the space — one expressed in real-world metric coordinates, queryable in real time, and capable of reasoning about what is actually happening?

That is the premise of the Smart Building Digital Twin. This blueprint, built around a real indoor dataset — a shared showcase area and a medium-sized conference room — demonstrates how edge vision AI and spatial computing with Intel SceneScape can power a new generation of smart building use cases that would be impossible with video analytics alone.

The result is a “4D” digital twin: a metric 3D scene that evolves over time, where every tracked entity (person, object, door) has a precise position, velocity, and history — fused with environmental and identity sensor data — and where intelligent agents operate on scene data, not pixels.

## The Dataset and Physical Environment

The sample dataset exercises the blueprint across two distinct zones:

| Zone | Description |
| --- | --- |
| Showcase Area | Shared open space with entry point. Monitored for access control, tailgating, and object security. |
| Conference Room | Medium-sized enclosed meeting room. Monitored for occupancy, safety, and environmental comfort. |

## 3D Scene Foundation: Spatial Scan

The space was 3D-scanned using a photogrammetry or LiDAR-based capture tool — any application capable of producing a dense mesh and textured point cloud of an indoor environment can serve this role (examples include Polycam*, Matterport*, and similar tools). This scan serves as the geometric foundation of the digital twin and the reference frame for camera calibration in SceneScape.

By anchoring every camera’s extrinsic parameters to the same real-world coordinate system, SceneScape transforms 2D pixel detections from multiple cameras into a unified metric-space scene — the cornerstone of all downstream analytics.
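To make this concrete, here is a minimal sketch of how an extrinsic calibration maps a detection from a camera's coordinate frame into the shared world frame. The 4x4 matrix values are invented for illustration and are not taken from any real calibration:

```python
# Sketch: applying a camera's extrinsic (camera-to-world) transform so a
# detection is expressed in the shared metric world frame.

def transform_point(extrinsic, point_cam):
    """Apply a 4x4 camera-to-world matrix to a 3D point in the camera frame."""
    x, y, z = point_cam
    p = (x, y, z, 1.0)  # homogeneous coordinates
    world = [sum(extrinsic[r][c] * p[c] for c in range(4)) for r in range(3)]
    return tuple(world)

# Hypothetical calibration: a camera mounted 2 m up, looking along world +X.
CAM1_TO_WORLD = [
    [0.0, 0.0, 1.0, 0.0],   # camera forward (+z) maps to world +X
    [-1.0, 0.0, 0.0, 3.5],  # camera right (+x) maps to world -Y
    [0.0, -1.0, 0.0, 2.0],  # camera down (+y) maps to world -Z
    [0.0, 0.0, 0.0, 1.0],
]

# A person detected 4 m in front of the camera, on the optical axis:
world_pos = transform_point(CAM1_TO_WORLD, (0.0, 0.0, 4.0))
```

Because every camera shares the same world frame, detections of the same person from different cameras land at (nearly) the same world coordinates, which is what makes cross-camera fusion a distance comparison rather than a pixel problem.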

## Sensor Complement

| Sensor | Zone | Integration Type |
| --- | --- | --- |
| Light sensor | Conference Room | SceneScape Environmental Sensor |
| Temperature sensor | Conference Room | SceneScape Environmental Sensor |
| Noise level sensor | Conference Room | SceneScape Environmental Sensor |
| Badge reader | Showcase Area Entry | SceneScape Attribute Sensor |
| Intel RealSense FaceID | Showcase Area Entry | SceneScape Attribute Sensor |

### Privacy-Preserving Identity Sensors

Both the badge reader and the RealSense FaceID sensor are configured as attribute sensors — they attach metadata to nearby tracks rather than operating as standalone identity systems.

Critically, both sensors are obfuscated at the source. Each generates an anonymized, one-way hash that is unique to the credential (badge ID or faceprint) but from which the original cannot be recovered. The system never stores or transmits raw badge IDs or biometric data. Instead, it learns patterns — for example, that hash B-4a7f and hash F-9c2d consistently appear together at the entry. Over time, this enables long-term anomaly detection such as badge switching without any personally identifiable information leaving the edge.
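As an illustration of obfuscation at the source, the sketch below derives a one-way token from a credential using salted SHA-256. The `B-`/`F-` prefixes, the token length, and the salting scheme are assumptions for this example, not a scheme mandated by SceneScape:

```python
# Sketch of one-way credential hashing at the sensor. The same credential
# always yields the same token, but the raw value cannot be recovered.
import hashlib

SITE_SALT = b"example-site-salt"  # hypothetical per-deployment secret

def credential_hash(kind: str, raw_credential: bytes) -> str:
    """Return a stable, irreversible token for a badge ID or faceprint."""
    digest = hashlib.sha256(SITE_SALT + raw_credential).hexdigest()
    return f"{kind}-{digest[:8]}"  # a short prefix is enough to correlate

badge_token = credential_hash("B", b"badge-0042")
face_token = credential_hash("F", b"faceprint-bytes")
```

Stability is the key property: the system can learn that two tokens co-occur without ever knowing what they stand for.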


## The AI Model

A custom model is trained for this environment to detect:

- Persons — full-body detection for tracking and behavior analysis
- Luggage — detection for unattended item and luggage-switch scenarios
- Doors — detection for door state monitoring (closed, open, propped)

Model training uses Intel Geti, Intel’s end-to-end model training and management platform, with the captured dataset from the physical space. Geti enables rapid annotation, iterative training, and optimized model export — ensuring the model is tuned to the specific lighting, geometry, and camera angles of the deployment. Exported models are optimized for Intel hardware via OpenVINO and pipelined through Intel DL-Streamer.


## Architecture: The 4D Digital Twin Stack

The stack is organized from hardware to application, each layer cleanly separated:

```mermaid
block-beta
  columns 1

  block:agents_layer
    columns 2
    AGT["Spatial Agents<br/>Occupancy · Tailgating · Fall Detection<br/>MFA · Luggage Switch · Door Monitor"]
    BIZ["Spatial Business Logic<br/>Scene Event Processing<br/>Alert Generation · Pattern Learning"]
  end

  SS["Spatial Analytics and Tracking<br/>Intel SceneScape<br/>Metric-space tracking · 3D scene fusion · MQTT broker"]

  block:edge_layer
    columns 2
    VIS["Vision Analytics and AI<br/>Camera Streams · Object Detection<br/>Door · Person · Luggage"]
    SENS["Sensor Data Ingest<br/>Environmental · IoT<br/>Badge Reader · RealSense FaceID"]
  end

  AIFW["AI Frameworks and Libraries<br/>DL-Streamer · OpenVINO · Intel Geti"]

  OS["Operating System — Linux"]

  block:hw_layer
    columns 3
    CPU["CPU<br/>General Compute"]
    GPU["GPU<br/>Vision Workloads"]
    NPU["NPU<br/>Neural Inference"]
  end

  style AGT fill:#3a1a1a,color:#f8d8b8,stroke:#9a6a3a
  style BIZ fill:#3a1a1a,color:#f8d8b8,stroke:#9a6a3a
  style SS fill:#1a3a2a,color:#b8f0d8,stroke:#3a8a6a
  style VIS fill:#2a1a3a,color:#d8b8f8,stroke:#7a4a9a
  style SENS fill:#2a1a3a,color:#d8b8f8,stroke:#7a4a9a
  style AIFW fill:#1a2a3a,color:#b8d8f8,stroke:#3a6a9a
  style OS fill:#2b3a2b,color:#c8f0c8,stroke:#4a8a4a
  style CPU fill:#1c1c1c,color:#e0e0e0,stroke:#555
  style GPU fill:#1c1c1c,color:#e0e0e0,stroke:#555
  style NPU fill:#1c1c1c,color:#e0e0e0,stroke:#555
  style agents_layer fill:#3a1a1a,color:#f8d8b8,stroke:#9a6a3a
  style edge_layer fill:#2a1a3a,color:#d8b8f8,stroke:#7a4a9a
  style hw_layer fill:#2c2c2c,color:#e0e0e0,stroke:#555
```

## Layer Descriptions

### Hardware — Intel Panther Lake and Beyond

Next-generation Intel client and edge platforms with integrated CPU, GPU, and NPU provide heterogeneous compute for running inference workloads efficiently. Vision pipelines, tracking algorithms, and agent logic can be distributed across these accelerators to maximize throughput while minimizing power consumption — critical for always-on building infrastructure.

### OS — Linux

The deployment runs on Linux, providing a stable, containerizable foundation for all services. SceneScape and its dependencies are containerized, enabling repeatable deployment across edge nodes.

### AI Frameworks & Libraries

- Intel DL-Streamer — GStreamer-based pipeline framework for video analytics, handling camera ingest, pre/post-processing, and inference orchestration.
- Intel OpenVINO — Model optimization and inference runtime for Intel hardware, enabling quantized, hardware-accelerated inference of the door/person/luggage detection model.
- Intel Geti — End-to-end model training, annotation, and management platform used to build the custom door/person/luggage detection model on the captured dataset. Geti streamlines iterative training and exports optimized models directly to OpenVINO IR format.

### Vision Analytics & AI / Sensor Data Ingest

Two parallel ingest paths converge at SceneScape:

1. Camera pipelines — Each camera’s video stream passes through DL-Streamer/OpenVINO, producing per-frame bounding box detections (person, door, luggage) that are published to SceneScape’s MQTT ingest.
2. Sensor pipelines — IoT sensors (light, temperature, noise, badge reader, FaceID) publish their readings to SceneScape via MQTT. SceneScape distinguishes two sensor types with different tagging semantics:
   - Environmental sensors (light, temperature, noise) tag tracks with the current reading while the tracked object is within the measurement area. The reading updates continuously but does not persist on the track after the object leaves the area — making the value a live, location-bound annotation.
   - Attribute sensors (badge reader, FaceID) tag tracks with a value that persists on the track after the object leaves the sensor area. This means agents and applications can read identity or access attributes directly from a track anywhere in the scene, long after the person has moved away from the sensor.
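The two tagging semantics can be sketched with a toy in-memory track model. The field and method names here are invented for illustration; SceneScape's actual track schema differs:

```python
# Sketch of the two tagging semantics: attribute tags persist for the life
# of the track, environmental tags are live and location-bound.

class Track:
    def __init__(self, track_id):
        self.id = track_id
        self.attributes = {}   # persist on the track (attribute sensors)
        self.env = {}          # valid only inside a measurement area

    def tag_attribute(self, key, value):
        self.attributes[key] = value   # e.g. badge hash: stays attached

    def tag_environment(self, key, value):
        self.env[key] = value          # e.g. lux: current local reading

    def leave_measurement_area(self):
        self.env.clear()               # environmental tags do not persist

t = Track("person-7")
t.tag_attribute("badge", "B-4a7f")   # tagged at the entry reader
t.tag_environment("lux", 420)        # tagged inside the light sensor's area
t.leave_measurement_area()           # badge survives, lux does not
```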

### Spatial Analytics & Tracking — Intel SceneScape

SceneScape is the core of the digital twin. It:

- Fuses multi-camera detections into coherent tracks using the calibrated 3D scene geometry from the spatial scan.
- Expresses all entities in metric units — positions are in meters in world space, velocities are in m/s — enabling geometry-aware business logic with no pixel arithmetic.
- Fuses sensor attributes — when a person walks near the badge reader or FaceID sensor, their track is annotated with the corresponding anonymized hash.
- Tags tracks with environmental readings — environmental sensors (light, temperature, noise) annotate nearby tracks with the current sensor value while those tracks are within the measurement area. The annotation updates in real time but expires when the track leaves the area, so agents reading a track’s environmental attributes always see the current local conditions for that object — not a stale reading from a previous location. This is surfaced visually in the 3D SceneScape UI, where the rendered light level in the conference room tracks the physical sensor.
- Provides a queryable scene API — downstream agents and applications consume scene state over MQTT and the SceneScape REST API.

### Spatial Agents & Business Logic

All nine use cases are implemented as scene agents — they subscribe to SceneScape’s track and attribute stream and apply business logic purely on spatial data: positions, velocities, dwell times, track associations, and attribute hashes. Agents operate on scene data, not live video streams. However, scene events such as region entry/exit and tripwire crossings can trigger image or short clip capture, and one agent may trigger data capture that another agent then analyzes — for example, a tailgating or anomalous badge-entry detection triggering image evidence capture for review or secondary AI analysis.
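The agent pattern can be sketched as follows. Scene updates here are plain dicts delivered by a local dispatcher; in a real deployment they would arrive over an MQTT subscription, and the payload field names are illustrative, not SceneScape's actual schema:

```python
# Sketch of the agent pattern: registered handlers consume metric-space
# scene updates, never pixels.

HANDLERS = []

def agent(fn):
    """Register a business-logic function that receives every scene update."""
    HANDLERS.append(fn)
    return fn

def publish_scene_update(update: dict):
    # Stand-in for the MQTT broker delivering a scene message.
    for fn in HANDLERS:
        fn(update)

events = []

@agent
def occupancy_agent(update):
    # Pure spatial logic: count fused person tracks, no video involved.
    people = [t for t in update["tracks"] if t["category"] == "person"]
    events.append(("occupancy", len(people)))

publish_scene_update({"tracks": [
    {"id": "p1", "category": "person", "position": (2.0, 1.5, 0.0)},
    {"id": "p2", "category": "person", "position": (4.1, 0.8, 0.0)},
]})
```

Because every agent sees the same track stream, adding a use case means registering one more handler; nothing upstream changes.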


## Data Flow

```mermaid
flowchart BT
  LOGIC["Scene Agents & Business Logic<br/>Occupancy · Tailgating · Fall Detection · MFA · Luggage Switch · Door Monitor"]

  SS["Intel SceneScape<br/>Metric-Space Tracking · Attribute Fusion · 3D Digital Twin UI"]

  DLP["DL-Streamer Pipeline Server · OpenVINO<br/>Object Detection — Door · Person · Luggage"]
  SENSN["Sensor Network<br/>Light · Temp · Noise · Badge Reader · RealSense FaceID"]

  CAMS["Camera Network<br/>Showcase Area (7) · Conference Room (4)"]

  CAMS --> DLP
  DLP --> SS
  SENSN --> SS
  SS --> LOGIC

  style LOGIC fill:#3a1a1a,color:#f8d8b8,stroke:#9a6a3a
  style SS fill:#1a3a2a,color:#b8f0d8,stroke:#3a8a6a
  style DLP fill:#2a1a3a,color:#d8c8f8,stroke:#7a4a9a
  style SENSN fill:#3a2a1a,color:#f8e8c8,stroke:#9a7a3a
  style CAMS fill:#1a2a3a,color:#c8d8f8,stroke:#3a6a9a
```

## Use Cases

All nine use cases operate on scene data — the continuous stream of entity positions, velocities, dwell times, and fused attributes produced by SceneScape. The vision system’s job is detection; SceneScape’s job is spatial fusion; the agent’s job is interpretation.


### 1. Room Occupancy

**Zone:** Conference Room · **Data used:** Person track positions, entry/exit events

Multi-camera coverage of the conference room eliminates blind spots that would confound single-camera occupancy counting. SceneScape resolves persons across camera views into single tracks in 3D space, so each person is counted exactly once regardless of how many cameras see them. The occupancy count is a live scene property, suitable for HVAC control, scheduling systems, and capacity enforcement dashboards.
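A minimal sketch of the occupancy query, assuming each fused track exposes a single world-space floor position in meters. The room polygon, field names, and sample tracks are all invented:

```python
# Sketch: occupancy as a pure scene-data query over fused tracks.

def point_in_polygon(pt, poly):
    """Ray-casting test for a 2D point against a floor-plan polygon."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

# Hypothetical 6 m x 4 m conference room footprint in world coordinates.
CONFERENCE_ROOM = [(0.0, 0.0), (6.0, 0.0), (6.0, 4.0), (0.0, 4.0)]

def occupancy(tracks, room=CONFERENCE_ROOM):
    """Each fused track counts once, no matter how many cameras see it."""
    return sum(1 for t in tracks
               if t["category"] == "person" and point_in_polygon(t["xy"], room))

count = occupancy([
    {"id": "p1", "category": "person", "xy": (2.5, 1.0)},   # inside the room
    {"id": "p2", "category": "person", "xy": (7.5, 1.0)},   # out in the corridor
    {"id": "bag1", "category": "luggage", "xy": (3.0, 2.0)},
])
```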


### 2. Person Left Behind

**Zone:** Conference Room · **Data used:** Person tracks, room entry/exit transitions, dwell time

A group of persons enters the conference room, a meeting occurs, and the group exits — except one person who remains. The agent monitors the transition: when the room transitions from ≥ N occupants to a small residual count (or to 1) during a period when the norm is a full egress, an alert is triggered. In an emergency evacuation scenario, this pattern indicates someone who needs assistance. The agent tracks entry and exit events at the room boundary tripwire and monitors sustained presence post-group-departure.
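The transition logic might look like this sketch, operating on a sampled occupancy time series. The thresholds (group size, residual count, dwell length) are illustrative:

```python
# Sketch of the left-behind check: a sustained small residual occupancy
# immediately after a large-group egress.

def left_behind(samples, group_min=3, residual_max=1, dwell_samples=5):
    """Flag a persistent 1..residual_max occupancy right after a full meeting."""
    for i in range(1, len(samples)):
        if samples[i - 1] >= group_min and 0 < samples[i] <= residual_max:
            tail = samples[i:i + dwell_samples]
            if len(tail) == dwell_samples and all(0 < s <= residual_max for s in tail):
                return True
    return False

# Meeting ends: five people leave, one person stays behind.
alert = left_behind([6, 6, 6, 1, 1, 1, 1, 1])
ok = left_behind([6, 6, 6, 0, 0, 0, 0, 0])  # full egress, no alert
```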


### 3. Room Environment — Live Sensor in the Digital Twin

**Zone:** Conference Room · **Data used:** Light sensor (lux), temperature (°C), noise level (dB)

Environmental sensors are integrated into SceneScape as first-class scene properties, with a specific and powerful tagging behavior: rather than simply updating a global scene value, they annotate nearby object tracks with the current sensor reading while those tracks are within the measurement area. The reading updates continuously but does not persist on a track after it leaves the sensor zone — giving agents and applications an always-current, location-bound environmental context on each tracked object.

The 3D SceneScape UI reflects this: the rendered light level in the conference room follows the physical sensor reading. Temperature and noise data are similarly surfaced, enabling comfort analytics and policy enforcement (e.g., alerting when a person’s track carries a noise reading above threshold during designated quiet hours).


### 4. Tailgating Detection

**Zone:** Showcase Area Entry · **Data used:** Person tracks, badge attribute, entry tripwire

An entry tripwire is defined in metric space at the showcase area entrance. Every track that crosses the tripwire is examined for a badge attribute. A valid entry is one where a badge hash is associated with the track before or during the crossing event. A track that crosses the tripwire without a badge attribute is flagged as a potential tailgater — a person who entered physically without presenting credentials, likely by following an authorized person through the door.

The badge reader’s proximity-based attribute tagging means the association is spatial: a person must be near the reader for a badge hash to attach to their track. The logic is therefore: no badge attribute on inbound crossing → tailgating alert.
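The rule reduces to a few lines of agent logic. This sketch assumes each tripwire crossing event carries the crossing track's current attribute set; the event field names are invented:

```python
# Sketch of the tailgating rule: inbound crossing with no badge attribute.

def check_crossing(event):
    """Return a tailgating alert for a badge-less inbound crossing, else None."""
    attributes = event.get("attributes", {})
    if event["direction"] == "inbound" and "badge" not in attributes:
        return {"alert": "tailgating", "track": event["track_id"]}
    return None

badged = check_crossing(
    {"track_id": "p1", "direction": "inbound", "attributes": {"badge": "B-4a7f"}})
tailgater = check_crossing(
    {"track_id": "p2", "direction": "inbound", "attributes": {}})
```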


### 5. Physical Multi-Factor Authentication (MFA)

**Zone:** Showcase Area Entry · **Data used:** Badge attribute hash, FaceID attribute hash, long-term learned patterns

The badge reader and RealSense FaceID sensor each tag a nearby person’s track with their respective anonymized hashes. Over multiple legitimate access events, the system learns that hash pair (B-4a7f, F-9c2d) co-occur consistently on the same track — this person always presents their badge and their face together.

When an access event occurs where a badge hash appears with an unfamiliar or mismatched face hash — for example, Person A’s badge is used by Person B — the learned pattern is violated. The system flags this as a physical MFA anomaly. In the dataset, this scenario is demonstrated by persons deliberately switching badges and attempting re-entry.

Because only hashes are stored and no biometric data is retained, this operates as a privacy-preserving behavioral pattern detector rather than a biometric identification system.
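A sketch of the pair-learning idea, using a plain counter over hash pairs. The minimum-observation threshold and the sample hashes are illustrative:

```python
# Sketch of physical-MFA pattern learning: anomalies in hash-pair
# co-occurrence, with no identities involved.
from collections import Counter

class PairLearner:
    def __init__(self, min_observations=3):
        self.pairs = Counter()
        self.min_observations = min_observations

    def observe(self, badge_hash, face_hash):
        self.pairs[(badge_hash, face_hash)] += 1

    def is_anomalous(self, badge_hash, face_hash):
        """A well-known badge hash appearing with an unfamiliar face hash."""
        if self.pairs[(badge_hash, face_hash)] >= self.min_observations:
            return False  # the learned, legitimate pairing
        badge_known = any(b == badge_hash and n >= self.min_observations
                          for (b, _), n in self.pairs.items())
        return badge_known  # familiar badge, wrong face -> anomaly

learner = PairLearner()
for _ in range(5):
    learner.observe("B-4a7f", "F-9c2d")  # legitimate entries over time

normal = learner.is_anomalous("B-4a7f", "F-9c2d")    # learned pairing
switched = learner.is_anomalous("B-4a7f", "F-1ab0")  # badge-switch scenario
unknown = learner.is_anomalous("B-zzzz", "F-zzzz")   # no baseline yet
```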


### 6. Fall Detection

**Zone:** Any monitored zone · **Data used:** Person track position, velocity, aspect ratio change, dwell time

A person who falls undergoes a rapid, characteristic change in their spatial signature: their centroid drops quickly, their bounding-box aspect ratio changes from tall-and-narrow to wide-and-flat, and their velocity drops to zero and remains there. The scene agent monitors for this combination: rapid vertical transition + aspect ratio inversion + sustained stillness exceeding a configurable threshold. If the person does not resume movement within that window, a fall-and-no-response alert is issued. Unlike video-based approaches, this operates entirely on track kinematics and geometry in metric space.
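One way to sketch that combination on sampled track kinematics. All thresholds and the sample records are illustrative:

```python
# Sketch of the fall signature: rapid centroid drop + aspect-ratio
# inversion + sustained stillness, on per-sample track records.

def detect_fall(samples, still_samples=4):
    """Return the sample index of a detected fall, or None."""
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        dropped = prev["height"] - cur["height"] > 0.6       # fast centroid drop (m)
        inverted = prev["aspect"] > 1.5 and cur["aspect"] < 1.0  # tall -> flat
        if dropped and inverted:
            tail = samples[i:i + still_samples]
            if len(tail) == still_samples and all(s["speed"] < 0.1 for s in tail):
                return i  # fall with no resumed movement
    return None

walking = {"height": 0.9, "aspect": 2.5, "speed": 1.2}
fallen = {"height": 0.2, "aspect": 0.4, "speed": 0.0}
event = detect_fall([walking, walking, fallen, fallen, fallen, fallen])
```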


### 7. Unattended Luggage / Left Behind Object

**Zone:** Any monitored zone · **Data used:** Luggage track position, person track positions, association history, dwell time

When a luggage track is detected in the scene, the agent maintains an association between the luggage and the nearest person track. If the associated person departs — their track leaves the proximity radius — and the luggage remains stationary for a configurable dwell period with no new person association, the luggage is flagged as unattended. This is a pure spatial computation: distances in metric space, dwell time in seconds, no video required.
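The check itself is a short spatial predicate. This sketch assumes 2D floor positions in meters; the proximity radius and dwell threshold are illustrative:

```python
# Sketch of the unattended-luggage rule: metric distances plus dwell time.
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def unattended(luggage_xy, person_positions, stationary_seconds,
               radius_m=2.0, dwell_s=60.0):
    """Stationary luggage with no person track inside the proximity radius."""
    nearby = any(dist(luggage_xy, p) <= radius_m for p in person_positions)
    return (not nearby) and stationary_seconds >= dwell_s

attended = unattended((3.0, 2.0), [(3.5, 2.2)], stationary_seconds=90)
abandoned = unattended((3.0, 2.0), [(9.0, 8.0)], stationary_seconds=90)
```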


### 8. Luggage Switch Detection

**Zone:** Any monitored zone · **Data used:** Luggage tracks, person tracks, proximity events, pre/post-encounter luggage associations

Two individuals, each carrying an item, approach each other in the scene. Their tracks converge within a proximity threshold. After the encounter, both individuals depart — but the luggage association has swapped: each person leaves with the other’s luggage. The agent detects this by tracking luggage-to-person associations before and after a close-proximity event. A swap is detected when the post-encounter luggage track associated with a person differs from their pre-encounter luggage track. This is particularly relevant for detecting deliberate handoffs or exchanges in security-sensitive areas.
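A sketch of the before/after comparison, assuming the agent snapshots luggage-to-person association maps around the proximity event (the map shape and IDs are invented):

```python
# Sketch of swap detection: compare per-person luggage associations
# before and after a close-proximity event.

def detect_swap(before, after):
    """Return (person, old_bag, new_bag) for every changed association."""
    swapped = []
    for person, bag in before.items():
        new_bag = after.get(person)
        if new_bag is not None and new_bag != bag:
            swapped.append((person, bag, new_bag))
    return swapped

before = {"p1": "bag-A", "p2": "bag-B"}
after = {"p1": "bag-B", "p2": "bag-A"}   # each leaves with the other's bag
swaps = detect_swap(before, after)
```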


### 9. Door Propped Open

**Zone:** Showcase Area Entry / Conference Room · **Data used:** Door detection, door position/state, learned normal distribution

The object detection model includes door detection. Over time, SceneScape learns the normal positional distribution of each door in the scene — typically a tight cluster around the “closed” position. A door that deviates significantly from this learned position for a sustained period triggers a propped-open alert. Importantly, when a door is wide open, it may not be reliably detected at all — the absence of a door detection, combined with the last known open position and the elapsed time, is itself a signal the agent interprets. This makes the logic robust to detection gaps.
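A sketch of the deviation test, assuming a learned closed-position baseline and a simple z-score check. The angle samples, thresholds, and the idea of modeling "not detected" as `None` are all illustrative:

```python
# Sketch of the propped-open check: deviation from a learned closed-position
# baseline, sustained over time; absence of detection is also a signal.
import statistics

class DoorMonitor:
    def __init__(self, closed_angles, z_threshold=3.0, max_open_s=30.0):
        self.mean = statistics.mean(closed_angles)
        self.std = statistics.pstdev(closed_angles) or 1.0
        self.z_threshold = z_threshold
        self.max_open_s = max_open_s

    def check(self, angle, seconds_in_state):
        """angle=None models a wide-open door the detector no longer sees."""
        if angle is None:
            # No detection + elapsed time is itself a propped-open signal.
            return seconds_in_state > self.max_open_s
        z = abs(angle - self.mean) / self.std
        return z > self.z_threshold and seconds_in_state > self.max_open_s

# Learned baseline: observed angles cluster tightly around "closed".
monitor = DoorMonitor(closed_angles=[0.0, 1.0, -1.0, 0.5, -0.5])
closed_ok = monitor.check(angle=0.5, seconds_in_state=120)
propped = monitor.check(angle=40.0, seconds_in_state=120)
undetected = monitor.check(angle=None, seconds_in_state=120)
```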


## Architectural Principles

### Scene Data, Not Video

Every use case described above is implemented without ever analyzing video frames in the business logic layer. The vision pipeline’s output is a stream of bounding boxes; SceneScape’s output is a stream of metric-space track records. Business logic agents subscribe to the track stream only. This separation has several consequences:

- Privacy by architecture — no raw video is retained or processed downstream of the detection pipeline.
- Scalability — scene data is orders of magnitude smaller than video. Agents can run cheaply on the same edge node or on a management server.
- Composability — the same track stream feeds all nine use cases simultaneously. Adding a new use case is additive, not disruptive.
- Sensor agnosticism — the same agent logic works whether tracks come from 2 cameras or 20, and whether additional sensors are present or not.

### Edge-Native Deployment

All inference and tracking runs on-premises, on Intel edge hardware. No video or biometric data leaves the facility. Only derived, anonymized scene events are candidates for cloud reporting. This is consistent with enterprise security and data residency requirements.

### Privacy-Preserving by Design

- Environmental sensors produce non-personal data.
- Vision tracks are ephemeral geometric objects — they do not carry identity unless an attribute sensor explicitly annotates them.
- Attribute sensors hash credentials at the point of collection; the hash is one-way and cannot be reversed.
- Long-term pattern learning operates on hash pairs, not identity. The system detects anomalies in patterns, not identities of individuals.

### The Role of 3D Scanning

The 3D spatial scan is not merely a visualization asset. It is the calibration ground truth that makes multi-camera metric tracking possible. Without a shared 3D reference frame, detections from different cameras cannot be reliably fused in space. The scan turns a collection of independent cameras into a coherent sensor array.


## From Blueprint to Deployment

This blueprint represents a complete, end-to-end reference architecture. The key integration points for deployment practitioners are:

| Integration Point | Technology | Notes |
| --- | --- | --- |
| 3D Scene Capture | Photogrammetry / LiDAR tool (e.g. Polycam*, Matterport*) | Export as glTF (.glb) for SceneScape import |
| Camera Calibration | Intel SceneScape | Manual or automated using reference points in 3D scan |
| Video Ingest | Intel DL-Streamer | GStreamer pipeline with RTSP/USB camera sources |
| Object Detection | OpenVINO | Fine-tuned on site-specific dataset |
| Environmental Sensors | MQTT → SceneScape | Standard SceneScape environmental sensor API |
| Attribute Sensors (Badge/Face) | MQTT → SceneScape | SceneScape attribute sensor API with local hashing |
| Scene Event Bus | MQTT | SceneScape publishes track/attribute updates |
| Business Logic Agents | Python / any MQTT client | Subscribe to SceneScape scene topic |
| Digital Twin UI | SceneScape 3D UI | WebGL-based, shows live scene with sensor overlays |
| Hardware Platform | Intel Panther Lake | CPU + GPU + NPU; OpenVINO targets all three |


## Conclusion

The Smart Building Digital Twin blueprint demonstrates that the convergence of edge vision AI, spatial computing, and heterogeneous Intel hardware makes it possible to build smart spaces that are simultaneously more capable, more private, and more efficient than traditional video analytics deployments.

By anchoring every camera and sensor to a shared 3D map, fusing detections into metric-space tracks, and operating all business logic on scene data rather than video, the architecture delivers nine distinct smart building use cases from a single unified pipeline. The same foundation scales from a single conference room to a campus.

The 4D live digital twin — three spatial dimensions plus continuous time — is not a visualization. It is an operational system. And with Intel SceneScape as its spatial computing core, it is deployable on real hardware, with real data, today.


This blueprint is part of the Intel Open Edge Platform architecture blueprint series. For technical documentation see the Intel SceneScape docs and the SceneScape repository on GitHub.