Reading the Room: Real-Time Multimodal Intent Understanding

State estimation, not mind reading

In interviews, training, and support calls, a great human facilitator constantly reads the room, noticing confusion, hesitation, or the moment something clicks. We set out to estimate those signals in real time, while being explicit about a hard limit: the system estimates observable state. It does not read minds, detect deception, or infer hidden intent. Every output is a probability, not a verdict.

Three modalities, fused late

We combine three independent signals, each computed in the browser:

Vision — facial affect and attention from the live camera frame.
Voice — prosody: pitch, energy, tempo, and pause structure.
Speech — discourse cues mapped from the live transcript.

Reliability weighting is the whole game

Any single modality fails often in the real world: a hand covers the face, the room goes silent, or no one has spoken for a while. The fix is late fusion with per-modality reliability weighting: when a signal is unreliable, it is automatically down-weighted rather than allowed to poison the estimate.
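
To make the idea concrete, here is a minimal sketch of reliability-weighted late fusion. It assumes each modality emits its own per-state probabilities plus a reliability score between 0 and 1; the names ModalityEstimate and fuseLate, and the weighted-average rule itself, are illustrative assumptions rather than a description of the production code.

```typescript
// Hypothetical shape: one estimate per modality (vision, voice, speech).
interface ModalityEstimate {
  probs: number[];      // one probability per state, same order in every modality
  reliability: number;  // assumed score: 0 = ignore entirely, 1 = fully trusted
}

// Combine per-modality estimates as a reliability-weighted average.
// Returns null when no modality is usable, so the caller can show
// "no estimate" instead of a made-up number.
function fuseLate(estimates: ModalityEstimate[]): number[] | null {
  if (estimates.length === 0) return null;

  const numStates = estimates[0].probs.length;
  const fused = new Array<number>(numStates).fill(0);
  let totalWeight = 0;

  for (const e of estimates) {
    // Clamp so a buggy upstream score cannot produce negative or oversized weights.
    const w = Math.min(1, Math.max(0, e.reliability));
    totalWeight += w;
    for (let i = 0; i < numStates; i++) {
      fused[i] += w * e.probs[i];
    }
  }

  if (totalWeight === 0) return null;
  return fused.map((p) => p / totalWeight);
}
```

With this rule, a modality whose reliability drops to 0 (say, vision while a hand covers the face) contributes nothing, and the remaining modalities are renormalized so the output still sums to 1.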

A brief (~3-second) calibration captures a resting baseline for the face and a loudness reference, so the system measures deviation from that person's own normal rather than from a one-size-fits-all template.
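
One simple way to express "deviation from a personal baseline" is a z-score against the samples collected during calibration. The sketch below assumes a single scalar feature (for example, loudness) sampled during the calibration window; calibrate, deviation, and the minStd floor are hypothetical names and choices, not the system's actual interface.

```typescript
// Summary of one feature over the calibration window.
interface Baseline {
  mean: number;
  std: number;
}

// Compute the resting baseline from samples gathered during calibration.
function calibrate(samples: number[]): Baseline {
  if (samples.length === 0) {
    throw new Error("calibration needs at least one sample");
  }
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
  return { mean, std: Math.sqrt(variance) };
}

// Express a live reading as "how many baseline standard deviations from normal".
// minStd is an assumed floor that avoids dividing by zero when the
// calibration window happened to be perfectly flat.
function deviation(value: number, baseline: Baseline, minStd = 1e-3): number {
  return (value - baseline.mean) / Math.max(baseline.std, minStd);
}
```

A reading equal to the baseline mean scores 0, and the same raw loudness can score very differently for a naturally quiet speaker and a naturally loud one, which is the point of calibrating per person.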

A bounded, honest output space

Rather than an open-ended emotional read, the model maps to nine bounded states that are actually actionable: engaged, agreeing, confused, hesitant, frustrated, insight, bored, distracted, and wants-to-end-turn.

A lightweight logistic-regression classifier — trainable directly in the browser from short labeled bursts — turns a 24-dimensional feature vector into those state probabilities. Keeping the classifier small keeps it fast, private, and easy to retrain for a new context.
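
For readers who want to see how small such a classifier can be, here is a sketch of multinomial (softmax) logistic regression over a 24-dimensional input and the nine states listed above. Treating the nine states as one softmax rather than, say, nine one-vs-rest models is an assumption, as are the names createModel, predict, and trainStep and the learning rate.

```typescript
const STATES = [
  "engaged", "agreeing", "confused", "hesitant", "frustrated",
  "insight", "bored", "distracted", "wants-to-end-turn",
] as const;
const NUM_FEATURES = 24;

interface Model {
  weights: number[][]; // STATES.length rows, NUM_FEATURES columns
  bias: number[];      // one bias per state
}

// Start from all zeros, which yields a uniform prediction.
function createModel(): Model {
  return {
    weights: Array.from({ length: STATES.length }, () =>
      new Array<number>(NUM_FEATURES).fill(0)),
    bias: new Array<number>(STATES.length).fill(0),
  };
}

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Map one feature vector to a probability per state.
function predict(model: Model, x: number[]): number[] {
  if (x.length !== NUM_FEATURES) {
    throw new Error(`expected ${NUM_FEATURES} features, got ${x.length}`);
  }
  const logits = model.weights.map((row, k) =>
    row.reduce((acc, w, j) => acc + w * x[j], model.bias[k]));
  return softmax(logits);
}

// One stochastic-gradient step on the cross-entropy loss for a single
// labeled example. For softmax, the gradient of the loss with respect to
// logit k is (predicted probability - one-hot target).
function trainStep(model: Model, x: number[], label: number, lr = 0.1): void {
  const probs = predict(model, x);
  for (let k = 0; k < STATES.length; k++) {
    const grad = probs[k] - (k === label ? 1 : 0);
    for (let j = 0; j < NUM_FEATURES; j++) {
      model.weights[k][j] -= lr * grad * x[j];
    }
    model.bias[k] -= lr * grad;
  }
}
```

At this size the whole model is 9 × 24 weights plus 9 biases, 225 numbers in total, which is why retraining from a short labeled burst is cheap enough to do on the client.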

From signal to action

On top of the per-frame estimate sits a simple escalation layer: when a state persists past a threshold — for example, sustained frustration during a support call — it raises a timestamped alert a human can act on. The value is not the label; it is giving people a timely, well-calibrated nudge while keeping them firmly in control.
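
The persistence rule can be sketched as a small state machine that tracks how long a state's probability has stayed above a cutoff. Everything specific here is an assumption made for illustration: the createEscalator name, the Alert shape, the probability cutoff, and the choice to fire once per sustained episode rather than on every frame.

```typescript
// Hypothetical alert record handed to a human reviewer.
interface Alert {
  state: string;
  since: number;    // timestamp (ms) when the state first crossed the cutoff
  raisedAt: number; // timestamp (ms) when the alert fired
}

// Returns an update function to call once per frame with the current
// probability of the watched state and the current time in milliseconds.
function createEscalator(state: string, minProb: number, minDurationMs: number) {
  let activeSince: number | null = null;
  let alerted = false;

  return function update(prob: number, nowMs: number): Alert | null {
    if (prob < minProb) {
      // The state dropped below the cutoff: reset so a later episode can alert again.
      activeSince = null;
      alerted = false;
      return null;
    }
    if (activeSince === null) activeSince = nowMs;

    // Fire exactly once per sustained episode.
    if (!alerted && nowMs - activeSince >= minDurationMs) {
      alerted = true;
      return { state, since: activeSince, raisedAt: nowMs };
    }
    return null;
  };
}
```

Usage might look like `const watchFrustration = createEscalator("frustrated", 0.7, 10_000)`, followed by a call to `watchFrustration(prob, Date.now())` on each frame. A single dip below the cutoff resets the timer in this sketch; a real deployment would likely smooth the probabilities first so one noisy frame does not cancel a genuine episode.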