Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes

Luke Palmer1*, Petar Palasek1*, Hazem Abdelkawy2
1GlimpseML, 2Toyota Motor Europe
*Indicates Equal Contribution

Left: ground-truth gaze from six people. Right: six autoregressive simulations from our method. We present an autoregressive graph transformer approach that simulates continuous human-like gaze sequences, and release Focus100, a multi-subject gaze dataset for temporal attention modelling. Our framework produces state-of-the-art gaze trajectories, scanpaths, and saliency maps from a single model.

Abstract

Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods typically collapse gaze into saliency maps or scanpaths, treating gaze dynamics only implicitly. We instead formulate gaze modelling as an autoregressive dynamical system and explicitly unroll raw gaze trajectories over time, conditioned on both gaze history and the evolving environment. Driving scenes are represented as gaze-centric graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze, traffic objects, and road structure. We further introduce the Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic and object-centric nature of attentional shifts in complex environments. We also release Focus100, a new dataset of raw gaze data from 30 participants viewing egocentric driving footage. Trained directly on raw gaze, without fixation filtering, our unified approach produces more natural gaze trajectories, scanpath dynamics, and saliency maps than existing attention models, offering valuable insights for the temporal modelling of human attention in dynamic environments.

Affinity Relation Transformer + Object Density Network


We propose a novel framework for gaze prediction that treats gaze as an active participant in its environment, modelling its spatiotemporal evolution directly. Our approach leverages graph-based simulation (GBS), which captures complex relationships by representing a system as a graph of objects/agents and their interactions; to our knowledge, this is the first application of GBS to attention modelling.
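The autoregressive rollout described above can be sketched in a few lines. This is a toy, self-contained illustration, not the paper's implementation: the `toy_model` predictor, its drift factor, and the noise scale are all hypothetical stand-ins for the learned model.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(history, frame):
    """Hypothetical stand-in for the learned predictor: returns the mean
    and isotropic std of a Gaussian over the next gaze point, drifting
    toward the frame's most salient location."""
    mean = 0.8 * history[-1] + 0.2 * frame  # pull gaze toward the target
    return mean, 0.01

def unroll_gaze(model, gaze_history, frames, rng):
    """Autoregressively sample a gaze trajectory, feeding each sampled
    point back in as conditioning for the next step."""
    traj = list(gaze_history)
    for frame in frames:
        mean, std = model(traj, frame)
        traj.append(mean + std * rng.standard_normal(2))  # stochastic step
    return np.array(traj[len(gaze_history):])

history = [np.array([0.5, 0.5])]            # gaze starts at screen centre
frames = [np.array([0.9, 0.4])] * 30        # a static salient point, for illustration
sim = unroll_gaze(toy_model, history, frames, rng)
```

Because each sampled point conditions the next prediction, repeated rollouts from the same history diverge, mirroring the subject-to-subject variability seen in the data.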

A gaze-centric traffic graph, processed by our Affinity Relation Transformer (ART), captures dynamic interactions between driver gaze and traffic objects, injecting relationship information into message passing. An Object Density Network (ODN) then predicts next-step gaze distributions, accounting for the object-centric nature of attention shifts.
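To illustrate the object-centric density idea (this is not the paper's actual ODN parameterisation), a next-gaze distribution could be modelled as a Gaussian mixture with one component per traffic object, weighted by per-object attention scores:

```python
import numpy as np

def object_mixture_density(objects, weights, sigma=0.05):
    """Sketch of an object-centric next-gaze density: a Gaussian mixture
    with one component per traffic object. The attention weights would be
    predicted by the network; here they are given for illustration."""
    objects = np.asarray(objects, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise to a valid mixture

    def pdf(point):
        d2 = ((np.asarray(point, dtype=float) - objects) ** 2).sum(axis=1)
        comps = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)
        return float(weights @ comps)

    def sample(rng):
        k = rng.choice(len(weights), p=weights)  # pick an object to attend to
        return objects[k] + sigma * rng.standard_normal(2)

    return pdf, sample

pdf, sample = object_mixture_density(
    objects=[(0.30, 0.50), (0.70, 0.60)],  # e.g. a pedestrian and a car
    weights=[0.9, 0.1],                    # illustrative attention scores
)
```

Sampling first selects an object and then perturbs around it, so simulated gaze shifts land on scene objects rather than arbitrary image locations.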

A traffic scene as a heterogeneous graph, with nodes representing road structure, traffic agents, and ego-centric driver gaze.

Focus100 Dataset

Focus100 dataset examples. The top row shows diversity in pedestrian traffic, hazard level, and road type. The bottom row shows the same video frame overlaid with the gaze samples of three separate subjects over the previous 2 s, demonstrating the diversity of temporal gaze patterns for the same stimulus.

Focus100 is a new dataset designed to facilitate research on dynamic human attention in driving scenarios, particularly for the development and evaluation of gaze estimation models. This dataset addresses critical limitations in existing driving gaze datasets, which often lack raw temporal gaze data or sufficient scenario diversity.

Focus100 features high-quality, multi-subject gaze data collected from 30 engaged subjects viewing 100 one-minute videos of ego-centric driving footage. The dataset comprises 15 hours of raw gaze sequences, with at least 7 subjects per video, providing a valuable resource for temporal attention modelling and automotive safety research.

Gaze Dynamics


We compared our simulations to ground-truth (human) gaze and to state-of-the-art driving and ego-centric gaze estimation methods on the Focus100 dataset. Above, example gaze traces (left) and saliency maps (right) are shown across a Focus100 video. Each column is a method, each line is a ground-truth or simulated gaze sequence (x-coordinate plotted over time for interpretability), and the average fixation duration (FD) is given per method. ART (ours) produces more human-like temporal gaze sequences than existing approaches. Saliency maps, aligned temporally with the gaze sequences, are also shown for each method. ART saliency maps (aggregated across simulations) show better spatial alignment with human fixations than baselines designed specifically for saliency map generation. These results are borne out across gaze-similarity, fixation-dynamics, and saliency metrics:

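For reference, one standard saliency metric of the kind used in such comparisons is Normalized Scanpath Saliency (NSS): the mean z-scored saliency value at human fixation locations. A minimal sketch (the paper's full metric suite is broader):

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: z-score the saliency map, then
    average its values at the given (x, y) fixation pixels. Higher is
    better; chance level is 0."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(np.mean([s[y, x] for x, y in fixations]))

sal = np.zeros((8, 8))
sal[3, 3] = 1.0  # toy map: all predicted saliency on one pixel
hit = nss(sal, fixations=[(3, 3)])   # fixation on the predicted peak
miss = nss(sal, fixations=[(0, 0)])  # fixation away from the peak
```

A map that concentrates mass where humans actually fixate scores well above zero, while a misaligned map scores at or below zero.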

Saliency Maps

Using our generated simulations we can derive other gaze representations, such as scanpaths and saliency maps. Above are heatmaps produced by ART simulations over a Focus100 test-set sequence (unseen during model training); the overlaid red circles are ground-truth (human) gaze points.
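Aggregating sampled gaze points into such a heatmap can be sketched as follows; this uses a common Gaussian-accumulation scheme, with kernel width and resolution chosen for illustration rather than taken from the paper:

```python
import numpy as np

def saliency_from_gaze(samples, shape=(64, 64), sigma=2.0):
    """Turn (x, y) gaze samples (normalised to [0, 1]) into a saliency
    heatmap by accumulating a small Gaussian per sample, then rescaling
    the map to a [0, 1] peak."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    heat = np.zeros(shape)
    for x, y in samples:
        cx, cy = x * (w - 1), y * (h - 1)  # map normalised coords to pixels
        heat += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma**2))
    if heat.max() > 0:
        heat /= heat.max()
    return heat

# Two simulations agree on the upper-left region; one looks elsewhere.
heat = saliency_from_gaze([(0.25, 0.25), (0.26, 0.24), (0.75, 0.75)])
```

Because the map is built from many sampled trajectories, regions where simulations agree accumulate more mass, which is what makes the aggregated maps comparable to dedicated saliency predictors.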

BibTeX

@inproceedings{palmer_beyond_scanpaths_cvpr2026,
  title     = {Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes},
  author    = {Palmer, Luke and Palasek, Petar and Abdelkawy, Hazem},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}