Brandon B. May

Staff Engineer — Large Driving Models

Motional

Robot Learning · World Models · VLAs · Foundation Models

About

I’m a Staff Engineer at Motional, where I train and distill Vision Language Action (VLA) models for autonomous driving. My work centers on generative AI for robotics, specifically world models, learned perception, and VLAs. My ultimate goal is to help build a generally intelligent robot, and I believe the path runs through models that can imagine, simulate, and act in the physical world.

My career has been a steady zoom out from physics to pixels to policies. It started with math and physics at Skidmore and imaging science at RIT, then five years of computational imaging and SLAM at MITRE, DARPA-funded perception R&D at STR, and most recently VLAs and world models for robotic manipulation at the Robotics & AI Institute. If you’re building in this space, let’s connect.

Publications

[Figure: SIMIFY robot manipulation]

SIMIFY: Generative Real-to-Sim Enables Multi-Object Spatial and Physical Reasoning

Brandon B. May et al.

arXiv, March 2026

TL;DR: A training-free, test-time framework that reconstructs simulation-ready assets from a single RGB-D image using 3D generative and vision-language models, then launches thousands of parallel physics rollouts with evolutionary search to optimize object arrangements for language-specified tasks. Using only off-the-shelf foundation models, it achieves 67% success on real robot hardware, surpassing prior baselines.
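The core search loop described above can be illustrated with a toy sketch. This is not SIMIFY's implementation: the population size, mutation scheme, and the stand-in distance-to-goal fitness are all illustrative assumptions; in the paper the fitness of a candidate arrangement would come from parallel physics rollouts.

```python
import numpy as np

def evolutionary_search(fitness, dim, pop_size=64, generations=30,
                        elite_frac=0.25, sigma=0.1, seed=0):
    """Toy elitist evolutionary search over a pose vector.

    `fitness` is any callable on a candidate (higher is better); here it
    stands in for scoring an object arrangement via simulated rollouts.
    """
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))  # candidate arrangements
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        scores = np.array([fitness(p) for p in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]       # keep the best candidates
        # Resample the population by mutating randomly chosen elites.
        parents = elite[rng.integers(n_elite, size=pop_size)]
        pop = parents + rng.normal(0.0, sigma, size=(pop_size, dim))
    scores = np.array([fitness(p) for p in pop])
    return pop[np.argmax(scores)]

# Stand-in fitness: distance of a 2D object pose to a hypothetical goal pose.
goal = np.array([0.3, -0.5])
best = evolutionary_search(lambda p: -np.linalg.norm(p - goal), dim=2)
```

Because every candidate is scored independently, the inner fitness evaluations parallelize naturally, which is what makes "thousands of rollouts" practical at test time.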

[Figure: Real-is-Sim framework diagram]

Real-is-Sim: Bridging the Sim-to-Real Gap with a Dynamic Digital Twin for Real-World Robot Policy Evaluation

Jad Abou-Chakra, Lingfeng Sun, Krishan Rana, Brandon B. May, Karl Schmeckpeper, Niko Sünderhauf, Maria Vittoria Minniti, Laura Herlant

ICRA 2026

TL;DR: We invert the traditional sim-to-real paradigm by building a dynamic digital twin powered by an Embodied Gaussian simulator that synchronizes with the real world at 60Hz. Instead of training on real hardware or adapting simulations post-hoc, policies always execute on a virtual robot while the physical robot mirrors its joint states. Continuous real-world measurements correct the simulation on the fly, and we validate the approach on long-horizon manipulation tasks like PushT, showing tight alignment between virtual evaluations and physical results.
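The execution paradigm above can be sketched as a simple 1-D control loop. Everything here is a stand-in (scalar "joint", proportional policy, Gaussian noise, correction gain), not the Embodied Gaussian simulator; the point is only the inverted data flow: the policy acts on the twin, hardware mirrors it, and measurements pull the twin back toward reality.

```python
import random

def real_is_sim_loop(steps=200, gain=0.2, dt=1.0 / 60.0, seed=0):
    """Toy real-is-sim loop at a 60 Hz step: policy -> twin -> hardware."""
    rng = random.Random(seed)
    sim_q, real_q, target = 0.0, 0.05, 1.0    # twin joint, real joint, goal
    for _ in range(steps):
        cmd = 0.5 * (target - sim_q)           # policy sees only the twin
        sim_q += cmd * dt                      # twin integrates the command
        real_q += cmd * dt + rng.gauss(0.0, 1e-3)   # hardware mirrors, imperfectly
        meas = real_q + rng.gauss(0.0, 1e-3)   # noisy measurement of the real joint
        sim_q += gain * (meas - sim_q)         # correction keeps twin synchronized
    return sim_q, real_q

sim_q, real_q = real_is_sim_loop()
```

The continuous correction term is what keeps virtual evaluations aligned with physical outcomes even as hardware drifts from the commanded motion.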

[Figure: PIEGraph method overview]

Learning Equivariant Neural-Augmented Object Dynamics from Few Interactions

Sergio Orozco, Brandon B. May, Tushar Kusnur, George Konidaris, Laura Herlant

RINO @ CoRL 2025 · Best Extended Abstract

TL;DR: PIEGraph learns physically grounded dynamics for rigid and deformable objects from just a few minutes of human interaction data. It combines a physics-informed spring-mass prior with an action-conditioned equivariant graph neural network, maintaining physical plausibility (no interpenetration, shape preservation) where standard particle-based models break down. Demonstrated on ropes, cloth, stuffed animals, and rigid bodies for robotic planning.
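The spring-mass prior underlying PIEGraph can be sketched as a standard explicit-Euler mass-spring step; the stiffness, damping, and two-particle "rope" below are illustrative choices, and the learned action-conditioned equivariant GNN that corrects this prior is omitted entirely.

```python
import numpy as np

def spring_mass_step(pos, vel, edges, rest_len, k=50.0, damping=0.9, dt=0.01):
    """One explicit-Euler step of a spring-mass system (unit masses).

    pos: (N, 2) particle positions; edges: list of (i, j) spring pairs
    with matching rest lengths. The prior resists stretching, which is
    what preserves shape and prevents particles from drifting apart.
    """
    force = np.zeros_like(pos)
    for (i, j), r0 in zip(edges, rest_len):
        d = pos[j] - pos[i]
        dist = np.linalg.norm(d) + 1e-9
        f = k * (dist - r0) * (d / dist)    # Hooke's law along the spring
        force[i] += f
        force[j] -= f
    vel = damping * (vel + force * dt)
    return pos + vel * dt, vel

# A stretched two-particle "rope" relaxes back toward its rest length.
pos = np.array([[0.0, 0.0], [2.0, 0.0]])
vel = np.zeros_like(pos)
edges, rest_len = [(0, 1)], [1.0]
for _ in range(500):
    pos, vel = spring_mass_step(pos, vel, edges, rest_len)
length = float(np.linalg.norm(pos[1] - pos[0]))
```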

[Figure: Theia method overview]

Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, Laura Herlant

CoRL 2024

TL;DR: Theia distills multiple off-the-shelf vision foundation models (DINOv2, CLIP, SAM, Depth-Anything, and more) into a single compact model optimized for robot learning. The result outperforms any individual teacher model while being smaller and requiring less training data. We also find that higher entropy in feature norm distributions correlates with better downstream robot performance, offering a practical proxy for representation quality. Pre-trained models available on Hugging Face.
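The multi-teacher distillation objective can be sketched in a few lines. This is a simplified stand-in, not Theia's code: random arrays replace real teacher features, the per-teacher "translator" heads are plain linear maps, and the feature widths are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "teacher" features for one image: in Theia these would come from
# DINOv2, CLIP, SAM, etc.; here they are random stand-ins of different widths.
teachers = {"dinov2": rng.normal(size=(196, 768)),
            "clip":   rng.normal(size=(196, 512)),
            "sam":    rng.normal(size=(196, 256))}

student_dim = 384
student_feat = rng.normal(size=(196, student_dim))   # compact student tokens

# One linear head per teacher maps the shared student representation into
# that teacher's feature space; the distillation loss is the summed MSE.
heads = {name: rng.normal(scale=0.02, size=(student_dim, t.shape[1]))
         for name, t in teachers.items()}

def distill_loss(student_feat, heads, teachers):
    total = 0.0
    for name, target in teachers.items():
        pred = student_feat @ heads[name]    # translate to teacher space
        total += float(np.mean((pred - target) ** 2))
    return total

loss = distill_loss(student_feat, heads, teachers)
```

Because only the small shared backbone is kept for downstream robot learning, the student stays compact while absorbing supervision from every teacher at once.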

[Figure: Sancdifi method overview]

Salient Conditional Diffusion for Defending Against Backdoor Attacks

Brandon B. May, N. Joseph Tatro, Piyush Kumar et al.

Backdoor Attacks & Defenses @ ICLR 2023 · Spotlight

TL;DR: Sancdifi defends against backdoor attacks by using a denoising diffusion model to degrade and recover images, with saliency-based conditioning that concentrates the diffusion on the most important regions. This strips backdoor triggers while preserving legitimate features. Crucially, it works as a black-box defense with no access to the target model's internals.
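The saliency-conditioned degradation step can be illustrated with a toy forward-diffusion sketch. The noise schedule, saliency map, and image are all invented for illustration, and the reverse (recovery) pass is omitted since it requires a trained denoising diffusion model; the point is only that higher-saliency pixels receive a larger effective timestep, so a trigger in a salient region is degraded most strongly.

```python
import numpy as np

def salient_forward_diffuse(img, saliency, t_max=0.8, seed=0):
    """Toy saliency-conditioned forward diffusion.

    Per-pixel noise level scales with saliency (a stand-in for the DDPM
    forward process q(x_t | x_0) with a spatially varying timestep).
    """
    rng = np.random.default_rng(seed)
    t = t_max * saliency                   # per-pixel noise level in [0, t_max]
    alpha_bar = 1.0 - t                    # toy noise schedule
    noise = rng.normal(size=img.shape)
    return np.sqrt(alpha_bar) * img + np.sqrt(1.0 - alpha_bar) * noise

img = np.ones((8, 8))                      # trivially uniform "image"
saliency = np.zeros((8, 8))
saliency[2:5, 2:5] = 1.0                   # hypothetical high-saliency trigger region
noised = salient_forward_diffuse(img, saliency)
```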

[Figure: Overhead imagery dataset samples]

Comprehensive Dataset of Synthetic and Manipulated Overhead Imagery for Development and Evaluation of Forensic Tools

Brandon B. May, Kirill Trapeznikov, Shengbang Fang, Matthew C. Stamm

IH&MMSec 2023 · Best Paper

TL;DR: We release a first-of-its-kind dataset of real, fully synthetic, and partially manipulated overhead imagery for forensic research. The synthetic images are generated from a custom diffusion model trained across multiple zoom levels and data sources, enabling research into detecting and localizing manipulated satellite imagery.

[Figure: Explainable face recognition saliency maps]

Explainable Face Recognition

Jonathan R. Williford, Brandon B. May, Jeffrey Byrne

ECCV 2020

TL;DR: We introduce the first comprehensive benchmark for explainable face recognition, including “the inpainting game,” a standardized evaluation protocol of 3,648 triplets where facial features are synthetically modified to create ground truth. We propose two new attention methods, subtree EBP and DISE (Density-based Input Sampling for Explanation), which significantly outperform prior techniques at revealing which facial regions drive a network's matching decisions.

Interested in collaborating or just want to say hello?