Welcome to my academic homepage. I am a machine learning researcher. I received my Ph.D. in Computer Science from the University of California, Los Angeles (UCLA), and completed my undergraduate studies in Computer Science at Tsinghua University.

I work on multimodal learning for understanding, reasoning and skill learning. In particular, I'm interested in building models/agents that can learn from 2D/3D vision and text data, and perform a wide range of reasoning and embodied control tasks. Some of my research keywords can be found below:

  • Multimodal learning: Vision and language, Visual reasoning, 3D vision, Generalist models
  • Representation learning: Zero-shot and few-shot learning, Generative models
  • Embodied agents: Reinforcement learning and imitation, Robotics, Sensor fusion
Email: jeasinema [at] gmail [dot] com / Google Scholar / LinkedIn

News


Recent Publications


Preprints

Rujie Wu*, Xiaojian Ma*, Hai Ci, Yue Fan, Rongpeng Su, Yuxuan Wang, Haozhe Zhao*, Qing Li, Yizhou Wang
LongViTU: Instruction Tuning for Long-Form Video Understanding
arXiv preprint / arXiv / Project
Synthetic data for long-form video understanding.

Jun Guo*, Xiaojian Ma*, Yue Fan, Huaping Liu, Qing Li
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting
arXiv preprint / arXiv / Project

Team CraftJarvis
🐭 RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
arXiv preprint / arXiv / Project / Demo 
Significant improvements on creative writing, code generation, and math problems.

Sirui Xie, Xiaojian Ma, Peiyu Yu, Yixin Zhu, Ying Nian Wu and Song-Chun Zhu
HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving
arXiv preprint / Paper / Click Here to Play HALMA! / arXiv


Conference and Journal

Huimin Wu*, Xiaojian Ma*, Haozhe Zhao*, Yanpeng Zhao, Qing Li
NEP: Autoregressive Image Editing via Next Editing Token Prediction
NeurIPS 2025 / arXiv / Project
Autoregressive text-to-image generation with native image editing and test-time scaling.

Yue Fan*, Xiaojian Ma*, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, Qing Li
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
ICCV 2025 / arXiv / Project
A memory framework for embodied agents in household environments (and beyond!).
Spotlight presentation.

Team CraftJarvis
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
CVPR 2025 / arXiv / Project
Learning to interact with all your surroundings via self-supervision enables open-world agents :)

Zhi Gao*, Bofei Zhang*, Pengxiang Li*, Xiaojian Ma, Yue Fan, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
ICLR 2025 / arXiv / Project
Spotlight presentation.

Team CraftJarvis
JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
T-PAMI / arXiv / Project / Code 
Embodied RAG meets open-world agents.

Haozhe Zhao*, Xiaojian Ma*, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
NeurIPS 2024 D&B Track / arXiv / Demo / Project / Code
Free-form and region-based image editing made easy with language.

Team CraftJarvis
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
NeurIPS 2024 / arXiv / Project / Code 
Native modeling of multimodal interaction (VLA) data.

Yue Fan*, Xiaojian Ma*, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
ECCV 2024 / arXiv / Project
Zero-shot long-form video understanding, shrinking the gap to Gemini.

Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
Unifying 3D Vision-Language Understanding via Promptable Queries
ECCV 2024 / arXiv / Project / Code 
Unifying open-vocabulary perception and reasoning in the 3D world.

Xiaojian Ma*, Jiangyong Huang*, Silong Yong*, Xiongkun Linghu*, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
LEO: An Embodied Generalist Agent in 3D World
ICML 2024 / arXiv / Project / Code / Demo / Video 

Ran Gong*, Xiaojian Ma*, Qiuyuan Huang*, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao
MindAgent: Emergent Gaming Interaction
NAACL 2024 Findings / Paper / arXiv / Project / Code
A benchmark and other infrastructure for LLMs in general multi-player gaming.

Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
CVPR 2024 / arXiv / Project
A self-improving language agent that sharpens its tools.

Team CraftJarvis
GROOT: Learning to Follow Instructions by Watching Gameplay Videos
ICLR 2024 / Paper / arXiv / Project / Code 
Spotlight presentation. Closing the human-machine gap on instruction following, measured by Elo rating.

Haozhe Zhao*, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
ICLR 2024 / Paper / arXiv / Demo / Code
Ranked 1st on MMBench and MME in 08/2023.

Rujie Wu*, Xiaojian Ma*, Qing Li, Wei Wang, Zhenliang Zhang, Song-Chun Zhu, Yizhou Wang
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
ICLR 2024 / Paper / arXiv / Project / Code 
The third chapter of the Bongard trilogy, for the LM era (chapter 1, chapter 2).

Team CraftJarvis
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
NeurIPS 2023 / Paper / arXiv / Project / Code 
Best paper award, ICML-23 TEACH Workshop

Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
ICCV 2023 / arXiv / Project / Code 
A new 3D-language foundation model.

Team CraftJarvis
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
CVPR 2023 / arXiv / Project / Code 

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang
SQA3D: Situated Question Answering in 3D Scenes
ICLR 2023 / Paper / arXiv / Slides / Project / Code / Benchmark 
A new quest: embodied scene understanding.

Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu and Ying Nian Wu
Latent Diffusion Energy-Based Model for Interpretable Text Modeling
ICML 2022 / Paper / arXiv / Code 

Huaizu Jiang*, Xiaojian Ma*, Weili Nie, Zhiding Yu, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
CVPR 2022 / Paper / Poster / Slides / Project / arXiv / Code / Bibtex 
Oral presentation.

Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
ICLR 2022 / Paper / Poster / Slides / Project / OpenReview / arXiv / Code / Bibtex 

Experience


Contact

jeasinema [at] gmail [dot] com
[Google Scholar]  |  [GitHub]  


© Xiaojian Ma 2026