Welcome to my academic homepage. I am an AI researcher. I received my Ph.D. in Computer Science from the University of California, Los Angeles (UCLA), and completed my undergraduate study in Computer Science at Tsinghua University.

I work on multimodal learning for understanding, reasoning, and skill learning. In particular, I'm interested in building models and agents that can learn from 2D/3D vision and text data and perform a wide range of reasoning and embodied control tasks. Some of my research keywords can be found below:

  • Multimodal learning: Vision and language, Visual reasoning, 3D vision, Generalist models
  • Representation learning: Zero-shot and few-shot learning, Generative models
  • Embodied agents: Reinforcement learning and imitation, Robotics, Sensor fusion
Email: jeasinema [at] gmail [dot] com / Google Scholar / LinkedIn

News


Selected Publications


Preprint

Jun Guo*, Xiaojian Ma*, Yue Fan, Huaping Liu, Qing Li
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting
arXiv preprint / arXiv / Project

Team CraftJarvis
🐭 RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
arXiv preprint / arXiv / Project / Demo 
Significant improvements on creative writing, code generation, and math problems.

Team CraftJarvis
JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
arXiv preprint / arXiv / Project / Code 
Embodied RAG meets open-world agents.

Sirui Xie, Xiaojian Ma, Peiyu Yu, Yixin Zhu, Ying Nian Wu and Song-Chun Zhu
HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving
arXiv preprint / Paper / Click Here to Play HALMA! / arXiv


Conference

Haozhe Zhao*, Xiaojian Ma*, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, Baobao Chang
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
NeurIPS 2024 D&B Track / arXiv / Demo / Project / Code
Free-form and region-based image editing made easy with language.

Team CraftJarvis
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
NeurIPS 2024 / arXiv / Project / Code 
Native modeling of multimodal interaction (VLA) data.

Yue Fan*, Xiaojian Ma*, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
ECCV 2024 / arXiv / Project
Zero-shot long-form video understanding, shrinking the gap to Gemini.

Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li
Unifying 3D Vision-Language Understanding via Promptable Queries
ECCV 2024 / arXiv / Project / Code 
Unifying open-vocabulary perception and reasoning in the 3D world.

Xiaojian Ma*, Jiangyong Huang*, Silong Yong*, Xiongkun Linghu*, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
LEO: An Embodied Generalist Agent in 3D World
ICML 2024 / arXiv / Project / Code / Demo / Video 

Ran Gong*, Xiaojian Ma*, Qiuyuan Huang*, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao
MindAgent: Emergent Gaming Interaction
NAACL 2024 Findings / Paper / arXiv / Project / Code
A benchmark and supporting infrastructure for LLMs in general multi-player gaming.

Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, Qing Li
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
CVPR 2024 / arXiv / Project
A self-improving language agent that sharpens its tools.

Team CraftJarvis
GROOT: Learning to Follow Instructions by Watching Gameplay Videos
ICLR 2024 / Paper / arXiv / Project / Code 
Spotlight presentation. Closing the human-machine gap on instruction following, as measured by Elo rating.

Haozhe Zhao*, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
ICLR 2024 / Paper / arXiv / Demo / Code
Ranked 1st on MMBench and MME as of August 2023.

Rujie Wu*, Xiaojian Ma*, Qing Li, Wei Wang, Zhenliang Zhang, Song-Chun Zhu, Yizhou Wang
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
ICLR 2024 / Paper / arXiv / Project / Code 
The third chapter of the Bongard trilogy, for the LM era (chapter 1, chapter 2).

Team CraftJarvis
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
NeurIPS 2023 / Paper / arXiv / Project / Code 
Best paper award, ICML 2023 TEACH Workshop

Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
ICCV 2023 / arXiv / Project / Code 
A new 3D-language foundation model.

Team CraftJarvis
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
CVPR 2023 / arXiv / Project / Code 

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang
SQA3D: Situated Question Answering in 3D Scenes
ICLR 2023 / Paper / arXiv / Slides / Project / Code / Benchmark 
A new quest: embodied scene understanding.

Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu and Ying Nian Wu
Latent Diffusion Energy-Based Model for Interpretable Text Modeling
ICML 2022 / Paper / arXiv / Code 

Huaizu Jiang*, Xiaojian Ma*, Weili Nie, Zhiding Yu, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
CVPR 2022 / Paper / Poster / Slides / Project / arXiv / Code / Bibtex 
Oral presentation

Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
ICLR 2022 / Paper / Poster / Slides / Project / OpenReview / arXiv / Code / Bibtex 

Peiyu Yu, Sirui Xie, Xiaojian Ma, Yixin Zhu, Ying Nian Wu and Song-Chun Zhu
Unsupervised Foreground Extraction via Deep Region Competition
NeurIPS 2021 / Paper / arXiv / Code 

Mingxuan Jing, Wenbing Huang, Fuchun Sun, Xiaojian Ma, Tao Kong, Chuang Gan and Lei Li
Adversarial Option-Aware Hierarchical Imitation Learning
ICML 2021 / Paper / arXiv / Code 
Spotlight presentation

Mark Edmonds, Xiaojian Ma, Siyuan Qi, Yixin Zhu, Hongjing Lu and Song-Chun Zhu
Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning
AAAI 2020 / Paper / Project Page / arXiv / Code 
Oral presentation

Xiaojian Ma*, Mingxuan Jing*, Wenbing Huang, Fuchun Sun, Bin Fang and Huaping Liu
Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance
AAAI 2020 / Paper / Project Page / arXiv / Code 
Also in the Structure & Priors in Reinforcement Learning Workshop @ ICLR 2019

Xiaojian Ma*, Chao Yang*, Wenbing Huang*, Fuchun Sun, Huaping Liu, Junzhou Huang and Chuang Gan
Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement
NeurIPS 2019 / Paper / Project Page / arXiv / Code 
Spotlight presentation

Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun and Jianwei Zhang
Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring
IROS 2019 / Paper / Project Page / arXiv / Code / Video 

Xiaojian Ma*, Hongzhuo Liang*, Shuang Li, Michael Görner, Song Tang, Bin Fang, Fuchun Sun and Jianwei Zhang
PointNetGPD: Detecting Grasp Configurations from Point Sets
ICRA 2019 / Paper / Project Page / arXiv / Code / Video 

Xiaojian Ma*, Shuang Li*, Hongzhuo Liang, Michael Görner, Philipp Ruppel, Bin Fang, Fuchun Sun and Jianwei Zhang
Vision-based Teleoperation of Shadow Dexterous Hand using End-to-End Deep Neural Network
ICRA 2019 / Paper / Project Page / arXiv / Code / Video 

Xiaojian Ma*, Mingxuan Jing*, Wenbing Huang, Fuchun Sun and Huaping Liu
Task Transfer by Preference-Based Cost Learning
AAAI 2019 / Paper / Project Page / arXiv / Code 
Spotlight presentation

Experience


Contact

jeasinema [at] gmail [dot] com
[Google Scholar] | [GitHub]


© Xiaojian Ma 2024