
One-Day Meeting: BMVA Symposium on Multimodal Large Models: Bridging Vision, Language, and Beyond
Wednesday 5 November 2025
Chairs: Jian Hu (Queen Mary, University of London), Prof Jun Liu (Lancaster University), Dr Ziquan Liu (Queen Mary, University of London), Dr Wei Zhou (Cardiff University)
Queries? Contact the meeting's organiser, Andrew Gilbert.
Invited Speakers
- Prof. Dima Damen, University of Bristol and Google DeepMind
- Dr. Xiatian Zhu, University of Surrey
- Prof. Niloy Mitra, University College London
Overview of the Meeting
We are excited to invite researchers, engineers, and practitioners from academia and industry to participate in a one-day symposium focused on Multimodal Learning. This event aims to bring together diverse voices working on the next generation of intelligent systems that integrate vision, language, audio, and other modalities. We welcome submissions spanning early-stage ideas, in-progress studies, and previously published work that align with the symposium’s theme.
This workshop will explore how models can learn from and reason across different modalities to achieve richer semantic understanding, robust generalisation, and responsible deployment. We especially encourage contributions that push the boundaries of current multimodal systems, both in theoretical foundations and real-world applications.
Programme
Start | End | Title
---|---|---
09:30 | 09:45 | Registration / Poster Set-up + Coffee
09:45 | 09:50 | Opening Remarks
09:50 | 10:30 | Invited Keynote Speaker – Dima Damen
10:30 | 11:10 | Invited Keynote Speaker – Niloy Mitra
11:10 | 11:30 | Coffee Break + Posters
11:30 | 12:30 | Accepted Talks Pt. 1 – Advancing 3D, Video, and Multimodal Understanding
12:30 | 13:30 | Lunch + Posters
13:30 | 14:10 | Invited Keynote Speaker – Xiatian Zhu
14:10 | 15:10 | Accepted Talks Pt. 2 – Reliability, Learning Frameworks, and Model Adaptation
15:10 | 15:30 | Coffee Break + Posters
15:30 | 16:30 | Accepted Talks Pt. 3 – Domain Applications and Human-Centric Modalities
16:30 | 16:55 | Panel Discussion + Q&A
16:55 | 17:00 | Closing Remarks
Accepted Talks
10 minutes each + 2 minutes for questions
Oral 1: Advancing 3D, Video, and Multimodal Understanding
Hunar Batra – SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Ye Mao – Hypo3D: Exploring Hypothetical Reasoning in 3D
Zengqun Zhao – LatSearch: Latent Reward-Guided Inference Time Search for Scaling Video Generation
Kevin Qinghong Lin – Beyond Video Understanding: How Video can advance research and education
Sam Pollard – A Video Is Not Worth a Thousand Words
Oral 2: Reliability, Learning Frameworks, and Model Adaptation
Pramit Saha – Federated Finetuning of Vision-Language Foundation Models
Jiankang Deng – RICE: Region-Aware Cluster Embedding for Vision Representation Learning
Zhaohan Zhang – GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
TBC
TBC
Oral 3: Domain Applications and Human-Centric Modalities
Edward Fish – Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation
Yinghao Ma – MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
TBC
TBC
TBC
Important: This is an in-person event held at the British Computer Society, with no virtual presentation option. We kindly ask all presenters to join us on-site.
We look forward to seeing your work!
Meeting Location
The meeting will take place at:
British Computer Society (BCS), 25 Copthall Avenue, London EC2R 7BP
Registration
We keep the cost of attending these events as low as possible to ensure there are no barriers to attendance for the whole computer vision community. The registration costs are as follows:
All Attendees: £30
Including lunch and refreshments for the day