The British Machine Vision Association : Trustworthy Multimodal Learning with Foundation Models: Bridging the Gap between AI Research and Real World Applications. (24th April 2024)

One Day Meeting: Trustworthy Multimodal Learning with Foundation Models: Bridging the Gap between AI Research and Real World Applications

Wednesday 24 April 2024

Chairs: Chao Zhang (Toshiba Europe Ltd), Jindong Gu (University of Oxford), Shitong Sun (Queen Mary University of London), Onay Urfalioglu (Vivo Tech GmbH)


We invite academic and industry presentations, bringing together researchers interested in all aspects of foundation models (GPT-4, CLIP, SAM, etc.) and multimodal learning involving, but not limited to, image, video, audio, depth, text, drawings, laser, and IMU data.

Please register via CharitySuite: Register Here

Invited Speakers

Guohao Li (University of Oxford & CAMEL-AI.org)
Oleg Sinavski (Wayve, London)
Da Li (Samsung Research)
Rudra Poudel (Toshiba Europe)
Ashkan Khakzar (University of Oxford)

Programme

Start	End	Title
09:00	09:15	Registration/Poster Set-up
09:15	09:20	Opening Remarks
09:20	10:00	Invited Speaker - Guohao Li
10:00	10:40	Invited Speaker - Oleg Sinavski
10:40	11:05	Coffee Break + Posters
11:05	12:20	Accepted Talks - Pt. 1
12:20	13:20	Lunch + Posters
13:20	14:00	Invited Speaker - Da Li
14:00	15:15	Accepted Talks - Pt. 2
15:15	15:40	Coffee Break + Posters
15:40	16:20	Invited Speaker - Rudra Poudel
16:20	17:00	Invited Speaker - Ashkan Khakzar
17:00	17:05	Past, Present, and Future of Vision-Language

Talk Part 1 (15 mins each)

Hang Dai (University of Glasgow) Multimodal BEV Fusion for Autonomous Driving
Zhening Huang (University of Cambridge) OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Jian Hu (Queen Mary University of London) Is Instance-specific Manual Prompt Necessary for Promptable Semantic Segmentation?
Xingchen Zhang (Imperial College London) Self-supervised RGBT tracking with Cross-input consistency
Yongshuo Zong (University of Edinburgh) Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models

Talk Part 2 (15 mins each)

Chengzu Li (University of Cambridge) On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning
Ziquan Liu (Queen Mary University of London) Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity
Yimeng Gu (Queen Mary University of London) Domain Adaptive Multimodal Out-of-context News Detection
Yinghao Ma (Queen Mary University of London) MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
Chen Chen (University of Sheffield) Unlocking the Value of Single Modality through Multi-Modal Knowledge Transfer with Large Language Models

Posters

Cangxiong Chen (University of Bath) Understanding the Vulnerability of CLIP to Image Compression
Charlie Grimshaw (University of Sheffield) Using Large Vision Language Models to detect Propaganda Techniques in memes
Dean Slack (Durham University) Enhancing Next-Frame Video Prediction through Linguistic Scene Understanding
Anum Masood (Harvard Medical School) Advancing Accuracy in Multimodal Medical Tasks through Bootstrapped Language-Image Pretraining (BLIP)

Meeting Location

The meeting will take place at:

British Computer Society (BCS), 25 Copthall Avenue, London EC2R 7BP

Registration

We keep the cost of attending these events as low as possible so that there are no barriers to attendance for the whole computer vision community. The registration costs are as follows:

BMVA Members: £20
Non-BMVA Members: £40 (includes membership of the BMVA for 2024)

Both rates include lunch and refreshments for the day.