Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval
In Proceedings British Machine Vision Conference 2014
AbstractBuilding bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image ("Im2Text"), and (ii) predicting image(s) given a piece of text ("Text2Im"). We make no assumption on the specific form of text; i.e., it could be either a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an "independent" text-corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of ``unannotated'' images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both the tasks.
FilesExtended Abstract (PDF, 1 page, 799K)
Paper (PDF, 13 pages, 1.0M)
Supplemental Materials (ZIP, 89K)