Multimodal Fusion for Image Classification


Conventional image categorization techniques rely primarily on low-level visual cues such as color, shape, and texture. In this paper, we describe a multimodal fusion scheme that improves image classification accuracy by incorporating information derived from embedded text detected in the image being classified. For each image category, a text concept is learned with Multiple Instance Learning from labeled texts, i.e., the text-containing areas identified by the user. For an image under classification that contains multiple detected text lines, we calculate a weighted Euclidean distance between each text line and the learned text concept of the target category. The minimum of these distances is then used jointly with the low-level visual cues as the feature set for Support Vector Machine (SVM)-based classification. Experiments on a challenging image database demonstrate that the proposed fusion framework achieves higher accuracy than state-of-the-art methods for image classification.
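The distance and fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the feature layout, and the `no_text_distance` fallback for images without detected text are assumptions made for illustration, and the learned concept point and per-dimension weights are taken as given (in the paper they come from Multiple Instance Learning).

```python
import math
from typing import List, Sequence


def weighted_euclidean(x: Sequence[float],
                       concept: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Weighted Euclidean distance between one text-line feature vector
    and the learned text concept of a category."""
    return math.sqrt(sum(w * (a - c) ** 2
                         for a, c, w in zip(x, concept, weights)))


def min_text_distance(text_lines: Sequence[Sequence[float]],
                      concept: Sequence[float],
                      weights: Sequence[float],
                      no_text_distance: float = 1e3) -> float:
    """Smallest distance over all detected text lines in an image.

    no_text_distance is a hypothetical fallback for images with no
    detected text; the source does not say how that case is handled.
    """
    if not text_lines:
        return no_text_distance
    return min(weighted_euclidean(line, concept, weights)
               for line in text_lines)


def fused_feature(visual: Sequence[float],
                  text_lines: Sequence[Sequence[float]],
                  concept: Sequence[float],
                  weights: Sequence[float]) -> List[float]:
    """Concatenate low-level visual features with the minimum text
    distance to form the vector passed to the SVM."""
    return list(visual) + [min_text_distance(text_lines, concept, weights)]


# Toy example: two detected text lines, 2-D text features.
concept = [0.0, 0.0]
weights = [1.0, 4.0]
lines = [[3.0, 0.0], [0.0, 1.0]]
feature = fused_feature([0.5, 0.25], lines, concept, weights)
```

The resulting vector (visual cues followed by one text-distance value for the target category) could then be used to train any standard SVM implementation.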

Figure 1. Sample images from the image categorization task with the text detection output

Figure 2. A few examples from the “Camera” category with labeled texts of camera brands and models
• Qiang Zhu, Mei-Chen Yeh and Kwang-Ting Cheng. Multimodal Fusion using Learned Text Concepts for Image Categorization. ACM International Conference on Multimedia 2006 (ACM MM 06), October 23-27, Santa Barbara, USA.

(presentation slides)

imET (Images with Embedded Texts) dataset