|
Multimodal Fusion for Image Classification |
| Description |
|
Conventional image categorization techniques primarily rely
on low-level visual cues, such as color, shape, and texture. In this paper, we
describe a multimodal fusion scheme which improves image classification
accuracy by incorporating the information derived from embedded text
detected in the image being classified. For each image category, a
text concept is learned with Multiple Instance Learning from labeled texts,
that is, user-identified image regions that contain text. For an image under
classification that contains multiple detected text lines, we calculate a
weighted Euclidean distance between each text line and the learned text
concept of the target category. The minimum of these distances is then used
jointly with the low-level visual cues as the feature set for Support Vector
Machine (SVM)-based classification. Experiments on a challenging image
database demonstrate that the proposed fusion framework achieves higher
accuracy than state-of-the-art methods for image classification.
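
The fusion step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `weighted_euclidean`, `min_text_distance`, and `fuse_features` are hypothetical, as is the representation of each text line as a numeric feature vector. The distance is assumed to take the form sqrt(sum_k w_k (x_k - c_k)^2) with per-dimension weights learned alongside the concept, and the fallback value for images with no detected text is also an assumption.

```python
import math
from typing import List, Sequence


def weighted_euclidean(x: Sequence[float], concept: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Weighted Euclidean distance: sqrt(sum_k w_k * (x_k - c_k)^2)."""
    if not (len(x) == len(concept) == len(weights)):
        raise ValueError("feature, concept and weight vectors must have equal length")
    return math.sqrt(sum(w * (a - c) ** 2 for a, c, w in zip(x, concept, weights)))


def min_text_distance(text_lines: Sequence[Sequence[float]],
                      concept: Sequence[float],
                      weights: Sequence[float],
                      no_text_distance: float = 1e6) -> float:
    """Smallest distance between any detected text line and the category concept.

    Assumption: an image with no detected text gets a large fallback distance.
    """
    if not text_lines:
        return no_text_distance
    return min(weighted_euclidean(t, concept, weights) for t in text_lines)


def fuse_features(visual: Sequence[float],
                  text_lines: Sequence[Sequence[float]],
                  concept: Sequence[float],
                  weights: Sequence[float]) -> List[float]:
    """Append the minimum text distance to the low-level visual features."""
    return list(visual) + [min_text_distance(text_lines, concept, weights)]


if __name__ == "__main__":
    concept = [1.0, 0.0]                    # hypothetical learned text concept
    weights = [4.0, 1.0]                    # hypothetical per-dimension weights
    text_lines = [[1.0, 3.0], [2.0, 0.0]]   # two detected text lines
    visual = [0.2, 0.5]                     # placeholder visual features
    print(fuse_features(visual, text_lines, concept, weights))  # [0.2, 0.5, 2.0]
```

The fused vector would then be passed to an SVM trained for the target category. Any standard SVM library could fill that role, and it is left out here to keep the sketch dependency-free.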
Figure: sample images from the categorization task, shown with the text detection output.
Figure: examples from the “Camera” category with labeled texts of camera brands and models. |
| Publication |
| • Qiang Zhu, Mei-Chen Yeh and Kwang-Ting Cheng. Multimodal
Fusion using Learned Text Concepts for Image Categorization. ACM
International Conference on Multimedia 2006 (ACM MM 06), October
23-27, (presentation slides) |
| Dataset |
| • imET (Images with Embedded Texts) dataset |