by Mohan S Kankanhalli and Philippe Mulhem
What constitutes the most tedious task a holiday photographer faces after a wonderful vacation abroad? For the typical tourist, this would most likely be the endlessly time-consuming job of organizing video clips and photographs into some semblance of order. Wouldn't it be great if these images could be summoned according to places or faces so the organizer could decide which to pick and what to file? And how about adding some clever captions that bring back happy memories of the trip?
The following scenario could become reality sooner than you think. Mike and Mary come back from their dream honeymoon in Paris, eager to share their wonderful experience with their relatives and friends. Mike has used his video camera enthusiastically if somewhat inexpertly. He has hundreds of hours of footage to scroll through to locate those highlights he wants to show his loved ones.
With something like the DIVA system, Mike can say, "Find the shot that features Mary under the Eiffel Tower" and have the system locate just that shot. With the help of the editing interface's tracking tool and presentation software, he quickly annotates his video footage, thereby generating an impressive segment on Mary and the famous tower.
The Digital Image and Video Album (DIVA) is a system under development by the National University of Singapore's School of Computing, the Communication Langagière et Interaction Personne-Système (CLIPS) Laboratory in Grenoble, France, and the Singapore-based Kent Ridge Digital Laboratory (KRDL). It aims to solve anticipated problems that will result from the massive growth of new digital media as well as the digitization of traditional content. Team members hope to develop techniques for content-based indexing and intuitive access and retrieval of digital images that even the novice user can handle easily.
The research aims to employ an object-based approach, which uses descriptions of the objects and people present in the images to bridge the gap between the content-feature space (such as colors and shapes) and the semantic space (the human point of view). The project aggregates content features to obtain intermediate-level objects; computer-vision and pattern-recognition techniques then attach meaningful descriptions to these objects for the purpose of indexing and retrieving images and videos.
Team members have already obtained promising preliminary results in human-face detection, image segmentation, class-based indexing, and image retrieval. This project has as its goal the development of prototypes of next-generation digital home-photo and home-video albums that can demonstrate the potential utility and efficacy of the research ideas developed.
Digital Video Album
Because of the huge volume of data in video databases, accessing and retrieving video items is a tedious and time-consuming undertaking. Indexing video data facilitates this process: DIVA focuses on developing a truly usable digital home-video-album system that can be bundled with camcorders, allowing even novice users to store, retrieve, compose, and share their home videos.
Existing approaches to organizing video collections fall into two main categories: feature-based and annotation-based indexing. Feature-based indexing provides access to video data based on visual information, such as color, but is inadequate for describing the content of videos. Manual annotation, though more descriptive, tends to be time-consuming and laborious. Other solutions on the market, such as MovieMaker, use free-text annotation with simple shot detection - not quite adequate for precise retrieval. The team members have decided to improve on this method by incorporating advanced features and tools that enable proper indexing of videos.
DIVA aims at developing novel techniques based on pseudo-objects that bridge the gap between syntactic features and annotation-based approaches. Moreover, a flexible indexing language based on XML (eXtensible Markup Language, which allows the easy interchange of documents on the World Wide Web) captures such collection metadata as people's names, locations (where the videos are taken), and so on, and supports the search of home videos. Once the user retrieves a subset of videos from the database, the system will provide flexible tools for composing the results into a presentation that can then be shared over the Web.
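The article does not reproduce DIVA's actual indexing schema, so the fragment below is only a hypothetical sketch of how such XML metadata for a single clip might look; the element and attribute names are invented for illustration:

```xml
<clip id="paris-042">
  <people>
    <person name="Mary"/>
  </people>
  <location>Eiffel Tower, Paris</location>
  <timecode start="00:12:30" end="00:13:05"/>
</clip>
```

Because the metadata is ordinary XML, a query language layered on top of it can match clips by person, place, or time without touching the video data itself.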
Two ways are available to obtain an object-based representation. One identifies objects first, then studies and defines the best features (colors, textures, and shapes) and the best measures used to recognize them. This method creates high-level interpretations for descriptive queries. Thus a critical component of video-content analysis will involve automatic detection and recognition of specific types of objects (like human faces) in video frames. The signature-based approach looks at the content-representation issue from another perspective: signatures consist of such aspects as color, intensity gradient, and their changes, and correspond to certain objects in certain states. The team is currently focusing on both methods.
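The article does not specify how DIVA's signatures are computed, but a minimal sketch of the idea - summarizing a region's appearance as a compact, comparable descriptor - can be given with a joint color histogram and histogram intersection. The function names and the 8-bins-per-channel choice are assumptions for illustration:

```python
import numpy as np

def color_signature(pixels, bins=8):
    """Quantize RGB pixels (N x 3 array, values 0-255) into a joint
    color histogram and normalize it to sum to 1."""
    quantized = (pixels // (256 // bins)).astype(int)  # per-channel bin index
    flat = (quantized[:, 0] * bins + quantized[:, 1]) * bins + quantized[:, 2]
    hist = np.bincount(flat, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def similarity(sig_a, sig_b):
    """Histogram intersection: 1.0 for identical signatures, 0.0 for disjoint."""
    return float(np.minimum(sig_a, sig_b).sum())

# Two synthetic "object regions": one mostly red, one mostly blue.
rng = np.random.default_rng(0)
red_region  = np.clip(rng.normal([200, 30, 30], 10, (1000, 3)), 0, 255)
blue_region = np.clip(rng.normal([30, 30, 200], 10, (1000, 3)), 0, 255)

# Identical regions score near 1.0; differently colored regions near 0.
print(similarity(color_signature(red_region), color_signature(red_region)))
print(similarity(color_signature(red_region), color_signature(blue_region)))
```

A real signature would also capture intensity gradients and their changes over time, as the text notes; the histogram here stands in for that richer descriptor.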
When editing home-video albums, the user also performs indexing and retrieval tasks. The researchers concentrate on presenting automatic low-level tables of contents to facilitate indexing and archiving, having developed an automatic summarizing process of videos based on key-content features. For instance, when a user inserts the face of a person into the database, the system will recognize and track it without any further input. Later, during the editing or retrieval process, the system can utilize this information to retrieve similar parts of a video.
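The article does not detail how DIVA builds its table of contents, but the standard first step - detecting shot boundaries from changes in key content features - can be sketched as follows. The histogram-difference method and the threshold value are illustrative assumptions, not the project's actual algorithm:

```python
import numpy as np

def frame_hist(frame, bins=16):
    """Normalized grayscale intensity histogram of one frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def shot_boundaries(frames, threshold=0.5):
    """Flag a cut wherever the L1 distance between consecutive
    frame histograms exceeds the threshold."""
    cuts = []
    prev = frame_hist(frames[0])
    for i in range(1, len(frames)):
        cur = frame_hist(frames[i])
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Synthetic clip: 10 dark frames followed by 10 bright frames.
dark   = [np.full((4, 4), 20)  for _ in range(10)]
bright = [np.full((4, 4), 230) for _ in range(10)]
print(shot_boundaries(dark + bright))  # [10]
```

Once shots are segmented this way, picking a representative frame from each shot yields the low-level table of contents the text describes.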
The next step is to design XML frameworks for intuitive description of home-video metadata, on which a query language will allow even a lay user to express his or her questions effectively. A strong need exists to link the objects in a video with their semantic associations. Since the feature that distinguishes video is its temporal facet, the process requires semantically labeling objects and propagating those labels throughout the video, using object tracking, without decompressing the video. The Intelligent Video Composition Tool is an authoring tool that allows nontechnical users to define a presentation using their own resources.
To ensure optimal performance in computing power, the team has its processes operate in the "compressed" domain, without decoding the video. The investigators limit the results to ten times the real time so that one video minute takes no more than ten minutes to analyze. Most of the processes can be performed offline, which means that the analyses can be done without the user's presence.
Digital Image Album
With the ubiquity of the digital camera, managing thousands of home photos has indeed become a challenge. The research team plans to develop a powerful yet intuitive digital home-photo album that can enable home users to control their ever-growing collections of digital photos.
Existing digital photo-categorization techniques are still based on either low-level visual features or manual annotation. These techniques are usually difficult to understand and can be particularly trying for the home user. The developers hit on the idea of categorizing photos by the semantic meaning of the images and then linking that semantic meaning to low-level visual features.
Most current approaches to image indexing - FlipAlbum, FotoTime, and other photo album software - have associated text as a base. Structured metadata can come in a predefined format, such as information on people containing their names, birthdays, and so on, which allows efficient retrieval and management. But manual classification takes time, and even experienced indexers cannot anticipate all the potential uses for a particular photograph. With this concern in mind, the team designed the digital-image album to do indexing based on the elements a photograph contains, thereby realizing automatic annotation or semi-automatic annotation.
The team members also seek to provide a flexible searching method. Their categorization and annotation technique can support searching by text and searching by example. They emphasize visualization strategies so that browsing can be done more effectively. For most people, seeing a very large number of photos at once with similar ones in clusters serves as an important feature for browsing. The album can automatically arrange thumbnail images according to their similarities, making use of multidimensional scaling to produce an accurate visualization of a large number of such images.
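The multidimensional scaling mentioned above can be sketched with the classical (Torgerson) variant: given pairwise photo dissimilarities, it places each photo in the plane so that 2-D distances approximate those dissimilarities. The tiny dissimilarity matrix below is invented for illustration; the article does not state which MDS variant DIVA uses:

```python
import numpy as np

def mds_layout(dist, dims=2):
    """Classical multidimensional scaling: embed items in `dims`
    dimensions so Euclidean distances approximate `dist`."""
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (dist ** 2) @ j               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:dims]        # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

# Hypothetical dissimilarities: photos 0 and 1 are near-duplicates,
# photo 2 differs from both.
d = np.array([[0.0, 0.1, 1.0],
              [0.1, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
xy = mds_layout(d)
# Similar photos land close together, so their thumbnails cluster
# naturally when drawn at these coordinates.
```

Drawing each thumbnail at its `xy` coordinate produces exactly the clustered overview the text describes: similar photos form visible groups.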
To achieve effective annotation, efficient content-feature extraction is important. Most current techniques focus only on the global features (for instance, the color histogram of the whole image) or partition-based image features (for instance, splitting the image into 16 squares and then computing a color histogram for each of the squares). Since the investigators pay attention to photograph elements, the emphasis falls on region-based features that pre-segment an image by color and/or texture into regions for matching.
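The partition-based scheme the paragraph mentions - 16 squares, one color histogram each - can be sketched directly; this is the simple baseline the researchers go beyond with region-based segmentation, and the grayscale simplification here is an assumption for brevity:

```python
import numpy as np

def grid_histograms(image, grid=4, bins=8):
    """Partition-based features: split a grayscale image into a
    grid x grid array of blocks (grid=4 gives the 16 squares
    described above) and compute one normalized intensity
    histogram per block."""
    h, w = image.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = image[r * h // grid:(r + 1) * h // grid,
                          c * w // grid:(c + 1) * w // grid]
            hist, _ = np.histogram(block, bins=bins, range=(0, 256))
            feats.append(hist / hist.sum())
    return np.concatenate(feats)   # one feature vector per image

img = np.random.default_rng(1).integers(0, 256, (64, 64))
print(grid_histograms(img).shape)  # (128,) = 16 blocks x 8 bins
```

Region-based features differ in that the blocks are replaced by irregular regions found by color or texture segmentation, which is why they track photograph elements more faithfully than a fixed grid.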
For colors, team members find the adaptive-color histogram very effective in image retrieval and classification. For textures, they have constructed a perceptual feature that allows the computer to estimate differences among textures more consistently than a human being can. For shapes, they match three-dimensional (3-D) models to objects in 2-D images.
The first version of the image- and video-album prototypes already shows great promise, reinforcing the researchers' conviction that concept-based description of visual content is a must for home digital albums. They plan to commercialize it by mid-2003.
This team works with another KRDL group on the imaging aspect of the project. (For more information see INNOVATION Volume 2 Number 2, page 5.) It also considers alternative approaches by including powerful knowledge representation so as to enhance, among other functions, query interaction.
For more information contact Mohan S Kankanhalli at [email protected]
or Philippe Mulhem at [email protected]. Also check out https://diva.comp.nus.edu.sg.