by Dr Liyanage C De Silva
Even though more than a century has passed since Alexander Graham Bell invented the telephone in 1876, it is quite amazing how little the technology has changed.
Most of us are still using our most primeval communication method - voice - to communicate with another party. Video telephony and Internet-based video-conferencing have not had much impact; in fact, a lot of people have reservations about sending their own video images to the other party. It is still a mystery why there is this instinctive aversion to chatting 'face-to-face' with someone else.
However, it is a different story with text-based chat systems. These are growing at a tremendous rate on the Internet today. This interesting aspect of human behaviour suggests that people are uncomfortable about revealing their emotional state, and can converse more freely if they hide their identity.
Text-based communication, though, can be monotonous and tedious. For avid Internet chatters jaded by text-based chat systems, help is at hand.
Research is being carried out at the National University of Singapore's Department of Electrical and Computer Engineering to design and implement a three-dimensional (3D) model-based facial animation system that can be incorporated into a 3D visual chat environment. A 3D avatar or graphical image is created to represent the user, complete with expressions, while he or she conducts an online conversation.
The interactive 3D model-based text-to-audio-visual synthesis (TTAVS) system can be an alternative for low bandwidth video-conferencing or informal chat sessions.
The system incorporates a 3D model of a human head with facial animation parameters (emotion parameters) and speech producing capabilities (lip-sync). At the transmitter side, the user inputs text sentences via the keyboard, which are sent through the communication channel to the correspondent's PC. At the receiving end, the system converts incoming text into speech. The receiver sees a 3D head model - with appropriate facial emotions and lip movements - and hears speech corresponding to the text sent.
The user can use a predefined set of symbols to express certain emotions, which are in turn reproduced at the receiving end. Thus, the chat session is enhanced, although it cannot match the quality of high bandwidth video-conferencing.
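The symbol-based emotion input described above can be sketched as follows. This is only an illustration in Python: the article does not list the actual symbols or emotion parameters, so the `EMOTION_SYMBOLS` table, the `parse_message` function and the rule that a symbol applies to the text following it are all assumptions made for the example.

```python
import re

# Hypothetical symbol table; the real system's symbols are not given in the article.
EMOTION_SYMBOLS = {
    ":)": "happy",
    ":(": "sad",
    ":o": "surprised",
    ">:(": "angry",
}

# Try longer symbols first so that ">:(" is matched before ":(".
_PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(EMOTION_SYMBOLS, key=len, reverse=True))
)


def parse_message(text, default="neutral"):
    """Split a typed message into (emotion, speech_text) segments.

    Assumption: each symbol sets the emotion for the text that follows it.
    """
    segments = []
    emotion = default
    pos = 0
    for match in _PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((emotion, chunk))
        emotion = EMOTION_SYMBOLS[match.group()]
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((emotion, tail))
    return segments


if __name__ == "__main__":
    for emotion, speech in parse_message("Hello there :) nice to see you"):
        print(emotion, "->", speech)
```

At the receiving end, each segment's text would go to the speech synthesiser while its emotion label selects the facial expression of the head model.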
There are advantages to this approach. It eliminates existing problems in video-conferencing due to transfers of large data packets, while still providing a reasonably natural image appearance.
The system can work in a similar way in duplex mode, which basically allows the transmitter and the receiver to switch roles at will. The entire process can also be implemented as a virtual chat room with more than two users.
The visual chat system could also be applied in the classroom, where teaching can be delivered in a more interesting manner. In long-distance learning, students could interact online with their teachers, with exchanges expressed through 3D avatars.
The system can also be fine-tuned for computer game enthusiasts to create virtual worlds and 3D scenarios that are more interactive and realistic.
The researchers have already addressed the lip-synchronisation issues and developed a fully working 3D model of human lips with a database of the most common lip shapes.
Non-Uniform Rational B-Spline (NURBS) surfaces were used to model the lips and the 3D face. NURBS are a mathematical representation used in computer-based 3D modelling to describe contours and shapes. In NURBS modelling, the surface is not defined by joining points in 3D space, as in a conventional polygonal 3D model, but is given its shape by control points. When the control points are moved, the shape of the NURBS surface changes with them while the underlying surface stays smooth.
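To make the role of control points concrete, here is a minimal Python sketch that evaluates a point on a NURBS curve using the standard Cox-de Boor recursion. It is a simplification for illustration: a curve in 2D rather than the surfaces used by the researchers, and the function names and sample control points are invented for the example.

```python
def basis(i, p, u, knots):
    """Cox-de Boor recursion: i-th B-spline basis function of degree p at u."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + p] - knots[i]
    if d1 > 0:  # skip zero-width spans (repeated knots)
        left = (u - knots[i]) / d1 * basis(i, p - 1, u, knots)
    d2 = knots[i + p + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + p + 1] - u) / d2 * basis(i + 1, p - 1, u, knots)
    return left + right


def nurbs_point(u, control_points, weights, knots, degree):
    """Evaluate a NURBS curve at u, for knots[degree] <= u < knots[-degree - 1]."""
    numerator = [0.0] * len(control_points[0])
    denominator = 0.0
    for i, (point, weight) in enumerate(zip(control_points, weights)):
        b = basis(i, degree, u, knots) * weight
        denominator += b
        numerator = [n + b * c for n, c in zip(numerator, point)]
    return [n / denominator for n in numerator]


if __name__ == "__main__":
    knots = [0, 0, 0, 1, 1, 1]          # quadratic curve with three control points
    weights = [1.0, 1.0, 1.0]
    points = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
    print(nurbs_point(0.5, points, weights, knots, 2))

    # Moving the middle control point pulls the curve with it, smoothly.
    moved = [(0.0, 0.0), (1.0, 4.0), (2.0, 0.0)]
    print(nurbs_point(0.5, moved, weights, knots, 2))
```

A surface works the same way with a grid of control points and two parameters instead of one, so reshaping a lip comes down to moving a handful of control points rather than editing every vertex of a polygon mesh.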
The various parameters required for a realistic appearance of the lips were obtained from video clips of a natural speaker. The entire system has been designed with existing sound wave editors and 3D modelling and animation software. The final NURBS model of the human head is incorporated into an interactive model-based visual chat environment.
The research team is also exploring the possibility of using the Festival Speech system at the input end of the system to convert text into speech.
The Festival Speech system is text-to-speech software developed at the University of Edinburgh that can extract intermediate phoneme information for lip-sync. The researchers are looking into the design and methodology for a text-to-audio-visual system that takes a string of phonemes, or speech sounds, as input and outputs them in the form of "talking lips".
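The step from phonemes to "talking lips" can be pictured with a small Python sketch. Everything specific in it is hypothetical: the phoneme labels, the lip-shape names, the `lip_keyframes` function and the (phoneme, duration) input format are assumptions for illustration, and do not reproduce Festival's actual output or the researchers' lip-shape database.

```python
# Hypothetical mapping from phonemes to entries in a lip-shape database.
PHONEME_TO_LIP_SHAPE = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "aa": "open_wide", "ae": "open_wide",
    "iy": "spread", "eh": "spread",
    "uw": "rounded", "ow": "rounded",
}


def lip_keyframes(phonemes, default_shape="neutral"):
    """Turn (phoneme, duration_seconds) pairs into (start_time, lip_shape) keyframes.

    Consecutive phonemes that share a lip shape are merged into one keyframe.
    """
    keyframes = []
    t = 0.0
    for phoneme, duration in phonemes:
        shape = PHONEME_TO_LIP_SHAPE.get(phoneme, default_shape)
        if not keyframes or keyframes[-1][1] != shape:
            keyframes.append((round(t, 3), shape))
        t += duration
    return keyframes


if __name__ == "__main__":
    # A made-up phoneme string with made-up durations.
    sample = [("m", 0.1), ("aa", 0.2), ("m", 0.1), ("b", 0.05), ("uw", 0.15)]
    for start, shape in lip_keyframes(sample):
        print(start, shape)
```

An animation layer would then blend the lip model from one keyframe's shape to the next, in time with the synthesised audio.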
The use of NURBS is a relatively new concept in 3D lip modelling, and a recent demonstration of the work has attracted the interest of a number of 3D graphics companies. The concept is also being extended to computer graphics-based sign language generation, in collaboration with the Human Interface Engineering Lab at the University of Osaka, Japan.
The main idea here is to convert typed text or spoken words into animated computer graphics depicting sign language. This concept can be used in communication with hearing-impaired people.
Dr Liyanage C De Silva is an Assistant Professor at the Department of Electrical and Computer Engineering, NUS.
Hari Gurung is a Master of Engineering student working with De Silva on the project.
For more information, contact De Silva at: [email protected] or check out: https://face.ee.nus.edu.sg