The MUSTER project is a fundamental pilot research project which introduces a new multi-modal framework for the machine-readable representation of meaning. MUSTER focuses on exploiting visual and perceptual input, in the form of images and videos, coupled with the textual modality to build structured multi-modal semantic representations for the recognition of objects and actions and of their spatial and temporal relations. The project will investigate whether such novel multi-modal representations improve the performance of automated human language understanding (HLU). MUSTER starts from the current state-of-the-art approach to human language representation learning, known as text embeddings, and adds the visual modality to provide contextual world knowledge that text-only models lack but that humans draw on when understanding language. MUSTER will propose a new pilot framework for joint representation learning from text and vision data, tailored for spatial and temporal language processing. The resulting framework will be evaluated on a series of HLU tasks (semantic textual similarity and disambiguation, spatial role labeling, zero-shot learning, and temporal action ordering) which closely mimic the processes of human language acquisition and understanding.
MUSTER will rely on recent advances in multiple research disciplines, spanning natural language processing, computer vision, machine learning, representation learning, and human language technologies, and will combine them to build structured machine-readable multi-modal representations of spatial and temporal language phenomena.
Contact: Prof. Patrick GALLINARI (coordinator), email@example.com