Link to Full Report (PDF)

Contributions:

  • Development of a mixed reality application for real-time transcription and translation using the Magic Leap 2.
  • Integration of OpenAI’s Whisper model with a Flask backend for accurate multilingual transcription and translation.
  • Quantitative and qualitative evaluation through a comprehensive user study.

Authors: Alejandro Cuadron Lafuente, Elisa Martinez Abad, Ruben Schenk, Sophya Tsubin

Institution: ETH Zurich, Institute for Visual Computing

Overview

The project aimed to bridge language barriers in mixed reality (MR) by developing a real-time transcription and translation application for the Magic Leap 2. The application uses OpenAI’s Whisper model to transcribe and translate spoken language in MR environments, enabling communication among users who speak different languages.

Motivation

With the rise of globalisation, the need for seamless multilingual communication has increased, particularly in MR environments. By integrating advanced transcription and translation technologies, our project seeks to facilitate cross-lingual interactions in MR settings, thereby expanding the applicability of MR systems in diverse user scenarios.

Methodology

  1. Hardware and Framework:

    • Magic Leap 2 was selected for its advanced graphics and spatial computing capabilities.
    • Unity3D served as the primary development environment, with Microsoft’s MRTK3 for MR interaction design.
  2. Software Architecture:

    • The backend was developed using Flask, implementing a RESTful API to handle audio data processing and translation requests.
    • OpenAI’s Whisper model was integrated, focusing on the medium and large variants, which offered the best balance between translation speed and accuracy.
  3. Application Workflow:

    • The application consists of two primary scenes:
      • The main menu for language selection and setup.
      • The translation interface, displaying transcriptions and translations in real time.
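The backend described above can be sketched as a small Flask service. This is a minimal illustration only: the route name (/transcribe), the request fields, and the stubbed transcriber are assumptions, and the real backend would invoke Whisper (e.g. a loaded medium or large model) where the placeholder function is called.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def transcribe_and_translate(audio_bytes: bytes, target_lang: str) -> dict:
    """Placeholder for the Whisper call; the real backend would run the
    model on audio_bytes and return its transcription and translation."""
    return {
        "transcript": "<transcript>",
        "translation": "<translation>",
        "target_lang": target_lang,
    }

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # The Magic Leap client uploads an audio snippet together with the
    # language selected in the main-menu scene.
    audio = request.files["audio"].read()
    target_lang = request.form.get("lang", "en")
    return jsonify(transcribe_and_translate(audio, target_lang))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

In this design the headset only records and displays text, while all heavy model inference stays on the server, which is what makes the thin RESTful interface practical.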

Results

  • Quantitative Analysis:

    • SUS scores averaged 87.62 (SD = 8.35), indicating a high level of user satisfaction.
    • Translation latency was measured at 4-6 seconds per audio snippet, primarily due to processing delays.
    • A moderate positive correlation was observed between prior MR experience and higher SUS scores (r = 0.55, p < 0.01).
  • Qualitative Feedback:

    • Users appreciated the intuitive interface but suggested relocating interactive elements to reduce visual clutter.
    • Translations for less common languages (e.g., Romanian, Icelandic) were praised for accuracy.
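The correlation reported in the quantitative analysis is the standard sample Pearson coefficient. The sketch below shows how such a value is computed; the per-participant scores used here are hypothetical, since the study's raw data are not reproduced in this summary.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: self-reported MR experience (1-5) vs. SUS score (0-100).
experience = [1, 2, 2, 3, 4, 5]
sus_scores = [72.5, 80.0, 85.0, 87.5, 92.5, 95.0]
r = pearson_r(experience, sus_scores)
```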

Conclusion

The MR-ATT project successfully demonstrated the potential of integrating real-time transcription and translation in MR environments using the Magic Leap 2 and OpenAI’s Whisper model. Future work will focus on enhancing spatial awareness in the user interface and extending translation support to target languages beyond English.