Alibaba's Qwen3-Omni aims to outperform Google’s Nano Banana and OpenAI's GPT-4o

Developed under Alibaba Cloud’s Qwen team, Qwen3-Omni is described as the company’s first native end-to-end multimodal platform. In benchmark tests, two of its variants reportedly outperformed GPT-4o and Gemini 2.5-Flash in tasks such as audio recognition and comprehension, as well as image and video understanding.

September 24, 2025 / 12:41 IST
Alibaba Qwen

Alibaba has introduced a new flagship artificial intelligence model, Qwen3-Omni, positioning it against OpenAI’s GPT-4o and Google’s Gemini 2.5-Flash (“Nano Banana”). The multimodal system, launched Tuesday, is designed to handle text, image, audio, and video inputs in one unified model and respond with both text and speech.

Developed by Alibaba Cloud’s Qwen team, Qwen3-Omni is described as the company’s first native end-to-end multimodal platform, according to a report by the South China Morning Post. In benchmark tests, two of its variants reportedly outperformed GPT-4o and Gemini 2.5-Flash in tasks such as audio recognition and comprehension, as well as image and video understanding.

According to the report, Lin Junyang, a researcher on the project, credited the improvements to large-scale datasets and foundational work in audio processing. “This year, our audio team has spent great efforts on building large-scale audio datasets for both pretraining and post-training,” Lin was quoted as saying.

The model accepts text input in 119 languages and speech input in 19 languages, including English, Chinese, Japanese, Arabic, Spanish, and Urdu. It can generate speech in 10 languages, among them English, Chinese, French, German, and Japanese. In a demonstration, Alibaba showed how devices equipped with cameras, microphones, and speakers could use Qwen3-Omni to perceive their surroundings and respond with natural-sounding speech, according to the report.