Apple’s work on lightweight AI continues to impress. The company’s FastVLM (Vision Language Model), first announced a few months ago, can now be tried directly in the browser via Hugging Face. The model was originally available only through GitHub and designed to run on Apple Silicon Macs, so the browser demo makes it easier than ever to see it in action.
FastVLM was built on MLX, Apple’s in-house machine learning framework for Apple Silicon. The model stands out for its efficiency: Apple reports up to 85 times faster time-to-first-token (the delay before a caption starts appearing) and a vision encoder more than three times smaller than comparable models. The browser demo uses the lighter FastVLM-0.5B variant, which keeps hardware demands modest enough for ordinary machines.
Loading the model can take a few minutes depending on your device, but once running, it generates accurate, real-time captions. It can describe facial expressions, background details, objects in view, and even respond to tailored prompts like “Describe what you see in one sentence” or “What is the color of my shirt?”
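The browser demo needs no setup, but the same idea maps onto a very short script for anyone who wants to drive the small checkpoint locally on an Apple Silicon Mac. The sketch below is a rough illustration under stated assumptions: it uses the community mlx-vlm package rather than Apple’s own tooling, and the model identifier, argument names, and return type are assumptions that may not match the current release.

```python
# Illustrative sketch only: assumes the community mlx-vlm package and the
# checkpoint name "apple/FastVLM-0.5B", neither of which is confirmed by the
# article; exact call signatures may differ between mlx-vlm releases.
from mlx_vlm import load, generate

MODEL_ID = "apple/FastVLM-0.5B"  # assumed Hugging Face identifier

# load() fetches the weights and the matching processor for the checkpoint.
model, processor = load(MODEL_ID)

# The same style of tailored prompt the browser demo accepts.
prompt = "Describe what you see in one sentence."

# Run one image plus one text prompt through the model; the decoded caption
# is returned (as a plain string in older mlx-vlm releases).
caption = generate(
    model,
    processor,
    prompt,
    image=["photo_from_webcam.jpg"],  # placeholder path to a local frame
    max_tokens=64,
    verbose=False,
)
print(caption)
```

Any of the tailored prompts mentioned above, such as “What is the color of my shirt?”, could be substituted for the prompt string.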
Because it runs locally in the browser, the demo keeps all data on-device and can even function offline, an approach with strong potential for wearables and assistive tech, where speed and privacy are critical.
While the demo showcases the smaller 0.5B model, Apple has also released larger FastVLM variants with 1.5B and 7B parameters. These could deliver even better performance, though running them entirely in-browser would be less practical.