
DeepSeek introduces sparse attention model to slash API costs

DeepSeek has launched an experimental model, V3.2-exp, introducing DeepSeek Sparse Attention to cut long-context inference costs by nearly half, with open weights available on Hugging Face for testing.

September 30, 2025 / 14:52 IST

DeepSeek has unveiled an experimental model named V3.2-exp, designed to lower inference costs significantly in long-context operations. The company announced the release on Hugging Face and published a linked research paper on GitHub.

At the core of the new model is DeepSeek Sparse Attention, a mechanism built around two key systems. First, a “lightning indexer” identifies relevant excerpts from the broader context window. Then, a “fine-grained token selection system” narrows down tokens within those excerpts to fit into the limited attention window. This combination allows the model to process long stretches of text with reduced computational load.
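To make the two-stage idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration of the general coarse-then-fine sparse attention pattern described above, not DeepSeek's actual implementation: the function names, the block-based indexing, and the `block_size`, `top_blocks`, and `top_tokens` parameters are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, K, V, block_size=64, top_blocks=4, top_tokens=128):
    """Two-stage sparse attention sketch (hypothetical, for illustration).

    Stage 1 (the "lightning indexer" role): cheaply score coarse blocks
    of the context against the query and keep only the best blocks.
    Stage 2 (the "fine-grained token selection" role): within those
    blocks, keep the top-scoring individual tokens, then run ordinary
    softmax attention over that small subset.
    """
    n, d = K.shape

    # Stage 1: one cheap relevance score per block (mean key dotted with query).
    n_blocks = (n + block_size - 1) // block_size
    block_scores = np.array([
        q @ K[i * block_size:(i + 1) * block_size].mean(axis=0)
        for i in range(n_blocks)
    ])
    keep_blocks = np.argsort(block_scores)[-top_blocks:]

    # Gather candidate token indices from the selected blocks only.
    candidates = np.concatenate([
        np.arange(i * block_size, min((i + 1) * block_size, n))
        for i in keep_blocks
    ])

    # Stage 2: exact scores for candidates only, then keep the top tokens.
    cand_scores = K[candidates] @ q
    keep = candidates[np.argsort(cand_scores)[-top_tokens:]]

    # Ordinary scaled dot-product attention over the reduced token set.
    weights = softmax(K[keep] @ q / np.sqrt(d))
    return weights @ V[keep]

# Example: attend over a 4096-token context while running full attention
# on only 128 of those tokens.
rng = np.random.default_rng(0)
n, d = 4096, 64
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = sparse_attention(q, K, V)
print(out.shape)  # (64,)
```

The saving in this sketch comes from Stage 2: full dot-product attention runs over only `top_tokens` entries instead of the entire context, which is the kind of reduction that would translate into cheaper long-context API calls.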


Early tests suggest that the approach can cut the price of API calls by nearly half on long-context tasks. While these figures still need independent verification, the open-weight release on Hugging Face means researchers and developers can put them to the test quickly.

This development comes as part of a wider effort to address inference costs, which are distinct from training expenses and relate to the server resources needed to run a pre-trained model. DeepSeek’s work shows that improvements to the transformer architecture are still possible, even in areas many thought had plateaued.