Wikipedia partners with Kaggle to release AI-friendly dataset and combat bot scraping

Wikipedia is partnering with data science platform Kaggle to discourage bots from scraping its site. The beta dataset, announced earlier this week, features ‘structured Wikipedia content in English and French.’

April 20, 2025 / 14:31 IST

Wikipedia, the popular free online encyclopedia, has recently been struggling with AI bots that scrape text and multimedia from the platform to train generative artificial intelligence models, driving up costs and slowing load times for its users. To address this, Wikimedia Enterprise (which manages Wikipedia’s data) on Wednesday moved to release a dataset designed for training AI models, with the aim of deterring bots from bombarding its website.

Wikipedia’s dataset to be hosted by Google’s Kaggle: Key details

According to the organisation, the AI dataset consists of “structured Wikipedia content in English and French”. As of 15 April, it includes openly licensed research summaries, short descriptions, image links, infobox data, and article sections. It also gives AI developers simplified access to machine-readable Wikipedia content for “modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.”

It is being hosted on Google-owned Kaggle, which is an online community for data scientists and machine learning practitioners. Starting with English and French, the foundation will offer stripped-down versions of raw Wikipedia text, excluding any references or markdown code. Wikimedia already has content-sharing agreements with Google and the Internet Archive, but this partnership could make the data more accessible to smaller companies and independent data scientists.

The dataset is designed to make such scraping unnecessary. It is meant to lower the burden on Wikimedia’s web servers, which have seen a surge of bot activity in the last few months, and to give developers clean, pre-parsed, developer-friendly data in place of scraped pages.

Sandip Chakraborty
first published: Apr 20, 2025 02:30 pm
