Moneycontrol

Wikipedia partners with Kaggle to release AI-friendly dataset and combat bot scraping

Wikipedia is partnering with data science platform Kaggle to discourage bot scraping of its site. The beta dataset, announced earlier this week, features ‘structured Wikipedia content in English and French.’

April 20, 2025 / 14:31 IST

The popular free online encyclopedia Wikipedia has recently been struggling with AI bots that scrape text and multimedia from the platform to train generative artificial intelligence models, leading to increased costs and slower load times for its users. To address this, Wikimedia Enterprise (which manages Wikipedia’s data) moved on Wednesday to release a dataset designed for training AI models, in an effort to deter bots from bombarding its website.

Wikipedia’s dataset to be hosted by Google’s Kaggle: Key details

According to the organisation, the AI dataset consists of “structured Wikipedia content in English and French”, and as of 15 April includes openly licensed research summaries, short descriptions, image links, infobox data, and article sections. Further, it provides simplified access to machine-readable Wikipedia content that is readily available to AI developers for “modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.”

It is being hosted on Google-owned Kaggle, which is an online community for data scientists and machine learning practitioners. Starting with English and French, the foundation will offer stripped-down versions of raw Wikipedia text, excluding any references or markdown code. Wikimedia already has content-sharing agreements with Google and the Internet Archive, but this partnership could make the data more accessible to smaller companies and independent data scientists.