Wikipedia, the popular free online encyclopedia, has recently been struggling with AI bots that scrape text and multimedia from the platform to train generative artificial intelligence models, driving up costs and slowing load times for its users. To address this, Wikimedia Enterprise (which manages Wikipedia’s data) on Wednesday released a dataset designed for training AI models, in a bid to deter bots from bombarding its website.
Wikipedia’s dataset to be hosted by Google’s Kaggle: Key details
According to the organisation, the AI dataset consists of “structured Wikipedia content in English and French”, and as of 15 April includes openly licensed research summaries, short descriptions, image links, infobox data, and article sections. Further, it provides simplified access to machine-readable Wikipedia content that is readily available to AI developers for “modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.”
The dataset is hosted on Google-owned Kaggle, an online community for data scientists and machine learning practitioners. Starting with English and French, Wikimedia will offer stripped-down versions of raw Wikipedia text, excluding any references or markdown code. Wikimedia already has content-sharing agreements with Google and the Internet Archive, but this partnership could make the data more accessible to smaller companies and independent data scientists.
Moreover, the dataset is designed to short-circuit scraping. It aims to reduce the need for bots to crawl the site, easing the burden on Wikimedia’s web servers, which have seen a surge of bot activity in the last few months, while giving developers clean, pre-parsed, and developer-friendly data.