Google and Harvard debut dataset with 1m public domain books for AI training

Cryptopolitan, 2024/12/13 09:11

By Enacy Mapakame

In this post: The initiative will give AI firms access to more data for training their models. OpenAI and Microsoft funded the Harvard project. The nearly one million books span genres and were scanned as part of the Google Books program.

Harvard University, in conjunction with Google, has released a dataset of a million public domain books to train the next generation of AI.

The books span genres, languages, and authors, including Dickens, Dante, and Shakespeare, whose works are no longer protected by copyright because of their age. The initiative comes at a time when AI training data is expensive and largely accessible only to tech firms with deep pockets.

Harvard got financial backing from tech giants

According to a TechCrunch article, the initiative is spearheaded by Harvard’s Institutional Data Initiative (IDI). The dataset contains books derived from Google’s longstanding book-scanning project, Google Books.

Other books contained in the dataset include Czech math textbooks and Welsh pocket dictionaries.

The university teased the IDI in March, stating its plans to create a “trusted conduit for legal data for AI.” Little was heard from it until the formal launch on Thursday. Tech giants Microsoft and OpenAI funded the project.

The dataset is not reserved for Silicon Valley alone: IDI has opened it to anyone, from research labs to AI startups, who wants to train large language models.

IDI executive director Greg Leppert said that opening the dataset to anyone is meant to level the playing field at a time when the cost of training AI remains prohibitively high for smaller companies, making it the preserve of those with huge budgets.

Leppert added that the dataset is “rigorously reviewed,” which, according to Fudzilla, presumably means someone checked to ensure that the Bard was really gone and out of the way.

The Harvard dataset will need more resources

According to Leppert, who compared the dataset’s potential to Linux, the open source operating system, the success of the Harvard dataset will hinge on a number of variables. Leppert said its success will require more resources, expertise, and a “sprinkle of magic” from those same deep-pocketed corporations that the initiative is designed to challenge.

The million books contained in the dataset were scanned as part of the Google Books program. Fudzilla describes the initiative as a digital time capsule from when Google’s ambitions to scan every book seemed quirky rather than dystopian.

However, Leppert is upbeat about the project’s potential uses, suggesting it could be a treasure trove that helps train AI models for everyone from garage startups to corporate conglomerates.

While some have praised the initiative as a revolutionary leap forward in democratizing AI, Fudzilla opines that some might see this as a subtle means of ensuring that any ambitious upstart with a few terabytes of server space can now compete in a race to develop the next ChatGPT.

However, they will need more resources to compete and make a dent in the market. ChatGPT launched in November 2022 to immediate success, which spurred a global race to build generative AI models. Developing these models has created a thirst for training data, raising the question of how much information developers can obtain without stealing it.

To date, publishers like the Wall Street Journal and the New York Times have sued OpenAI and Perplexity over using their data without permission.
