OpenAI and Google's double standard: training large models on everyone else's data while never letting their own flow out

Editors: Du Wei, Zi Wen

In the new era of generative AI, big tech companies are pursuing a "do as I say, not as I do" strategy when consuming online content. To a certain extent, this amounts to a double standard and an abuse of their dominant position.

Meanwhile, as large language models (LLMs) have become the mainstream of AI development, established companies and startups alike are sparing no effort to build their own large models, and training data is a key determinant of a model's quality.

According to a recent Insider report, Microsoft-backed OpenAI, Google, and Google-backed Anthropic have for years been using online content from other websites and companies to train their generative AI models. None of it was done with specific permission, and it will form part of a brewing legal battle that will determine the future of the web and how copyright law applies in this new era.

These big tech companies might argue that their use of the data constitutes fair use, though whether that is really the case is debatable. What is clear is that they will not let their own content be used to train other companies' AI models. This raises the question: why should these companies be allowed to use everyone else's online content when training their large models?

These companies are smart, but also very hypocritical

That big tech companies use other people's online content while forbidding others from using their own is well documented: the evidence is right in the terms of service and terms of use of some of their products.

Let's first look at Claude, the ChatGPT-style AI assistant launched by Anthropic. The system can handle tasks such as summarization, search, assisted writing, question answering, and coding. It was recently upgraded again: its context window was extended to 100K tokens and its processing speed was greatly accelerated.

Claude's terms of service read as follows. You may not access or use the Service in the following ways (only some are listed here), and to the extent any of these restrictions conflict with or are unclear relative to the Acceptable Use Policy, the latter shall prevail:

  • Develop any products or services that compete with our Services, including developing or training any AI or machine learning algorithms or models
  • Crawl, scrape, or otherwise obtain data or information from our Services in a manner not permitted by the Terms

Claude Terms of Service address:

Likewise, Google's Generative AI Terms of Use states, "You may not use the Service to develop machine learning models or related techniques."

Google Generative AI terms of use address:

What about OpenAI's terms of use? They are similar to Google's: "You may not use the output of this service to develop models that compete with OpenAI."

OpenAI terms of use address:

These companies are smart enough to know that high-quality content is critical to training new AI models, so it makes sense for them to forbid others from using their output this way. But then how do they explain their own freewheeling use of other people's data to train their models?

OpenAI, Google, and Anthropic all declined to respond to Insider's requests for comment.

Reddit, Twitter and others: Enough is enough

In fact, other companies were not happy once they realized what was happening. In April, Reddit, whose data has been used for years to train AI models, announced plans to start charging for access to that data.

Reddit CEO Steve Huffman said, "Reddit's data corpus is so valuable that we can't give that value away for free to the largest companies in the world."

Also in April this year, Musk accused Microsoft, OpenAI's main backer, of illegally using Twitter's data to train AI models. "Lawsuit time," he tweeted.

But in response to Insider's request for comment, Microsoft said that "the premise is so wrong that I don't even know where to start."

OpenAI CEO Sam Altman is trying to get ahead of the issue by exploring new AI models that respect copyright. "We're trying to develop a model where, if an AI system uses your content or your style, you get paid for it," he said recently, as reported by Axios.

Publishers (including Insider) have a vested interest here. In addition, some publishers, including U.S. media giant News Corp, are already pushing technology companies to pay to use their content to train AI models.

The current way of training AI models "breaks" the web

At least one former Microsoft executive agrees that something is wrong here. Microsoft veteran and well-known software developer Steven Sinofsky believes that the current way of training AI models "breaks" the web.

He wrote on Twitter, "In the past, crawled data was exchanged for click-throughs. But now it is used only to train a model and brings no value to creators and copyright owners."
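Technically, sites that want to opt out of this kind of crawling already have a declarative mechanism: robots.txt, the same long-standing convention that underpinned the crawl-for-clicks bargain Sinofsky describes. As a minimal, illustrative sketch (the site and path here are hypothetical; "GPTBot" is the user-agent name OpenAI has published for its crawler), Python's standard library can check whether a given crawler is permitted to fetch a page:

```python
# Illustrative sketch: check whether a crawler user agent may fetch a page
# according to the site's robots.txt. Uses only the Python standard library.
from urllib.robotparser import RobotFileParser

def crawler_allowed(site: str, user_agent: str, path: str) -> bool:
    """Return True if robots.txt at `site` permits `user_agent` to fetch `path`."""
    rp = RobotFileParser()
    rp.set_url(f"{site}/robots.txt")  # location of the site's crawl policy
    rp.read()  # fetch and parse the live robots.txt
    return rp.can_fetch(user_agent, f"{site}{path}")

if __name__ == "__main__":
    # Hypothetical example: would OpenAI's crawler be allowed on this page?
    print(crawler_allowed("https://example.com", "GPTBot", "/some-article"))
```

The catch, of course, is that robots.txt is purely voluntary: it only works if the crawler chooses to honor it, which is exactly why publishers like Reddit are reaching for paid APIs and legal threats instead.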

Perhaps, as more companies wake up to this, the lopsided data practices of the generative AI era will soon change.

Original Link:
