Infringement risks and legal paths in large AI model training

By Wang Yan and Bie Yanghong, Han Kun Law Offices

Large AI models have made significant strides in recent years, showcasing impressive capabilities. But the performance and effectiveness of these models heavily rely on the scale and quality of training datasets, which frequently include copyrighted material.

In this article, the authors explore potential copyright infringement risks and legal paths within the IP law framework for the AI industry when acquiring and using training datasets.

Copyright infringement risks

Wang Yan
Partner
Han Kun Law Offices
Tel: +86 21 6080 0200
E-mail: yan.wang@hankunlaw.com

The datasets used to train large AI models are likely to be protected by copyright law. Developers must therefore address the legality of their training datasets before anything else in the development process.

  1. Datasets are copyright-protected. Training datasets, whether self-collected, purchased from data service providers or scraped from public sources, may include vast amounts of copyrighted works. Additionally, the originality of the data arrangement may qualify the dataset itself as a compilation work protected by copyright law.

For instance, the GPT-4 model’s training dataset contains substantial web text, and assembling it requires original selection, classification and integration. The resulting corpus could therefore be considered a protected compilation work due to its selective composition and originality in arrangement.

  2. Copyright exceptions in current legal frameworks. Various jurisdictions, such as Japan, the US and Europe, have established possible copyright exceptions for large AI model training data/datasets.

For example, European countries including the UK, France and Germany have introduced “text and data mining exceptions” in their copyright laws. Japan’s copyright law also provides specific exceptions for “data analysis”. In the US, case law has brought data storage and mining within the fair use doctrine, based on the theory of “transformative use”.

However, current provisions in China for fair use and statutory licensing in the Copyright Law do not effectively exempt the use of datasets in the training of large AI models from potential copyright infringement.

Specifically, article 24 of the Copyright Law sets out the fair use system, but it generally does not apply to the use of datasets in training large AI models.

First, generative AI models are mostly developed for commercial purposes, so the use of training datasets does not fall under personal research or appreciation. Nor is it easily categorised as classroom teaching or scientific research.

Furthermore, the training process for large AI models often involves extensive quoting or even full-text copying of others’ works, which may exceed the limits of appropriate citation.

Finally, although the Copyright Law establishes four types of statutory licensing systems, these systems often fail to meet the needs for copyright exemptions when using datasets for large AI model training.

Outlook

Bie Yanghong
Associate
Han Kun Law Offices
Tel: +86 21 6080 0241
E-mail: yanghong.bie@hankunlaw.com

Throughout history, major technological shifts have often driven advancements in legal frameworks. For example, the growth of internet forums prompted the creation of “safe harbour” rules in the US Digital Millennium Copyright Act, while the proliferation of search engines led to reinterpretation of the fair use doctrine.

These new regulations emerged from efforts to balance the interests of various stakeholders.

Currently, large AI models have demonstrated considerable potential to boost productivity, yet their performance hinges on access to high-quality training datasets. It is crucial, however, to also fully safeguard the legitimate rights of copyright holders involved in these datasets.

Hence there is an urgent need to reconcile the tension between technological innovation and copyright protection through updated legal frameworks.

Given the swift advancement of large AI model technology, China could explore several approaches, including those below, to balance the interests of model developers and the copyright holders of training datasets, thereby ensuring that datasets are used legally.

  • Expanding the scope of fair use to include large AI model training activities that do not externally output generated content could be a viable approach.

For example, if AI systems are used solely for internal functions such as grading or data analysis, they may utilise copyrighted works during training without producing outputs that are identical or similar to the training content.

Because such uses do not materially affect the interests of copyright holders, treating these training activities as fair use could foster technological advancement while preserving the legitimate rights of rights holders.

  • Drawing on the “safe harbour” rules, a notice-and-block mechanism could be established. Under this scheme, when copyright holders identify potentially infringing content generated by large AI models, they can notify the developers.

If developers, upon receiving such notifications, can swiftly implement technical measures to block the infringing content and prevent its further distribution, they would be exempt from liability. This approach encourages developers to proactively regulate their actions and respond promptly to the legitimate demands of rights holders.
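
The notice-and-block flow described above can be illustrated with a minimal sketch. It is not drawn from any statute or existing system: the `Notice` and `OutputFilter` names are hypothetical, and exact substring matching stands in for whatever similarity detection a developer would actually deploy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Notice:
    """A rights holder's notification (hypothetical structure)."""
    rights_holder: str
    infringing_text: str  # the generated content the holder objects to


@dataclass
class OutputFilter:
    """Withholds generated outputs that contain content named in a notice."""
    blocked: set = field(default_factory=set)

    @staticmethod
    def _normalise(text: str) -> str:
        # Collapse case and whitespace so trivial variations still match.
        return " ".join(text.lower().split())

    def receive_notice(self, notice: Notice) -> None:
        # Step 1: on notification, record the content to be blocked.
        key = self._normalise(notice.infringing_text)
        if key:  # ignore empty notices, which would otherwise block everything
            self.blocked.add(key)

    def release(self, generated: str) -> Optional[str]:
        # Step 2: withhold any output that contains blocked content.
        candidate = self._normalise(generated)
        if any(item in candidate for item in self.blocked):
            return None
        return generated
```

In this sketch, `release` returns the text unchanged until a matching notice arrives and returns `None` afterwards, which mirrors the idea of blocking content and preventing its further distribution once the developer has been notified.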


Wang Yan is a partner at Han Kun Law Offices. He can be contacted by phone at +86 21 6080 0200 and by email at yan.wang@hankunlaw.com.

Bie Yanghong is an associate at Han Kun Law Offices. She can be contacted by phone at +86 21 6080 0241 and by email at yanghong.bie@hankunlaw.com.
