
On 19 June 2025, the CNIL published two additional “how-to-sheets” on artificial intelligence, one on legitimate interest and the other on the collection of data via web scraping. These documents aim to clarify the rules applicable to the creation of training datasets containing personal data.
On 19 June 2025, the CNIL published two additional “how-to-sheets” on artificial intelligence. The first sets out the conditions under which the legal basis of legitimate interest may be used for the development of an AI system (see our post here), while the second focuses specifically on the collection of data via web scraping.
In its second “how-to-sheet,” the CNIL details the measures to be taken for the proper collection of data via web scraping. The widespread use of web scraping has changed how the Internet is used, making all data published online potentially accessible, collectable, and reusable. This practice poses significant risks for data subjects, including:
Although web scraping is not prohibited per se, the CNIL emphasizes the need for a case-by-case assessment and calls for the implementation of appropriate safeguards. It also recommends the introduction of specific legislation to regulate scraping practices by public authorities. In the absence of such a framework, the CNIL reminds controllers of their obligations and sets out the conditions under which these practices may be used for training AI systems.
The CNIL reiterates that certain measures are mandatory, particularly under the data minimization principle (Article 5(1)(c) of the GDPR). This includes:
The CNIL emphasizes that special attention must be paid to sensitive data, given the large volumes typically involved. Residual and unintentional collection of such data, despite precautions, is not unlawful per se, as confirmed by the CJEU (Case C-136/17). However, once a controller becomes aware that it is processing sensitive data, it must ensure immediate deletion, using automated means where possible.
Additionally, the CNIL recalls that processing sensitive data is only permitted by way of exception, notably where the data have been manifestly made public by the data subject. This requires a clear, positive action by the individual, made knowingly (CJEU, Case C-252/21, Meta Platforms).
To ensure the balance required under the legitimate interest basis, the controller must also take into account the reasonable expectations of data subjects. In this respect, the CNIL refers to the following criteria:
Finally, the CNIL stresses that the controller will generally need to implement additional safeguards to mitigate the impact on data subjects' rights and freedoms, particularly in view of the intended use of the trained AI system and its actual impact on data subjects. The controller must assess on a case-by-case basis whether such safeguards are required, depending on the specific modalities of the processing. Recommended safeguards include:
Authored by Joséphine Beaufour and Julie Schwartz.