
How to Respect GDPR Principles in a Data Lake Project?

Lea Richard, Data Protection Officer, ex-TikTok and Ledger, answered ITBusinessCrush's questions about data protection in a data lake project. The EU General Data Protection Regulation (GDPR) imposes obligations on organizations anywhere, so long as they target or collect data related to people in the EU. The nature of the data handled in a data lake project leads to additional considerations that Lea detailed during our discussion.

How do you approach the subject of the GDPR?

 Approaching the subject through a purely legal lens can be very counterproductive in the medium term because it does not allow business integration and collaborative management of these projects, even though GDPR implementation is 80% operational and technical. I prefer to talk with companies about project management, customer success, and optimization of tool acquisition.


What is at stake with a data lake project from a GDPR point of view?

 From a GDPR point of view, what is at stake is knowledge of the data and its purpose: the less structured the data, the more complicated compliance with the regulation becomes. The data lake, which is a pool of data of all kinds, is a red flag for the Data Protection Officer (DPO). As a result, it is necessary to determine what type of sensitive data, not only within the meaning of the GDPR, will be put in it and for what use. The challenge is to document the approach and to be able to justify the reasoning to customers, end consumers, or supervisory authorities. If I know why I have structured my data in a certain way, or, if it is not structured, how I balance the volume against the quality of the processing, it's a good start.


What is sensitive data within the meaning of the GDPR?

 A data set is sensitive when it gives indications, more or less direct or inferred, on the person's religion, sexual orientation, political opinions, or health – many health data are sensitive, but not all. And then there is the notion of sensitive data in the sense of business and security:

  • Is this data confidential? For example, it is part of a fundraising document.
  • Is it identified as an easy target in cyber-attacks?
  • Is it easily reachable?

For example, a mailing address can be used for a physical extortion attempt.

A piece of data has three possible values in terms of security:

  • Integrity
  • Availability: from a legal point of view, this determines the level of contractual liability. If a customer or a partner needs it intensely, and its unavailability could lead to consequences for their business, this is an element to consider in the risk analysis. 
  • Confidentiality

I always make a broader analysis than the GDPR requires, to take business parameters and business continuity into account. For example, the project team must consider the provision of a particular database for machine learning from a continuity point of view. This approach makes the notion of sensitive and personal data cut across the whole business. It's even more critical when deploying a project over several geographical areas because the regulations are not the same. Looking at data from a business perspective speeds up analysis without getting lost in legal specificities.

Professional email is generally not considered sensitive data unless the business model requires that we cannot know who intervenes in the process. 


Who should be involved in the data lake project?

 When analyzing the risks of a data lake project, you should ideally include the CISO, the DPO, an IT representative, and someone from IT infrastructure.

Putting the CISO and the DPO in the loop as soon as possible is necessary, not necessarily for an immediate opinion, but to let them familiarize themselves with the subject. There is an inequality of knowledge and skills on data topics, and there is a semantic gap. The DPO needs time to get up to speed on the business project and the technical implementation, especially if they are a lawyer. The DPO must acquire a detailed knowledge of the data to establish a risk matrix combining security and regulation.
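
To make the idea of a risk matrix more concrete, here is a minimal sketch in Python. It is purely illustrative and not taken from the interview: the 1-to-3 scale, the `DataAsset` fields, and the rule that GDPR-sensitive data raises the level by one are all assumptions that a real project would agree on with the CISO and the DPO.

```python
from dataclasses import dataclass


@dataclass
class DataAsset:
    """One data set in the lake, rated on an assumed scale of 1 (low) to 3 (high)."""
    name: str
    confidentiality: int
    integrity: int
    availability: int
    gdpr_sensitive: bool  # True for sensitive data within the meaning of the GDPR


def risk_level(asset):
    """Combine the security rating and the regulatory flag into one level (1 to 4)."""
    # Security side: the worst of the three ratings drives the level.
    security = max(asset.confidentiality, asset.integrity, asset.availability)
    # Regulatory side (assumption): GDPR-sensitive data raises the level by one.
    return min(security + (1 if asset.gdpr_sensitive else 0), 4)


if __name__ == "__main__":
    assets = [
        DataAsset("health records", 3, 2, 2, True),
        DataAsset("professional emails", 1, 1, 2, False),
    ]
    for asset in assets:
        print(asset.name, risk_level(asset))
```

The point of such a table is less the score itself than the documented reasoning behind each rating.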

It is necessary to create a partnership for data governance to avoid later redesign costs or even project termination. It's pretty easy to set up a monthly committee to talk about the topic, where everyone can learn from each other. 


What is the role of software publishers in data protection? 

 Their role has evolved considerably, especially with the decisions of the European Court of Justice. The Schrems II judgment, which it has just delivered, annuls for the second time in five years a legal instrument that allowed European companies to transfer data to the United States.

From there, it is up to publishers, who from a GDPR point of view are data processors, to explain to data controllers how data can be transferred to the United States in a lawful and secure way. The way in which the US government and its agencies can access data must be equivalent to the European framework. This is very difficult to guarantee. An agreement on a new transfer mechanism was reached last month, but there is much work to be done before it is clear to lawyers and operational staff.

The result for publishers, as processors, is that they have had to set up documentation centers and adapt their contractual clauses. All this serves as a kind of guide, which facilitates the conversation and even gives customers negotiating leverage on security guarantees and hosting.

The DPO must look at the safeguards offered by publishers, such as algorithmic data privacy building blocks that can be integrated into the data lake platform; Gretel is one example. The DPO must therefore become familiar with the technology to be able to propose this type of solution at the right time during the construction of the data lake.


What special treatment applies to unstructured data?

 The GDPR does not differentiate between structured and unstructured data, and this is one of the limits of the legal text. Qualifying unstructured data from a GDPR point of view requires mapping it and therefore developing an ontology that makes it possible to identify it. There is a growing market for data mapping; I think of companies like Ethyca in the United States or a small French company like Leto that promise to plug into the data lake and map everything in a couple of clicks! In the meantime, we do it somewhat by hand, relying on the maps already made by the CISO and on knowledge of the business, because that's how we can infer data classifications.
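
As an illustration of what a hand-made first pass at mapping can look like, here is a minimal Python sketch. It is not something described in the interview: the category names, the regular expressions, and the `classify_text` and `build_data_map` helpers are assumptions, and simple patterns like these will miss a great deal of personal data.

```python
import re

# Hypothetical, deliberately simple patterns. Real data mapping needs far more
# than regular expressions and should be validated with the DPO and the CISO.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def classify_text(text):
    """Return the sorted list of personal-data categories detected in a text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def build_data_map(documents):
    """Map each document identifier to the categories found in its content."""
    return {doc_id: classify_text(content) for doc_id, content in documents.items()}


if __name__ == "__main__":
    sample = {
        "ticket-1.txt": "Please call me back on +33 1 23 45 67 89.",
        "note-2.txt": "Contact: jane.doe@example.com",
        "readme.txt": "Nothing personal here.",
    }
    for doc_id, categories in build_data_map(sample).items():
        print(doc_id, categories)
```

A map like this only flags where to look; deciding what the data is used for, and whether it is sensitive, still takes knowledge of the business.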

How do you plan the customers' right to access data?

 By asking what data, at a minimum, the different types of customers can request, then structuring it from the outset and automating its transmission. The right of access is not a right to see all the company data that could concern you from near or far. A first limit is the clarity of the request: the person concerned must be able to formulate what they are looking for. A request like "give me everything you have on me in your data lake" can easily be challenged because, from a logical point of view, there is no added value in sending everything. Otherwise it ends up like Max Schrems' story with Facebook. Schrems, an Austrian student activist, asked Facebook for all his data, and Facebook sent 24,000 pages to his home in boxes, probably to dissuade him from pursuing his actions!


As for what the customer wants and is entitled to expect, the two questions to ask are:

  • Do you know who I am? Do you know how to identify me directly? If it is anonymized data with different entropy levels, it will only identify the person.
  • Can the company make decisions about me based on prediction and profiling models that will impact my free will and my decision-making? Can it create discrimination? Typically, when you look at social networks, it's legitimate for them to have an answer on how they profile me and how they use that data to sell my profile to advertisers. 


So, this last question will be answered by exercising rights other than the right of access, such as the right to information. Providers must specify which profiling activities are carried out and from which data, but without giving access to the data itself, because that would be an unbearable cost for the company.


What about internal access to data in the company?

 In analyzing a data lake project, the access policy is an important point. When it is difficult to map data or its risks, it is a good reflex to list, on a "need-to-know" basis, by function or by name, who must have access. The security question is what happens if the data leaks, or if you are a victim of ransomware and the data becomes public. Do we want to multiply the investigations by the number of people who have been able to access this data? Do we have a clear policy that allows the investigations to happen promptly?
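
A need-to-know list by function or by name can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any tool mentioned in the interview: the `AccessPolicy` class, the `people_to_investigate` helper, and the sample names and functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Need-to-know list for one data set: allowed functions and named people."""
    allowed_functions: set = field(default_factory=set)
    allowed_names: set = field(default_factory=set)

    def can_access(self, name, function):
        # Access is denied by default; it is granted by function or by name.
        return function in self.allowed_functions or name in self.allowed_names


def people_to_investigate(policy, staff):
    """List who could have accessed the data set, for example after a leak."""
    return sorted(
        name for name, function in staff.items() if policy.can_access(name, function)
    )


if __name__ == "__main__":
    policy = AccessPolicy(allowed_functions={"data-engineer"}, allowed_names={"alice"})
    staff = {"alice": "marketing", "bob": "data-engineer", "carol": "sales"}
    print(people_to_investigate(policy, staff))
```

The shorter the list returned by `people_to_investigate`, the faster an investigation can proceed when something goes wrong.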


If there is a good culture of data protection, with a clear understanding of what it represents as an issue of trust, reputation, and income protection, it means that when it comes to having these conversations, everyone has the main principles in mind, and therefore enjoys either a level of autonomy or a level of trust. At Ledger, I spent months and months doing just that, awareness, because I was all alone, and that's all I could do.


May 2022
