by Alexandra Ebert

Why synthetic data is changing AI, data, and privacy

AI-generated synthetic data not only enables better data privacy, but it can also empower every department in your organization to work more effectively with data. It's an anonymization technology that retains near-perfect data utility while keeping the privacy of your customers securely protected. I'm Alexandra Ebert, a responsible AI, privacy, and synthetic data expert, the host of the "Data Democratization" podcast, and Mostly.ai's Chief Trust Officer. And in this LinkedIn Learning course, we will talk about the fundamentals of AI-generated synthetic data for both data scientists and data executives, and how synthetic data contributes to building responsible, fair, and explainable AI that benefits everyone.

Why is Synthetic Data Important

The business problem of data vs. privacy

Synthetic data will become the key enabler for AI in business and policy applications. This was the conclusion of the 2022 synthetic data report from the European Commission's Joint Research Centre. Analysts from Gartner stated that by 2024, 60% of all AI training data will be synthetically generated. They also pointed out that one of the three key actions for AI and data leaders is to use synthetic data instead of real data to, quote, "transform their data from a liability to an asset." What can we take from this? AI and data-driven innovation needs synthetic data, but why? Because synthetic data solves the business problem of data versus privacy. And that's a huge challenge for nearly every larger business-to-consumer organization, regardless of which industry you look at. Banks, insurers, healthcare organizations, retailers, telco providers: they are all pressured to innovate with their data assets, to digitally transform themselves, and to offer increasingly personalized products and services to stay competitive and to meet the ever-growing expectations of today's customers. But in these spaces, innovating with data assets oftentimes means touching personal data of your customers or your employees, which isn't particularly easy if you plan to do business in any of the 120 plus countries that are governed by data protection laws like the GDPR, Europe's General Data Protection Regulation, or the California Consumer Privacy Act, for example. Don't get me wrong, I'm a big supporter of modern privacy laws and think they do a phenomenal job in safeguarding people's privacy and creating the necessary awareness that privacy protection is important. But what I do see in practice is that they also severely restrict or prohibit how businesses can use personal data.
And that creates a massive problem because, contrary to what many believe, personal data is not only demographic data like your name, your social security number, or your home address, but something that's much broader. And very importantly, it also encompasses the type of customer data that's most valuable to businesses: behavioral data. What's that? Well, basically every action you take online or that is recorded digitally. Think back to when you purchased your mom's birthday present on Amazon and paid with your credit card. That's behavioral data. Or think of your location data that your phone carrier recorded when you stepped into a coffee shop Tuesday morning. That too is highly privacy-sensitive personal data that needs to be protected, and that can only be processed in compliance with privacy laws. And at the same time, this is your most valuable data resource as an organization that you want to get access to, particularly for AI and data-driven innovation. Because in the second decade of the 21st century, it just doesn't cut it anymore to personalize based on coarse demographic profiles and treat all of your customers the same just because they share an age group, a gender, or a zip code. Humanity is diverse and highly unique, and it's our behaviors that tell you much more about who we actually are, what we want to buy, and when we are receptive to your personalized product offerings. So what does all of this mean for your company? If your organization caters to end consumers, then the majority of your data assets are personal data. They're governed under strict privacy laws and they're not easy to access for innovation. Yet to stay competitive and relevant, you must find a way to reconcile behavioral data utilization with privacy compliance, and this is where synthetic data comes in.
But before we take a closer look into what synthetic data is and why it helps you here let's first look into legacy anonymization and why it is not a solution to overcome this business problem of data versus privacy.

The pitfalls of legacy anonymization

If you want to use data while protecting your customers' privacy, why not just anonymize it? And in theory, that's a great idea, as anonymous data is explicitly exempt from modern data protection regulations like GDPR. But when people talk about anonymization, they most often refer to what I describe as legacy anonymization techniques: data masking, obfuscation, and various other approaches. What all of these have in common is that they try to delete, distort, strike through, or shuffle around those parts of a dataset that are deemed to be privacy sensitive, like your last name, social security number, birthdate, home address, and the like, while leaving other parts of the data intact. But there are two major problems with that. First, all of these approaches are destructive in nature. They work by deleting the majority of your information, the majority of the valuable insights that are waiting to be uncovered. And this diminishes your dataset's utility, especially if you want to use it for sophisticated analysis or machine learning. Second, and probably even worse for a privacy protection technique, legacy anonymization fails to protect privacy in the era of big data. And there are dozens and dozens of peer-reviewed papers showing that legacy anonymization is no longer capable of safely anonymizing data. Yes, these techniques might have worked back in the days when organizations only had small data. Think of demographic information and, overall, only a few dozen attributes per customer. But today, businesses have big data assets, hundreds if not thousands or tens of thousands of attributes for every single customer, because they not only collect demographics but detailed behavioral data like financial transactions or healthcare records, which are highly dimensional in nature and thus notoriously more complex to anonymize and protect.
So what all these studies found was that regardless of how many data points you delete or mask, if there is real data left intact, even if it's just tiny bits and pieces, then you're introducing a severe re-identification risk. To illustrate, think of credit card transaction data. Presumably everyone owning a credit card will have at least a few dozen, if not a few hundred, credit card transactions per year. And one study found that three credit card transactions alone were sufficient to re-identify over 80% of individuals. Three out of the hundreds of transactions that exist per customer. And if that weren't bad enough, not even the entire transaction data was needed to infringe the individual's privacy. Only the merchant and the date of the transaction had to be revealed. And that's not a unique example from the financial services industry. The same thing holds true for healthcare data, mobility data, and any other kind of behavioral data, because data about our behaviors and actions is so rich, so high dimensional, that every individual quickly becomes extraordinarily unique, easy to re-identify, and thus impossible to anonymize with legacy anonymization techniques. And sadly, there are also plenty of examples of companies that continue to rely on these outdated anonymization techniques and release what they believe is anonymous data, only to find out shortly after that the data wasn't so anonymous after all. So what's the result of this? What's the impact for data-driven businesses? While properly anonymized data is exempt from modern privacy laws, legacy anonymization is not fit for purpose anymore. With legacy anonymization, you can neither protect your customers' privacy nor preserve the utility that you need in your data. And continuing to rely on it introduces severe legal, financial, and also reputational risks to your organization.
And these pitfalls of legacy anonymization, paired with the ever-growing need of organizations to access data in a privacy-safe manner, are the exact reason why synthetic data is seen as the key enabler for artificial intelligence, for data democratization, and for data innovation that respects people's privacy. Because it's this game-changing technology that allows you to do something that no legacy anonymization technology prior to it was capable of: fully and irreversibly anonymizing big data sets while retaining near-perfect data utility. And this is why synthetic data helps organizations to unlock and tap into the huge potential that lies hidden in behavioral data by allowing them to safely use it, share it, and collaborate on it with external partners. "How?" you might wonder. No worries. We have a whole course to cover this in detail.
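
To make the re-identification risk described in this section more tangible, here is a minimal Python sketch. It estimates how often a handful of known (merchant, day) pairs is enough to single out one customer in a dataset where names were removed but behavioral records were left intact. The function names `unicity` and `make_toy_traces`, the trace format, and the randomly generated toy data are all assumptions made for illustration. They are not taken from the study mentioned above, and the numbers this toy data produces say nothing about real transaction data.

```python
import random
from collections import defaultdict


def unicity(traces, k, trials=200, seed=0):
    """Estimate the share of customers who are singled out by k known points.

    traces maps a customer id to a set of (merchant, day) pairs (hypothetical format).
    """
    rng = random.Random(seed)

    # Inverted index: which customers share a given (merchant, day) pair?
    index = defaultdict(set)
    for customer, points in traces.items():
        for point in points:
            index[point].add(customer)

    eligible = [c for c, points in traces.items() if len(points) >= k]
    if not eligible:
        raise ValueError("no customer has at least k recorded points")

    unique_hits = 0
    for _ in range(trials):
        customer = rng.choice(eligible)
        # The "attacker" knows k of this customer's transactions.
        known = rng.sample(sorted(traces[customer]), k)
        # Everyone in the dataset who matches all k known points.
        candidates = set.intersection(*(index[p] for p in known))
        if candidates == {customer}:
            unique_hits += 1
    return unique_hits / trials


def make_toy_traces(n_customers=300, n_merchants=40, n_days=365,
                    per_customer=50, seed=1):
    """Purely random toy data; real behavior is far more structured."""
    rng = random.Random(seed)
    return {
        c: {(rng.randrange(n_merchants), rng.randrange(n_days))
            for _ in range(per_customer)}
        for c in range(n_customers)
    }


if __name__ == "__main__":
    toy = make_toy_traces()
    for k in (1, 2, 3):
        print(k, "known points -> estimated unicity:", unicity(toy, k))
```

The point of the sketch is the mechanism, not the figures: every additional known data point shrinks the set of matching customers, which is why leaving even small pieces of real behavioral data intact carries a re-identification risk.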

Synthetic Data Fundamentals

What is synthetic data?

So let's start our discussion about what synthetic data is with a definition of it. But keep in mind it's not a clearly defined term. So, here is how I define it. Privacy-preserving AI-generated synthetic data is an anonymization technology that preserves data utility. It is artificial data that is created by a deep learning model, which is trained on real-world privacy-sensitive data, and it can accurately and granularly retain the statistical properties of the real data it was trained upon. Yet, it is generated with a holistic set of privacy mechanisms, which ensure that the end result, the synthetic data, is fully anonymous. Now, so much for the definition of it, but let me give you a more tangible example of what synthetic data actually is. And to explain this, I always like to use images, even though this course will not focus on images and other types of unstructured data, but on structured, tabular data, like financial transactions, healthcare data, and the like. So, let's take a look at these photos for a moment. They look pretty convincing and realistic, right? But as you might have guessed, they are AI-generated synthetic faces. The human beings in these photos have never existed. A powerful deep learning algorithm was trained on thousands and thousands of human faces, up to the point where this algorithm really understood what a human face looks like and what the logic is behind how a human face is constructed. It learned things like humans have two eyes, which are roughly positioned in the middle of the face. Noses, which come in these shapes, sizes, and forms. And hairstyles and skin tones that range from this to that spectrum. All of this was learned and automatically captured. And then once this training step was completed, in a completely separate step, you could use this generator to create a new, unlimited number of synthetic images. And all of them looked highly realistic. Yet, none of them had any direct attribute from the training samples.
So it was not a legacy anonymization-like process where you take the pair of eyes from training example A and put it in your synthetic face, and the mouth from training example B, and shuffle all of this together. Instead, it was really created from scratch, based on the logic that the deep learning algorithm, the synthetic data generator, extracted. And a few years back, having AI-generated photos that looked that realistic was revolutionary. But now we have ChatGPT, DALL-E 2, and the power of generative AI has landed in everyone's pocket, in everyone's smartphone. And getting an AI-generated text, or image, or synthetically created video is something that's cool, but people are growing accustomed to it. What revolutionized the space of privacy protection was combining this power of deep learning with privacy mechanisms, which made it possible to unleash the power of generative AI on your own data assets that you have within your organization. And why is that relevant? Because even though these large generative image tools, or data generators, are interesting, they're trained on data where you don't know what the source actually is, and they're not tasked to be factual or truthful. But if you use generative AI, synthetic data, on your own data, then you get results that you can actually trust. So with structured, AI-generated, privacy-preserving data, you have the power of generative AI at your fingertips, and have a fully anonymous version that's as close to your real customer data as possible, but without any privacy risks attached to it. So let's jump to the next video and walk through in more detail how privacy-preserving AI-generated synthetic data in the structured space is created.

How is synthetic data generated?

Creating synthetic data is a two-step process. First, you have the training phase, and second, you have the generation phase. Let me walk you through a practical example. Imagine we have a bank with its customers and their privacy-sensitive financial transactions. If this bank wants to create a synthetic, fully anonymous version of its customer base, how would it do that? First, it would use a synthetic data generator to train on the original privacy-sensitive customer data. And this generator, thanks to the deep learning algorithm that's part of it, is capable of automatically learning all the patterns, the correlations, the time dependencies, anything that is there to learn about the customers. To simplify, the algorithm can basically understand how the customers of the bank act and behave. And then once this training phase is completed, in a completely separate step, the generator can be used to create new synthetic customers and their synthetic financial transactions from scratch. And if you look at those two data sets, the real privacy-sensitive dataset and the anonymous synthetic version thereof, from a statistical point of view, they're nearly indistinguishable. You will find the same patterns, correlations, and time dependencies in there. And it's not only highly statistically representative, but the synthetic data is also highly realistic and structurally identical to your original data. This is why synthetic data works so well as a drop-in replacement for production data across many different use cases. But keep in mind, it's an anonymization technology, so of course you don't get 100% of the information. Retaining that much simply wouldn't be possible due to privacy. But with synthetic data, you get as close to the real information as possible, while ensuring that nothing privacy sensitive finds its way into the synthetic dataset.
And to achieve this, a generator for privacy preserving synthetic data needs to have the necessary privacy mechanisms in place. So the powerful deep learning algorithm is what gives you the accuracy of your synthetic data. While the privacy mechanisms give you the anonymity by making sure that this generator only learns statistically generalizable patterns, even if it's down to a very granular level, but that nothing that can be considered a personal privacy sensitive secret of one single customer, or even a smaller customer group, can find its way into the synthetic data. That's it, that's how you create privacy preserving synthetic data. But I know, this can be a lot at the beginning, so let me highlight those points that are particularly important to this process. First, synthetic data for anonymization is not created out of thin air. You need to have real data from the beginning. Second, if you want to have privacy preserving synthetic data that's fully anonymous and exempt from privacy laws, having a powerful synthetic data generator alone is not enough, you must have the necessary privacy mechanisms in there to ensure successful anonymization. And we look into what those privacy mechanisms are in more detail later in this course. Third, synthetic data is fundamentally different than legacy anonymization. You don't just shuffle around or mask some parts of the data while leaving other parts intact. The entire synthetic data set is created from scratch and on an individual level, there's no one-to-one relationship between any of your real customers and your synthetically generated ones. Fourth, and lastly, the end result that you get. The synthetic data consists of statistically highly realistic and highly representative data
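
The two-step process described in this section, training first and generation second, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not how a production synthetic data generator works: instead of a deep learning model it uses plain frequency counts, where each column is sampled conditioned on the previous one, and instead of a holistic set of privacy mechanisms it uses one crude stand-in, replacing values that occur fewer than `min_count` times with a placeholder. That stand-in alone does not guarantee anonymity. The class name `ToyTabularGenerator`, its parameters, and the example columns are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

RARE = "_RARE_"  # placeholder for values too rare to learn from (assumption)


class ToyTabularGenerator:
    """Count-based stand-in for a deep generative model (illustration only)."""

    def __init__(self, min_count=5):
        self.min_count = min_count
        self.columns = []
        self.tables = []  # per column: previous value -> Counter of values

    def _suppress_rare(self, rows):
        # Crude stand-in for privacy mechanisms: hide rarely seen values.
        counts = {c: Counter(r[c] for r in rows) for c in self.columns}
        return [
            {c: (r[c] if counts[c][r[c]] >= self.min_count else RARE)
             for c in self.columns}
            for r in rows
        ]

    def fit(self, rows, columns):
        """Step 1, training: learn how often each value follows the previous column."""
        self.columns = list(columns)
        rows = self._suppress_rare(rows)
        self.tables = []
        for i, col in enumerate(self.columns):
            table = defaultdict(Counter)
            for row in rows:
                context = row[self.columns[i - 1]] if i > 0 else None
                table[context][row[col]] += 1
            self.tables.append(table)
        return self

    def sample(self, n, seed=0):
        """Step 2, generation: draw new rows from the learned counts only."""
        rng = random.Random(seed)
        synthetic = []
        for _ in range(n):
            row, context = {}, None
            for col, table in zip(self.columns, self.tables):
                counts = table[context]
                values = list(counts)
                weights = [counts[v] for v in values]
                context = rng.choices(values, weights=weights)[0]
                row[col] = context
            synthetic.append(row)
        return synthetic


if __name__ == "__main__":
    real = (
        [{"segment": "student", "merchant": "cafe"}] * 10
        + [{"segment": "retiree", "merchant": "pharmacy"}] * 10
        + [{"segment": "ceo", "merchant": "yacht dealer"}]  # a single outlier
    )
    generator = ToyTabularGenerator(min_count=5).fit(real, ["segment", "merchant"])
    for row in generator.sample(5):
        print(row)
```

Note that generation never touches the original rows. It only reads the learned counts, which mirrors the point made above that the synthetic records are created from scratch, with no one-to-one link to a real customer, and that the single outlier never shows up in the output under its real values.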

What are the benefits of synthetic data?

Using synthetic data instead of real data or data that was anonymized with legacy anonymization comes with a handful of benefits. First, it's obviously privacy protection. Your customers' privacy is kept safe and secure. With synthetic data, you have fully anonymous data, which cannot be reverse-engineered, and this is why it's exempt from privacy regulations and free to use and share. The second benefit is the utility. In contrast to legacy anonymization technologies, you can have data that is nearly as good as your real production data, but without the privacy risks. This means that you have a dataset that's structurally identical to your production data and that's also as granular as your production data itself, which is key particularly for sophisticated analytics and machine learning use cases. An added benefit of this utility is explorability. In the days of legacy anonymization, you always had to know in advance which columns you needed to perform a given analysis. But particularly if we look into AI training, we can't tell before we've seen the data which columns will be the most relevant, where we have the most valuable insights about our customers. While you only get a handful of columns with legacy anonymization technologies, synthetic data gives you as many columns as your original data had. If you had 200 columns in your customer database, you will get 200 columns populated with synthetic, realistic, and representative customers. And the bottom line of this is you can look into the data, explore it, and discover the insights to build your analysis and AI models on top of this. Next is the speed, how fast you can access data. Particularly for larger organizations, accessing data, regardless of whether it's production data or legacy anonymized data, is something that takes several weeks, if you're lucky. But much more commonly, it takes months, especially if you want to share the data externally.
With synthetic data, it becomes possible to automate many of the processes that were previously involved with legacy anonymization procedures, and thereby get access to data in a matter of hours or a business day. And this not only translates to faster time to data, but also faster time to value, given how critical data is in many business applications and scenarios nowadays. Next, synthetic data can improve your customer understanding. Legacy anonymization technologies tend to give you the average customer. You lose the outer edges of your customer distribution, you lose the minority groups and the outliers. Obviously, synthetic data doesn't tell you about the unique unicorn in your customer set either, but you get much closer to the edges. You get much better insights, not only about the average Jane and John Doe, but about the full spectrum of human diversity. And this is something that can translate both into better personalization for different customer segments and, in the context of fairness, into more inclusive services, because you see how diverse your customer base actually is. Synthetic data also saves costs when compared to legacy anonymization. Legacy anonymization technologies require plenty of people and plenty of working hours being dedicated to deciding things like how many and which attributes you can share. There might be 200 columns in total, and you definitely can't reveal that many, but is it 5 you can share, 7, 10? Also, which combinations are safe to reveal, and when do you cross into the toxic combinations, which means having variables that are too risky to share in combination? All of this is really hard to figure out. You have to work with plenty of assumptions, and once you're done, it's not even possible to evaluate whether you successfully anonymized that data. It's a super cumbersome case-by-case process, and this obviously costs a lot of money. With synthetic data, on the other hand, all of this can be automated.
Regardless of which dataset you synthesize, it's always the same algorithm, always the same process, and this makes it much easier to automate and to save costs. And lastly, I want to highlight data democratization. Data is such a valuable resource nowadays that access to it shouldn't be limited to a privileged group of a few data scientists. It should be a resource that's democratized within your organization, or potentially even beyond the borders of your company. Only then can all the creative minds within the different departments see and understand who your customers actually are and come up with new and innovative ideas on how to better serve them or create more innovative products. So obviously, synthetic data is an enabling technology. It gives you access to data where you can't use real data due to privacy concerns, and it comes with a whole host of benefits. For one organization, the speed in accessing data might be the most important one. For another organization, it might be the added security or the ability to quickly share data externally. Now, think about your own organization, the data you have, and your synthetic data use case. Which benefit is the deciding factor that would be the biggest win for what you are trying to achieve?

What are the limitations of synthetic data?

As we talked about, synthetic data is a tremendous tool to protect privacy and it has many different benefits. But as with everything, synthetic data also comes with limitations. The first important limitation of synthetic data is that it's a technology that is designed to protect privacy, not proprietary business information. Sometimes, and particularly if you consider openly sharing synthetic data assets, the question comes up, "What about our confidential information?" What about information where privacy is not an issue but that's nonetheless sensitive? For example, if you shared a synthetic version of your customer base, competitors could see something like who your most profitable customer segment is or which age segments you're targeting. So it's important to keep in mind that synthetic data's job is to protect the privacy of your customers, not your proprietary business information. And whenever you want to share synthetic data with a partner, with an external research firm, a startup, or even openly, you need to keep this in mind. In some cases, you might want to consider only sharing a sample of your synthetic data or applying other sampling strategies to not reveal sensitive information about your business. Secondly, synthetic data is a big data anonymization technology. If you think of medical research studies where we sometimes have 20 or 70 people in a sample, you can't synthesize this data set and at the same time preserve its utility. In fact, you can't apply any anonymization technology to a sample that small and achieve meaningful privacy protection. Therefore, I would recommend considering synthetic data only if you have a data set with at least 5,000 or 10,000 customers or employees or whoever your privacy-sensitive entity is that you want to protect. In practice, most organizations use it on much larger data sets anyway, with hundreds of thousands or millions of customers in them,
as you always get more insights and better synthetic data quality if you have more training data available for the deep learning algorithm to learn from. But just remember that it's not a technology that's appropriate for small data use cases. Thirdly, synthetic data, as we've established, is an anonymization technology. It's not possible to reverse engineer synthetic data and to come back to the real data of your customers. Therefore, it's not suitable for certain use cases where you just want to temporarily protect the privacy of your customers, but need to have their full identity later on. This doesn't mean, though, that you can't use synthetic data for personalization. You obviously can use it to uncover all the insights in your data, to figure out which patterns actually are important, and build your algorithms to use these patterns to predict customer churn, next best actions, or which products to recommend. But once you apply them to the real customers, you obviously need the real data of those customers to make a functioning prediction. So keep in mind, synthetic data is not a technology that just temporarily protects privacy. Once it's synthetic, there's no way back. And lastly, a few other limitations to highlight, even though these might change as synthetic data progresses: today, you actually can't use synthetic data in real-time scenarios, and it's also not yet suitable to synthesize graph data. So now that we know the limitations and what it can't do, to conclude our chapter on understanding synthetic data at a high level, let's look at the different categories of synthetic data.

The different categories of synthetic data