Featuring:
Rowan Curran, Analyst
Show notes:
One of the biggest limitations on the development of effective AI right now is access to relevant and risk-free data. That’s where synthetic data comes in. Analyst Rowan Curran joins the podcast to discuss how synthetic data can help expedite AI efforts.
The episode starts with a definition of what synthetic data for AI actually means. Curran points out that synthetic data generated for AI is different from the synthetic data used for load testing or performance testing. “We’re talking about data sets that mimic real-world data,” he says. “There is just not enough data of the right type or quality to infer and predict the things we want to predict.” He also emphasizes that synthetic data isn’t “fake” but rather is “synthesized” data for a specific use.
Using synthetic data to test AI models has some key advantages over simply encrypting or anonymizing actual data. Because synthetic data doesn’t represent a real person’s identity or traits, there is no risk of releasing personal information accidentally or through an attack. For example, inference attacks, while not common, can reveal certain things about the real data that sits behind an AI model. Putting synthetic data behind the model can eliminate that risk, which is especially valuable in healthcare, where patient data is used. Curran also mentions that synthetic data can help alleviate governance concerns around sharing personal data (such as patient or customer data) between business partners.
From there, the discussion turns to how synthetic data for AI is created. In some cases, it’s an extrapolation of an existing data set to create a much larger one that closely mirrors the original but is not actual personal data. There are also platforms that will generate synthetic data based on specific parameters or inputs, which is useful in computer vision applications where a user may want to generate a 3D object for an online game or virtual world.
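For readers who want a concrete picture of the "extrapolation" approach, here is a minimal sketch in Python. It is not a method described in the episode: it assumes a tiny, made-up table of numeric records (the `age` and `systolic_bp` fields and the helper names `fit_columns` and `synthesize` are hypothetical), estimates a mean and standard deviation for each column, and samples a much larger set of new rows from those estimates. It also treats each column as independent, which real synthetic data platforms generally do not; they typically model relationships between fields and may add privacy protections that this sketch lacks.

```python
import random
import statistics


def fit_columns(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    params = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        params[name] = (statistics.mean(values), statistics.stdev(values))
    return params


def synthesize(params, n, seed=0):
    """Draw n new rows, sampling each column from its own normal distribution.

    Assumption: columns are independent and roughly normal. This is a
    simplification for illustration only.
    """
    rng = random.Random(seed)
    return [
        {name: rng.gauss(mu, sigma) for name, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


# Hypothetical seed records; the values are invented for this example.
original = [
    {"age": 34, "systolic_bp": 118},
    {"age": 51, "systolic_bp": 131},
    {"age": 47, "systolic_bp": 125},
    {"age": 29, "systolic_bp": 112},
]

params = fit_columns(original)
synthetic = synthesize(params, n=1000)
```

The result is 1,000 rows whose per-column averages track the four original records, yet none of the rows is copied from a real record. Statistical resemblance alone is not a privacy guarantee, so a sketch like this should not be used on sensitive data as-is.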
Throughout the episode, Curran provides specific examples of practical use cases for synthetic data and even some cases where it may not be the best solution. Be sure to stick around for his closing thoughts on the future potential of synthetic data for use in AI.