There are wrong ways to fake your data. Whether you're bootstrapping a dev environment, automating integration tests, or capacity testing in staging, you need high-quality fake data. Whatever the use case, common data generation pitfalls can break your testing or, worse, leak sensitive data into unsecured environments.
There are ways to generate synthetic test data that achieve the realism required for effective testing along with the security needed to protect your company and customers. The secret lies in identifying potential anti-patterns and then preventing them from ever forming in the first place.
In the context of test data generation, we see anti-patterns emerging in one of three key ways.
- The process fails to mimic the complexity and requirements of real-world situations.
- The process fails to protect the privacy of the individuals behind the numbers.
- The process fails to work effectively for all data types and sources in the data ecosystem being mimicked.
The outcome of these failures? Broken data, worthless tests, bugs in production, and in the worst cases, a data security crisis.
A strong data generation infrastructure should have built-in tools to enable its users to generate the patterns they need as opposed to random data. Here, we’ll explore the bad (anti-patterns) to understand what it takes to enable the good (patterns).
1. A series of impossible events
Solution: Defined time series rules
From healthcare records to financial transactions to student progress reports, data across industries and platforms is rich with event pipelines. Events can trigger actions in your product and reveal the success of your user journey. They're a fundamental part of the user experience and the data that experience creates. For accurate testing, they need to be realistic.
Event pipelines generated at random inevitably create impossible time series. A quality solution allows you to define the relationships between events in your data by linking related fields and dictating the order in which they occur. For the highest degree of accuracy, an event generator should also be designed to mirror the distribution of the dates in your original dataset. It’s a combination of complex algorithms on the back-end and customizable rules on the front-end.
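As a rough illustration, here is a minimal Python sketch of rule-driven event generation. The event names, gap ranges, and anchor date are hypothetical stand-ins for rules and distributions you would derive from your original data, not part of any particular tool.

```python
import random
from datetime import datetime, timedelta

# Hypothetical rule set: events must occur in this order, with gaps (in hours)
# drawn from ranges estimated from the original dataset's distributions.
EVENT_RULES = [
    ("order_placed", None),          # anchor event
    ("order_shipped", (4, 48)),      # 4-48 hours after the previous event
    ("order_delivered", (24, 120)),  # 24-120 hours after the previous event
]

def generate_event_series(start: datetime) -> dict:
    """Generate one synthetic event series that can never run backwards in time."""
    series, current = {}, start
    for name, gap_hours in EVENT_RULES:
        if gap_hours is not None:
            low, high = gap_hours
            current = current + timedelta(hours=random.uniform(low, high))
        series[name] = current
    return series

print(generate_event_series(datetime(2023, 5, 1, 9, 30)))
```

Because every timestamp is derived from the previous one, a delivery can never precede a shipment, which is exactly the impossibility that purely random generation produces.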
2. Random categorical shuffling
Solution: Shuffling with defined ratios
A frequently used way to obfuscate real data is to shuffle categorical data, for example, the job titles of employees within an organization. The risk is that shuffling can wipe out the integrity of the data if the ratios, and their relationships to other fields within your dataset, aren't preserved. For example, imagine you're generating a synthetic workforce. Random generation might come up with 20 assistants for a single manager or 20 managers with a single assistant.
The ratios and relationships between categories make all the difference in whether the data you generate will be able to simulate real-world situations. A well-designed algorithm for categorical shuffling must take this into account to generate distributions of categorical data that mirror the reality in your original data.
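A minimal sketch of ratio-preserving sampling in Python, assuming the workforce categories and counts below are placeholders for your real distribution:

```python
import random
from collections import Counter

def shuffle_with_ratios(original_values, n):
    """Sample n synthetic values whose category ratios mirror the originals."""
    counts = Counter(original_values)
    total = sum(counts.values())
    categories = list(counts)
    weights = [counts[c] / total for c in categories]
    return random.choices(categories, weights=weights, k=n)

# Original workforce: far more assistants and engineers than managers.
workforce = ["manager"] * 5 + ["assistant"] * 50 + ["engineer"] * 45
print(Counter(shuffle_with_ratios(workforce, 1000)))
```

The synthetic output keeps roughly one manager per ten assistants instead of inventing a workforce that could never exist.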
3. Unmapped relationships
Solution: Column linking
The vast majority of data involves logical relationships that any human would immediately recognize, but random generators will not draw these relationships unless they are given rules to do so. When the underlying data does not reflect real-world relationships, your testing cannot reflect real-world usage. Not linking columns with a defined relationship during generation can lead to the formation of anti-patterns where you least expect them, both in your testing and in your product.
The tool for avoiding this hazard is the capability of linking as many columns as you need to ensure that dependencies are captured and the stories in your data ring true. So, for example, in a table of payroll data, bonuses become a function of salaries, which are tied to job titles partitioned by office location.
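To make the idea concrete, here is a hedged sketch of column linking. The salary bands and the 5-15% bonus rule are invented for illustration; the point is only that bonus is derived from salary, which is derived from the (title, office) partition, rather than each column being generated independently.

```python
import random

# Hypothetical salary bands per (job title, office) partition.
SALARY_BANDS = {
    ("engineer", "NYC"): (120_000, 180_000),
    ("engineer", "Austin"): (100_000, 150_000),
    ("manager", "NYC"): (150_000, 220_000),
}

def generate_payroll_row(title: str, office: str) -> dict:
    low, high = SALARY_BANDS[(title, office)]
    salary = random.randint(low, high)
    bonus = round(salary * random.uniform(0.05, 0.15))  # bonus linked to salary
    return {"title": title, "office": office, "salary": salary, "bonus": bonus}

print(generate_payroll_row("engineer", "NYC"))
```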
4. Inconsistent transformations
Solution: Input-to-output consistency
Even when anonymizing data, it’s often important to anonymize certain values in the same way throughout your dataset. Performing inconsistent data transformations can easily break your data to the point that it’s no longer usable. De-identifying data consistently is the pattern you need. It means that the same input will always map to the same output, throughout your database, allowing you to preserve the cardinality of a column, match duplicate data across databases, or fully anonymize a field and still use it in a join.
Perhaps you have a user database that contains a username both in a column and in a JSON blob, as well as another database that contains that user's website activity. Consistency enables you to safely anonymize the username but still have that identifier remain the same in all locations.
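A common way to get this consistency is keyed hashing. The sketch below uses Python's standard hmac module; the secret key is a placeholder that would in practice live in a secrets manager, and the "user_" prefix is purely cosmetic.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-keep-me-out-of-source-control"  # placeholder secret

def consistent_pseudonym(value: str) -> str:
    """Map the same input to the same output everywhere it appears."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}"

# The username is anonymized identically in the users table, the JSON blob,
# and the separate activity database, so joins and duplicate matching survive.
print(consistent_pseudonym("jdoe42"))
print(consistent_pseudonym("jdoe42"))  # same output every time
```

Because the mapping is deterministic but keyed, cardinality and join behavior are preserved without the original value being recoverable by anyone who lacks the key.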
5. Sensitive data leakage
Solution: Identify and flag PII/PHI
When you’re dealing with personally identifiable information (PII) or protected health information (PHI), your company has a legal obligation to maintain data privacy. The first step in de-identifying PII is identifying columns containing sensitive information and flagging them as needing protection throughout your database. An algorithm can do this quickly and at scale, but it must be carefully built. Imagine a column of birthdates named student_BD instead of birthdate or DOB. A de-identification system that only relies on column names to find PII may not flag that column as sensitive, and a data privacy anti-pattern is born.
An effective de-identification system uses machine learning to examine both column names and the data within those columns to determine what may or may not be PII. And once the PII is identified, the system must flag it in a way that ensures it cannot slip through unprotected.
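The article describes an ML-based detector; as a much simpler illustration of the same name-plus-values principle, here is a heuristic sketch. The regexes and the 80% threshold are arbitrary choices for the example, not a real detection model.

```python
import re

NAME_HINTS = re.compile(r"(birth|dob|_bd|ssn|email|phone)", re.IGNORECASE)
DATE_VALUE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
EMAIL_VALUE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_pii(column_name: str, sample_values: list) -> bool:
    """Flag a column as PII from its name OR its contents, not the name alone."""
    if NAME_HINTS.search(column_name):
        return True
    hits = sum(bool(DATE_VALUE.match(v) or EMAIL_VALUE.match(v)) for v in sample_values)
    return hits / max(len(sample_values), 1) > 0.8

# student_BD might be missed by a name-only scan, but its values give it away.
print(looks_like_pii("student_BD", ["2004-09-17", "2005-01-02", "2003-11-30"]))
```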
6. Unaccounted-for schema changes
Solution: Flagging schema changes and refreshing test data on demand
Schema changes are the only constant in modern data ecosystems. Failing to account for these changes, even seemingly minor ones, can lead to failures in your automated testing and, equally important, risky data leaks. In the best-case scenario, your test data may simply no longer work. In the worst, you've now got sensitive data in your lower environments.
The pattern you need here is a tool built into your data generation pipeline that alerts you to any schema changes as they come through. Better yet, it should require you to update your generation model before pulling new data into staging. An ideal system will also allow you to refresh your data on demand, multiple times a day, so your data is truly a mirror of production at all times, schema included.
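One lightweight way to surface schema drift is to diff schema snapshots between refreshes. A sketch, assuming schemas are represented as simple column-to-type mappings captured on each run:

```python
def diff_schema(previous: dict, current: dict) -> list:
    """Report added, removed, and retyped columns between two schema snapshots."""
    changes = []
    for col in current.keys() - previous.keys():
        changes.append(f"new column: {col} ({current[col]}) -- review before refreshing staging")
    for col in previous.keys() - current.keys():
        changes.append(f"dropped column: {col}")
    for col in current.keys() & previous.keys():
        if current[col] != previous[col]:
            changes.append(f"type change: {col} {previous[col]} -> {current[col]}")
    return changes

previous = {"id": "bigint", "email": "text"}
current = {"id": "bigint", "email": "text", "ssn": "text"}  # a risky new arrival
for change in diff_schema(previous, current):
    print(change)
```

A pipeline that refuses to refresh staging until new columns like the hypothetical ssn field are reviewed is what keeps the "best case" from quietly becoming the worst case.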
7. Outliers revealing TMI
Solution: Adding noise with differential privacy
All data has its outliers. The more precise your data anonymization methods are, the more likely they are to pull those outliers through—outliers that could be used to re-identify individuals if the anonymized data is combined with other available resources. When outliers aren't taken into consideration, they serve as bold clues to exactly what synthetic data is designed to protect.
The solution here is differential privacy, which adds noise to the data to create a more tempered pattern that obscures outliers. Differential privacy is a property that can be applied to data generation algorithms to guarantee a higher level of privacy in your output data. The more algorithms within your data generation process that can be made differentially private, the safer your outliers will be.
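For intuition, here is a minimal sketch of the Laplace mechanism, the textbook way to make a counting query differentially private. The epsilon and sensitivity values are illustrative only, and a production system would apply this kind of noise inside its generation algorithms rather than as a one-off script.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise calibrated to sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon)

# A rare category with only 2 members no longer stands out as exactly "2".
print(dp_count(2, epsilon=0.5))
```

The smaller the epsilon, the more noise is added and the harder it becomes to use an outlier as a re-identification clue.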
8. Insufficient integration
Solution: Cross-functional APIs and seamless integration into CI/CD pipelines
Given the nature of tech stacks today, integration should be a key feature of any system you put in place. Whether a data automation tool has an API shouldn’t even be a question you have to ask.
When it comes to building a data generation tool in-house, the process almost always involves writing scripts—scripts that are consistently prone to failure. Building a solution in-house isn’t just a matter of the initial lift; it also requires continuous maintenance to keep the system up and running.
Your data de-identification infrastructure should enable developers to move faster, not weigh them down with double work. A data mimicking tool that has an API, can connect to any data source, integrates seamlessly into your existing systems, and works with your data no matter how it changes over time equips your developers to do their best work. As your data needs evolve, so should the systems that support them.
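As an illustration of what API-driven integration might look like in a CI step, here is a hedged sketch. The endpoint, job ID, and environment variables are hypothetical and do not refer to any specific vendor's API; the point is that refreshing de-identified test data becomes one small, scriptable step before integration tests run.

```python
import os

import requests  # assumes the de-identification platform exposes a REST API

API_URL = os.environ["DATA_GEN_API_URL"]      # hypothetical endpoint, set by CI
API_TOKEN = os.environ["DATA_GEN_API_TOKEN"]  # injected by the CI secret store

def refresh_test_data(job_id: str) -> None:
    """Kick off a data generation job from a CI step, before integration tests run."""
    response = requests.post(
        f"{API_URL}/jobs/{job_id}/start",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    print(f"Refresh started: {response.json()}")

if __name__ == "__main__":
    refresh_test_data(os.environ.get("DATA_GEN_JOB_ID", "staging-refresh"))
```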
9. Vendor lock-in
Solution: Support for all data sources
Relational databases still dominate the world of big data, but NoSQL databases like MongoDB and Cassandra are gaining fast. Even PostgreSQL can handle NoSQL-style workloads and store JSON documents. The future is hybrid. Your data de-identification infrastructure needs to support that.
When it comes to building this infrastructure in-house, you may find yourself dedicating significant resources to creating a process that works with PostgreSQL, only to end up back at square one when your company adds Redshift to the stack. And if you're using Mongo, you'll need an entirely different approach. What's more, your data may live in separate database types, but that doesn't mean it isn't interrelated. Your solution not only has to work for multiple databases; it has to work across them as well.
Given today’s ever-expanding data ecosystems, it simply doesn’t make sense to build a system that only works with one database type. Your data generation solution should work as seamlessly with Postgres as it does with Redshift, Databricks, DB2, and MongoDB. Anything less is just a roadblock to data management. Seek out a tool that will work with your data wherever you keep it, now or in the future.
Top Takeaways
Building a high-quality data mimicking and de-identification solution that satisfies all of the above is a major investment of resources, and with both data privacy and data utility on the line, the stakes are incredibly high. The ultimate anti-pattern may very well be burdening your team with all of these requirements in-house, or settling for a solution that fails to deliver in any of these areas. The ultimate pattern? Use the above nine sections as a guide to building your own checklist, then seek out a proven platform that is ready to meet all of your team's fake data needs.