Fake data > big data

Recently, all HBO Max users got an interesting email. I didn't. I'm currently living off my sister-in-law's uncle's HBO account. I know. I'm a bum. Back to the email. No, it wasn't an announcement of a new season of Succession. Or a Game of Thrones season 8 redo. In fact, it wasn't from Marketing at all.

It was an Intern on the Engineering team.

This developer was testing a software system. When developers test, they usually do it using "made up" data. In HBO's case, they botched the test by using real data (every user's email address).

This is a harmless example of software testing gone wrong. It kickstarted a wave of people tweeting stories of mistakes they made as interns. This intern's mistake turned out to be free PR for HBO (someone promote the kid!).

Oh crap. This isn't making the point I'm trying to make, which is:

Developers aren't doing a good job of using high-quality fake/made-up data to test their software, causing greater issues than just, "oh, look at the cute mistake our intern made!"

Tonic (see witty response in tweet above) solves this. But before we go into their solution, let's first understand the problem.

When developers build product (by writing code), they need to test it to make sure it works. Why do they need to test it?


If you're learning some new move in a sport (think a new dribble move in basketball), you don't first test it out on court against your number 1 competitor. The stakes are too high, and you're risking the win. So you test your new move in practice first. You do this because you'll likely fail at first, and the consequences of failing are lower in practice.

But you have to make sure when you test the move, it's realistic. Practicing your move alone won't set you up for success. Your practice environment needs to mirror the competitive game environment. Have someone trying to steal the ball from you, or have a speaker set up blasting obnoxious fans. Spending time perfecting your move in a realistic practice environment means your execution will be flawless in a competitive game.

Developers are like athletes - they test new things in practice environments too!

Developers test new features (or products) in an environment called staging. Only employees have access to staging, not users of the product. When users like you and I use a product, that environment is called production. Developers make their staging environment an exact replica of the production environment. As mentioned before, making the practice environment (staging) a replica of the competitive game (production) ensures the best results (functioning software/fewer software bugs).

Practice = staging, live game = production

One key piece of creating a staging environment is data. Developers need access to production/customer data to see how users are using their product. This helps them make sure their new feature won't "break" if users try to use their product in a specific way. This is important because no user likes a buggy app that sucks.
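The staging/production split above can be sketched in a few lines. This is a toy illustration, not any real company's setup: the same code runs in both environments, and only the configuration (and the data behind it) differs. All names and URLs here are invented.

```python
# Toy sketch of environment-based configuration. The values are made up.
import os

CONFIG = {
    "production": {"db_url": "postgres://prod-db/app", "real_users": True},
    "staging":    {"db_url": "postgres://staging-db/app", "real_users": False},
}

# Default to the safe environment if nothing is configured.
env = os.environ.get("APP_ENV", "staging")
settings = CONFIG.get(env, CONFIG["staging"])

# Staging should mirror production in every way except who it can hurt:
# same schema, same code paths, but no real customers on the other end.
print(env, settings["db_url"])
```

The whole point is that the staging database should hold data that behaves like production data, without being production data.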


Let's say a developer is trying to build a mobile banking app. They're now trying to test if the credit card features work. Can you pay down your credit card? Can you search your credit card transactions? Can you dispute a chargeback? Let's test!

But there's one problem. Due to banking regulations, the developer can't access customer data. To test, the developer has to come up with the data themselves by guessing all the ways users with a credit card would use the app. They may not anticipate all the ways users actually use it. Since they can't test scenarios they're unaware of, they risk shipping a product with bugs that will annoy users, or worse.
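The "guess the data yourself" approach looks something like this minimal sketch. The function name and the records are invented for illustration, not from any real banking app:

```python
# A developer hand-writes fake transactions and tests against them.
def search_transactions(transactions, query):
    """Return transactions whose description contains the query text."""
    return [t for t in transactions if query.lower() in t["description"].lower()]

# Hand-made test data -- only the scenarios the developer thought of.
fake_transactions = [
    {"description": "COFFEE SHOP #42", "amount": 4.50},
    {"description": "GROCERY STORE", "amount": 82.19},
]

assert search_transactions(fake_transactions, "coffee") == [fake_transactions[0]]
# Real users type typos ("coffe"), emoji, or foreign characters -- scenarios
# this hand-made dataset never covers, so those bugs ship to production.
```

The test passes, but only because the developer wrote both the data and the expectations. Anything they didn't imagine goes untested.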

A recent example in the worse category happened when a man ordered an Uber. He wasn't just any man, he was a married man. He had logged into his wife's phone to order an Uber and signed out shortly after. The end. JUST KIDDING, you're smarter than that. You're thinking in your head, "what did this idiot do?"

It turns out he was a cheating motherfucker. After he logged out of his Uber account on his wife's phone, she continued to receive Uber notifications of his pickups and drop-offs, which led to her discovering he was having an affair. This stems from a bug where Uber didn't revoke notification tokens on devices after an account logged out. Multiple users have complained of the same issue, making me believe Uber didn't have high-quality data to test all the ways users used their app, leading to the critical bug. To top this story off, the guy who got caught cheating sued Uber for $47 million. What a piece of work this guy is.
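The class of bug described can be sketched like this. To be clear, this is a hypothetical reconstruction, not Uber's actual code; all the class and method names are invented:

```python
# A server keeps a set of push-notification tokens per account. The bug:
# logout forgets to remove the device's token, so the device keeps getting
# that account's notifications.
class PushRegistry:
    def __init__(self):
        self.tokens_by_account = {}  # account_id -> set of device tokens

    def login(self, account_id, device_token):
        self.tokens_by_account.setdefault(account_id, set()).add(device_token)

    def logout_buggy(self, account_id, device_token):
        pass  # bug: token never revoked

    def logout_fixed(self, account_id, device_token):
        self.tokens_by_account.get(account_id, set()).discard(device_token)

registry = PushRegistry()
registry.login("husband", "wifes_phone_token")

registry.logout_buggy("husband", "wifes_phone_token")
assert "wifes_phone_token" in registry.tokens_by_account["husband"]  # still notified!

registry.logout_fixed("husband", "wifes_phone_token")
assert "wifes_phone_token" not in registry.tokens_by_account["husband"]
```

A test suite built on realistic data (users logging in on multiple shared devices, then logging out) would catch `logout_buggy` immediately. A developer who only imagined one-device-per-user never would.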

So how should developers get access to high-quality data to do proper software testing?

We've seen what happens when companies come up with their own fake or made up data in a testing environment. They miss potential ways users will interact with their software, which causes bugs.

I bet you think you have the answer now. There are loads of data on how users use the product, generated in the production environment. We've all heard about "big data" and how it can help us make more informed decisions. It's simple: why aren't developers using the data generated in a production environment?

Besides the silly HBO incident, there are more serious risks to using customer data for software testing.

Regulation is making every company get serious about protecting consumer data and using fake data instead

In case you forgot, during the 2008 Financial Crisis, banks were kind of fucked. They had loads of mortgages they owned that were dropping in value, fast. They needed to sell them, faster. So they set up a call center so real estate buyers could call in and make offers on the properties the banks owned.

The bank didn't want call center employees negotiating on their behalf. Instead, they hired Palantir, a data analysis firm, to come up with a recommendation engine. When a real estate buyer made an offer, the employee would input that into the software and get one of two answers:

  1. "Yes, sell that shit!"
  2. "Are you shitting me? Counteroffer with this."

(It probably wasn't that vulgar, but I need to get my reps in if I'm ever going to write the movie version of Reddit users making Gamestop go to the moon).

Back to the story. The Palantir team faced a large problem when building the product. They needed data to test if the recommendation engine would work, free of bugs.

But they couldn't access the data, because they were at their office in Palo Alto. To access this highly sensitive customer data (people's credit scores, incomes, addresses), they needed to be at the bank's office (referred to as on-premise). This is due to certain banking regulations.

So 3 developers went on-premise to the bank's office to access the customer data. Using the customer data as their original source, they created a replica of it that maintained the same statistical & behavioral properties, without the identification properties. In human words, they kept important data points like the home's square feet and neighborhood, while removing more sensitive data - personally identifiable information (PII) like names and social security numbers. This is the process of creating anonymized (or de-identified) data.
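A minimal sketch of that anonymization step: keep the statistically useful fields, strip or bucket the identifying ones. The field names are illustrative, not from the actual schema:

```python
# Keep the behavioral/statistical fields, drop or coarsen the PII.
import hashlib

def anonymize(record):
    return {
        # kept as-is: useful for testing, low identification risk on their own
        "square_feet": record["square_feet"],
        "neighborhood": record["neighborhood"],
        # coarsened: a 50-point band instead of the exact credit score
        "credit_score_band": record["credit_score"] // 50 * 50,
        # PII replaced with a stable but meaningless identifier
        "id": hashlib.sha256(record["ssn"].encode()).hexdigest()[:8],
    }

customer = {"name": "Jane Doe", "ssn": "123-45-6789",
            "square_feet": 1450, "neighborhood": "Lakeview",
            "credit_score": 712}
print(anonymize(customer))
# The name and SSN are gone; square feet, neighborhood, and a coarse credit
# band survive, so the dataset still "behaves" like the original.
```

(Hashing is used here only to give each row a consistent fake ID; as the Netflix story later in this post shows, this kind of anonymization is far from bulletproof.)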

Now the Palantir team at the bank could send the anonymized dataset to their larger team in Palo Alto. The team was then able to build software and test it with accurate data, leading to fewer bugs and better functioning software!

But what's the problem with this solution? Building this solution took the Palantir team 2 weeks with 3 engineers. It also meant on-going maintenance. Why is that a problem you ask? Two reasons:

  1. Developers ain't cheap. We live in a world where the demand of developers far outpaces the supply. So that means you have to pay developers a lot and you don't have as many developers as you wish you did.

Which leads to...

  2. What do you want your expensive developers to spend their time on? On a task they've never done, that you could instead outsource to experts who focus on the problem everyday? What else could the developer be doing with their finite time to provide greater value to the company? Focus on what you're great at, and outsource the rest. (I wrote more on the concept of developer productivity here).

This gets to the crux of the question businesses ask themselves when buying from vendors. Do we build (the solution ourselves) or do we buy (the solution from the vendor)?

Ian Coe founded Tonic because he believes businesses want to buy this solution, not build it.

Tonic helps businesses generate fake datasets that look and act like your actual customer data, in a few clicks. This allows developers to be more productive, without worrying about skirting security procedures.

Tonic is no longer compelling in just highly-regulated industries like banking and healthcare. I'm sure you've seen the headlines about how we need to break up big tech. I hate headlines. But the truth is, governments are coming after all technology companies.

Laws like GDPR and CCPA force companies to implement stricter controls on consumer data. How companies handle it and which employees can access it is tightening. This exponentially increases the market Tonic is addressing. It's grown from heavily-regulated industries to any moderately-sized digital company.

Security is making companies get more strict about accessing consumer data, but at what cost?

The shift to work from home has introduced security risks. Data is no longer only accessed on a company's premises. It now has to be accessed from an individual's home through an internet connection (the cloud).

People accessing customer data from their homes means less secure environments. They are potential security breaches waiting to happen. With 2020 being another record year for data breaches, mitigating them is more important than ever.

Companies face a critical question as they look to create more secure environments: are we willing to accept reduced productivity from our developers for stronger security? Because too often, stronger security means more hoops for someone to jump through to complete a task.

It reminds me of the airport. ~99.99% of us don't plan to cause harm on a plane. We just want the toddlers to be peaceful and the seat next to us to be empty. But we all have to go through the same security checks, adding significant time to get to our destination. On the other hand, it makes sense for the TSA. If one wrong person gets through, it would cause significant damage to society.

So, is there a way to implement proper security controls, without reducing productivity?

For most companies today, not really. Companies approach it one of three ways:

  • Productive & Secure: Only have a small team of trusted developers who have access to all the data. This allows specific developers to be productive, but creates a bottleneck as tasks that need data access grows.
  • Productive & Scalable: Give open access to all developers. But this creates security risks and could be against the law.
  • Secure & Scalable: These are clunky legacy solutions. They're difficult to use so they limit developer productivity. It's important to have your expensive developers working on the right tasks.

The problem with each approach above is that it satisfies two important characteristics but misses a critical third, whether that's being productive, secure, or scalable. That's what makes Tonic intriguing for companies. It satisfies all 3 characteristics:

  • ✅  Scalable: Tonic lets all developers access the data.
  • ✅  Productive: Tonic makes it easy to use.
  • ✅  Secure: Tonic creates an anonymized version of data.

Security is important, but the cost is productivity. How do we ensure security while still being productive?

How does Tonic actually generate datasets?

Recent advancements in data generation allow Tonic to combine many approaches to create datasets that are:

  • privacy-preserving, and
  • useful

The first approach is anonymization

We've discussed this. It's where you change the data in fields that can identify someone (personally identifiable information, e.g. last name or social security number). The dataset still looks and feels like the original source from a behavioral, statistical, and structural standpoint, meaning the data will have high utility (be useful).

But there's one issue when you use data anonymization. It can be de-anonymized by a malicious actor. And there are countless examples of it.

In 2006, Netflix was tired of hearing how the recommendation engine sucked (I think it still sucks, but I also just rewatched The Wire for the fourth time so maybe the problem is with me not trying something new).

So Netflix said to the world...

"OK you jerks. Here's a dataset of 100 million movie ratings from 500,000 users. If you improve recommendations by 10%, you get $1 million."

Netflix anonymized the dataset to maintain user privacy. But two pesky researchers from the University of Texas said, "hey, you know you guys did a shitty job anonymizing this, right?" The researchers cross-referenced the Netflix dataset with IMDB ratings people gave under their own names. It turns out, after filtering out the top 100 movies, our watching habits are quite unique. Once the researchers identified someone, they could determine a lot of personal characteristics about that individual:

First, we can immediately find his political orientation based on his strong opinions about “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11.” Strong guesses about his religious views can be made based on his ratings on “Jesus of Nazareth” and “The Gospel of John”. He did not like “Super Size Me” at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, “Bent” and “Queer as folk” were rated one star out of five. He is a cultish follower of “Mystery Science Theater 3000”. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.

It turns out Netflix may have exposed the political affiliation, religious orientation, BMI, sexual orientation, and who knows what else on hundreds of thousands of users!

While anonymization provides a high-quality dataset to test software, it's a disaster for user privacy if it ends up in the wrong hands.
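The core of the researchers' linkage attack can be sketched in miniature. The real attack used fuzzy matching over 100 million ratings; this toy version uses exact matches over invented data, and every name and title pairing here is made up:

```python
# "Anonymized" ratings leaked from one service, keyed by a meaningless ID.
netflix_anon = {
    "user_8841": {"Bent": 1, "Fahrenheit 9/11": 5, "MST3K": 5},
}

# Public ratings posted elsewhere under real names (IMDB-style).
imdb_public = {
    "john.smith": {"Bent": 1, "Fahrenheit 9/11": 5, "MST3K": 5},
    "jane.doe":   {"Bent": 4, "Fahrenheit 9/11": 2},
}

def reidentify(anon_ratings, public_ratings):
    """Return the public profile that best overlaps the anonymous one."""
    def overlap(a, b):
        return sum(1 for title, stars in a.items() if b.get(title) == stars)
    return max(public_ratings, key=lambda name: overlap(anon_ratings, public_ratings[name]))

print(reidentify(netflix_anon["user_8841"], imdb_public))  # -> "john.smith"
```

Once the "anonymous" ID maps to a name, every other rating attached to that ID, the ones never posted publicly, is exposed too.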

Synthesis is the yin to Anonymization's yang

What if we just came up with a new dataset from scratch? The ultimate form of privacy-preservation. The dataset would be of no value to hackers because all the data would be fake! It doesn't tell you anything about anyone that's actually real! That's what synthesis does. It's gotten a lot better due to the underlying technology, machine learning. Machine learning is a subset of a term you're probably more familiar with, artificial intelligence.

Before, we would codify behavior with hand-written rules: "If this, then do that". But now we live in a world with:

  1. Computers that cost a fraction of what they once did
  2. Access to humongous datasets
  3. More sophisticated algorithms that mimic how humans learn (neural networks)

This has enabled machine learning researchers to go from a rules-based approach to a data-driven one. Throw heaps of data at tons of computers that are low-cost and run neural network algorithms.
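The data-driven idea behind synthesis can be shown without any neural network at all. This is a deliberately tiny sketch, learn the shape of the real data, then sample brand-new values from it, using per-column statistics as a stand-in for a trained model; the "production" ages are invented:

```python
# Fit simple statistics to a real column, then sample fresh fake values.
import random
import statistics

real_ages = [34, 29, 41, 38, 52, 27, 45]  # pretend: production data
mu, sigma = statistics.mean(real_ages), statistics.stdev(real_ages)

random.seed(0)  # deterministic for the example
synthetic_ages = [round(random.gauss(mu, sigma)) for _ in range(5)]

# Every value is fake -- none belongs to a real user -- but the synthetic
# column has roughly the same center and spread as the original.
print(synthetic_ages)
```

Real synthesis tools replace the per-column statistics with models that also learn the relationships *between* columns, which is exactly the part that makes them complex to use.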

An example I saw of this recently was hilarious. There was this VC on Twitter who was complaining that Instagram sucks these days. His explore page is full of half-naked women shaking their booties. He said Instagram was appealing to our most primal instincts by promoting sexual content. Someone else responded, "uhhhhh, Instagram is just serving you what you want. My explore page is biology and space posts!"

This is an example of machine learning at work. You generate data points by stopping, watching, and liking certain Instagram posts. Instagram feeds that data into their machine learning algorithms. Instagram predicts you'll like similar content to what you've engaged with in the past and shows you that, even if you're a sick bastard.

Trying to expose Instagram actually exposes you.

BTW - if it wasn't clear yet, I love basketball and Instagram knows it.

So synthesis is great to protect user privacy by creating fake datasets, but what sucks about it?

While machine learning can generate fantastic results, it's still complex to use. The dataset you're trying to feed into it can be quite complex too. You could spend more time designing and refining the algorithm than actually generating fake data for software testing, which makes synthesis a little impractical to use on its own.

And that's why Tonic bred anonymization & synthesis to have a baby, and called it mimicking

Data mimicking takes the pros of each in a way that negates their cons. This means you can quickly generate realistic test data, without compromising user privacy.
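A hedged sketch of the mimicking idea, keep the real table's structure and low-risk columns (anonymization's strength), but synthesize fresh values for the sensitive ones (synthesis's strength). This is my illustration of the concept, not Tonic's actual algorithm, and every column name and value is invented:

```python
import random
import statistics

random.seed(1)

production_rows = [
    {"name": "Ana",  "plan": "pro",  "monthly_spend": 120.0},
    {"name": "Ben",  "plan": "free", "monthly_spend": 0.0},
    {"name": "Cleo", "plan": "pro",  "monthly_spend": 95.0},
]

# Learn the shape of the sensitive column from production...
spends = [r["monthly_spend"] for r in production_rows]
mu, sigma = statistics.mean(spends), statistics.pstdev(spends)

def mimic(row):
    return {
        "name": random.choice(["User A", "User B", "User C"]),       # fully fake (PII)
        "plan": row["plan"],                                          # kept: low-risk, high-utility
        "monthly_spend": max(0.0, round(random.gauss(mu, sigma), 2)), # synthesized
    }

fake_rows = [mimic(r) for r in production_rows]
# Same schema, same row count, similar spend distribution -- but no real
# person appears anywhere in the output.
print(fake_rows)
```

The output is still shaped like production, so tests stay realistic, while the values a hacker would want are fabricated.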

Data mimicking is the best of both worlds.

I want to wrap up this post with a few specific examples of how Tonic is helping their customers.

Everlywell, a healthcare company worried about (actual) HIPAA violations

Let me begin by saying whenever my girlfriend asks me if I worked out today (and I didn't), I respond informing her that it's a HIPAA violation. It's not a HIPAA violation.

HIPAA is actually a law that states health organizations must protect sensitive patient health information from being disclosed without the patient's consent or knowledge. Everlywell is in the healthcare space, and we know highly-regulated industries (like healthcare) need Tonic's help to test without accessing customer data. So Everlywell turned to Tonic.

Everlywell sells home-based lab tests. To get a sense of how fast they grew during COVID, they went from 80 employees to 1,000+ in a year!

To keep innovating for customers, Everlywell is always testing new features. This led to the creation of 50 to 100 sandbox (similar to staging) environments. Due to HIPAA regulations, they needed to avoid using customer data and instead come up with fake data for every environment. Each environment took up to an hour to get high-quality fake data in it. Tonic transformed that process to an automated one that took less than 5 minutes to complete.

Implementing Tonic had two benefits:

  • Without Tonic, 3 engineers would be assigned to this task. Now the company can save that money, or have those engineers work on other projects that are critical for Everlywell to launch.
  • Everlywell is now launching new features 3-5 times per day vs 1 per day before Tonic. Being able to ship faster (reduced time to market) allows Everlywell to improve the experience for their customers, faster. This helps them drive customer loyalty and revenue growth!

Flexport, a global corporation in a "super-boring", but not heavily-regulated industry

Remember how we said new regulations and data standards would make Tonic compelling not just to heavily-regulated industries, but any digital company? Tonic's work with Flexport is an example of that.

Flexport is helping digitize a paper-based industry - global trade. Being a global business means they have customer data located all across the world. These are potential security & privacy risks. So Flexport worked with other vendors that offered the security they needed to protect their consumer data. But no vendor provided the ease of use to enable developers to get shit done.

When Flexport tried to solve this in-house, they realized the fake data they created would become obsolete within a week. Their current customer data was always changing, so they would have to create a new fake dataset each time they wanted to test. While both solutions Flexport tried helped them get the job done, the limiting factor was developer productivity.

Tonic helped automate the process to improve developer productivity. Now, anytime a developer needs fake data to test with, it's a few clicks away. The output is data representative of production databases, without any sensitive, personal information attached.

Tonic helped Flexport in two ways:

  1. By safeguarding customer data, they stayed in compliance with global privacy obligations and achieved SOC 2 compliance. For those that don't know, SOC 2 is a critical certification for any vendor that handles customer data. It's table stakes for closing deals with customers.
  2. It made 2 departments that are often at odds happy - the security team and the engineering team. Security teams worry developers are taking unnecessary risks. Developers think security teams get in the way of shipping code/new features fast. Tonic assures security without crippling developer productivity.

And that last sentence is the simplest way I can describe Tonic, so I'll say it again. Tonic assures security, without crippling developer productivity. Go get you some fake data.

You made it!

Thank you for reading. I know it was a long one.

If you enjoyed it, I'd appreciate it if you liked or retweeted the thread on Twitter (linked below). I really put my heart into this one:

Thanks to Rishika for editing and giving me the confidence to press send. Thanks to Shaan for sharing this post and helping it blow up!