Deciding How to Store Billions of Rows Per Day
It’s pretty well known that Zynga has a metrics driven culture. It’s true. We meticulously measure most areas of our business, from user survey results, to engagement levels in our games, to the actions our players take during a game session. As a result, we generate (and analyze) tens of billions of rows of data daily. The question is, how do we store it?
From SQL to Big Data and Hadoop
When we first set out to build our corporate data infrastructure two years ago, the first thing we had to figure out was how to answer that question. It wasn’t an easy task. For the majority of analyses we needed to do, SQL as a query language was a great fit. Plus, a huge population of people knew SQL, including most of our product managers, and there was a vast market of graphical analysis tools that work with SQL databases for those who didn’t know SQL. Given our metrics-based DNA, having a solution that was accessible by everyone (not just engineers) was key to nurturing our culture.
But, conventional wisdom at the time was that SQL databases just couldn’t handle really large scale data. The NoSQL approaches were rapidly becoming popular, and for many people Big Data was becoming synonymous with Hadoop . Hadoop gives you a high level of scalability and flexibility, which were two things we really needed. So, we went with Hadoop, right? Actually, no. To understand why, let’s look at why several web companies that have Big Data have adopted Hadoop.
1. Scale, Baby, Scale: The first reason is scale. Many of these companies generate enormous amounts of web log data, which often exceeds 50TB of new data daily. SQL databases can’t feasibly process these enormous logs. Also, web logs have semi-structured data, making them a poor fit for processing with SQL even at a smaller scale.
2. Schema and Semantics: The second reason several web companies have adopted Hadoop is the issue with SQL’s schema. With SQL, you need to model the data schema up front, so you have a place to put the data you’re loading. But these Big Data web applications are highly dynamic, just like Zynga’s games, with changes happening weekly or even daily. It’s not feasible to update the SQL schema that frequently. Hadoop doesn’t have a schema, so you can just add any new data you want to the system and then define the data semantics later when you write your Hadoop / MapReduce programs to process the data
Scaling in SQL
But, we came back to the fact that for the types of analyses we do at Zynga, SQL is the most efficient language and SQL databases were the most efficient way to process these queries. Hadoop is more flexible and can be applied to a vast set of complex problems, but the downside of that is it’s less specifically optimized for running analytical queries. And while solutions like Hive and Pig give you a SQL-like front-end for analysis, they don’t make the MapReduce backend more efficient at processing queries.
So, we turned our attention to identifying an approach that would allow us to use a SQL engine, yet avoid the scale and static schema issues. At the heart of the scale issue is the size of the logs. So, rather than processing 50TB of semi-structured data to get at what for Zynga amounts to around 5TB of structured data, we decided to avoid using the logs as the data source, and instead created a set of functions that our games would call to write data directly into our database. Since we only track the data that the games explicitly want to track, and it’s all structured data with none of the excess cruft found in logs, the scale issue goes away. We just write the 5TB of structured information that we need, and this is a scale that high-end SQL databases can definitely handle.
There are also a few secondary advantages to this approach. Since the data is stored by calling a function, that means that it’s fully structured, so there’s no need to parse or transform the data before you can query it. Of course, the challenge with this approach is that it requires each of our games to be instrumented to call the data tracking functions, rather than inspecting the logs after the fact to figure out what happened. But the benefits for us have massively outweighed the costs.
Focusing on the schema issue, a very simple solution has worked for us. Rather than needing to change the schema weekly, we provide functions that let our games store data in something similar to generic “name-value” pairs in our database for those types of data that don’t naturally fit into existing tables in the schema. Of course, one of the challenges is that they’re generic name-value pairs and can represent anything the games want, so there aren’t any pre-existing semantics we can use to run prebuilt reports against that data. But, we provide reports that let the user specify which data is theirs and gives them the ability to slice and dice it, applying their own semantics to the data they inserted.
Addressing the scale and the schema issues meant that we could in fact use a high-end SQL database to meet our needs. In our case, we chose Vertica as our database vendor. Two years later, it seems our approach has worked, giving us an analytics platform that is efficient and accessible not just to our engineers, but to non-engineers in product management, game design, marketing, finance, and customer support.
If you’ve had to deal with massive amounts of data, what criteria did you use in making your technology selection? Please share any and all experiences in the comments below.