zBase – A high-performance, elastic, distributed key-value store
zBase is a distributed key-value store that is the storage backbone for almost all Zynga games. It supports the memcached protocol on the front-end and is designed for persisting data to disk along with in-memory data replication for high availability. zBase was forked from membase in early Jan 2012 and, since then, the storage services team has added a host of useful features and made many performance enhancements and has been successfully been deployed on more than 6000 zBase nodes in production. We stood on the shoulder of giants and built some really cool stuff and its time that we gave back to the OpenSource community. The remainder of this post will be spent on describing some of the significant features that we have added that make ZBase what it is today.
Improved Memory Management and Multi-Disk Support
In the lifecycle of a typical Zynga game, the total install base continues to increase, while the Daily Active Users (DAU), which is a percentage of the total installs after a certain point, stabilizes and remains constant. To meet the ever-increasing demand of storage without incurring huge costs, it is imperative to support higher data density on each zBase server. This implies that while the total data for the game increases with time, the number of servers needed to serve the data need should not increase – provided the active user set remains constant. The primary objective here is to pack more data per server than can be fitted into memory without any visible performance degradation to the client.
Some of the enhancements that we made in order to achieve the goal of increased Disk greater than memory (DGM) ratios are as follows:
Improved Memory Management
The original membase memory management model was based on an internal counter called mem_used . This counter was incremented and decremented every time memory was allocated or deallocated. However, this counter did not reflect the true state of the system since tcmalloc , the allocator that is currently used by zBase, never returns free memory to the system. In many situations, the memory would get fragmented and mem_used counter would not be the correct indicator of the process Resident Set Size (RSS). Since a lot of operations in zBase were based on the value of mem_used , this posed a problem where mem_used looked normal but the RSS would be out of bounds. This resulted in either the system running into swap or killing the process, especially during periods of such heavy pressure.
The solution to this problem included two changes:
- Using RSS to track memory usage : Since RSS truly reflects the amount of system memory zBase really uses, the memory tracking for every operation was changed to use this value instead of the “mem_used” counter. When the RSS crosses a critical threshold, zBase does not allow any front-end operations until RSS falls below a certain value.
- Using jemalloc as the memory allocator : With tcmalloc as the allocator, there was no control over making the allocator release free memory to the system. Changing the allocator to jemalloc provided a lot of control over releasing unused memory. Every time a cache flush is issued, all the free pages under the thread’s cache and the global arena are released back to the system. The results of this change were very noticeable and memory utilization was much more predictable.
Improved Eviction Framework
Membase uses a random eviction model to decide which keys to evict in order to free up memory for future allocations. Whenever the allocated memory exceeds a certain high-watermark, usually 70% of the max memory size, membase will randomly evict non-dirty keys until the mem-used drops to below 60% of the total memory. The random eviction model wasn’t really suitable for holding the active set in memory, particularly if the total data size exceeded the memory size by a large amount.
We refactored the code to allow the eviction algorithm to be pluggable, so that programmers could quickly implement their own models. We have implemented the LRU eviction model since because we found that worked well for our use-cases. With the right of sort of capacity planning, it became possible for us to achieve good DGM ratios and also ensure that the hot dataset belonging to the active users was always retained in memory.
The LRU eviction algorithm that we have implemented works as follows:
- Whenever the process RSS crosses a predefined and configurable headroom mark, the LRU pager will start evicting non-dirty and least recently used items from memory until the process RSS falls below the headroom mark.
- If the rate of eviction cannot keep up with the increase in the process RSS, then once the RSS reaches the max memory size, no more front-end operations are permitted until the LRU pager manages to create space for new allocations.
- This strategy ensures that process RSS will never exceed the max memory size and the LRU pager will continue to evict until the RSS falls below the headroom mark.
Multi-KVStore and Support for Multiple Disks
Ep-engine or Eventually Persistence Engine is the core of zBase architecture providing persistence capability to memcached front-end. The KVStore abstraction in ep-engine is responsible for laying out data on persistent storage and providing interfaces to access/manipulate the data. Prior to this change, the ep-engine used single instance of KVStore and one execution thread for I/O, which limited the I/O performance. To handle increased I/O load resulting from higher data density, ep-engine was enhanced to use multiple KVStores with each KVStore getting its own storage and I/O thread.
When zBase server is started, it reads a config file in json format to get the KVStore configuration information. The configuration specifies number of KVStores and the properties of each KVStore. This configuration is fixed for the lifetime of server.
While starting up, zBase either creates the databases for each KVStore (if it does not exist) or reads the existing database files to populate in-memory hash table. It also creates I/O threads (called flusher) and corresponding I/O queues (called flushList ) to handle I/O operations on each KVStore. FlushList uses a highly optimized lockless data structure that allows multiple producers and single consumer to work concurrently. Therefore, multiple front-end threads can queue mutations to these queues without additional locking overhead. Each flusher thread picks mutations from its queue and performs the I/O independent of the others.
This design allows I/O performance to scale with storage. For example, if there are “n” KVStores where each is laid out on a separate independent disk, the server will be able to do I/O “n” times faster than a single disk + KVStore combination.
Incremental Backup and Restore (IBR)
zBase has the ability to create incremental backups at a configurable frequency. The ep-engine core achieves this by creating a checkpoint at regular intervals. All mutations that come in are referenced in the checkpoint that is currently open. At a preconfigured time period, this open checkpoint is closed and mutations belonging to this now closed checkpoint can be backed-up. The incremental backup file that is created is basically a sqlite database which is indexed on the tuple (key, vbucket).
The IBR system in zBase consists of the following modules and processes:
Storage Server/Backup Nodes:
These are specialized nodes that contain multiple independent disk heads for storing backup data from vbuckets. Incremental backups from vbuckets need to be coalesced from time to time in order create fully daily and weekly snapshots. This backup coalescing workload is disk I/O heavy and therefore having multiple disk heads helps in parallelizing and speeding up such operations.
The backup daemon runs on each of the storage server nodes and is responsible for creating incremental backups at preconfigured intervals. The backup daemon spawns a separate process for each disk in a storage server. Each backup process gets a list of vbuckets belonging to the disk that it is operating on from the diskmapper and sequentially creates incremental backups for those vbuckets. In this way, multiple vbuckets can be backed up simultaneously.
Full Restore and blob-restore daemon
This daemon is installed on each storage server in the system and listens for full restore and blob restore requests. A key belonging to a vbucket can be restored to a state sometime in the past as long as the backup data for that duration is present on the storage servers.
A set of vbuckets can also be restored to their current state by running the restore client on the zBase server on which the vbuckets are to be restored. The Restore daemon looks up all backup data belonging to those vbuckets and transmits the backup files to the restore client. The restore client replays the mutations contained in the backup files to the zBase server in reverse chronological order.
This component is responsible for creating a mapping between vbuckets and disks on the storage servers in the pool. The disk-mapper creates two partitions on each disk in each of the storage server, i.e. a primary and a secondary partition. The disk-mapper creates a mapping that adheres to the following set of rules:
- Each vbucket is mapped to exactly one primary partition and one secondary partition. This ensures that we have two copies of the backup data for each vbucket on the storage server pool.
- The primary and secondary copies of a particular vbucket reside on two distinct storage server nodes. zTorrent, which is internally developed torrent system is used for keeping the data between the primary and secondary copies in sync with each other.
- 10-20% of the disks in the system are designated as spares to be used in event of disk failures.
If a disk on the storage server fails, the disk-mapper allocates a spare disk to be used in place of the failed disk.
Each storage server nodes contains two coalescing daemons. A daily coalescer daemon merges incremental backups from the last 24 hours and creates a daily backup for each vbucket. The data contained in the set of daily backups for a particular day correspond to the users who have played the game on that day. These daily backups are useful for restoring a user’s key to a state that existed on a particular date and time.
The master coalescer daemon runs once a week and creates a full snapshots of the each vbucket in the system. The master coalescer job uses all the incremental backups along with last 7 – daily and the last full master backup to create the most recent full master backup of that vbucket.
In order to protect a user’s data against data corruption, ZBase provides and optional data-integrity features that enables checksum verification the data blobs as it passes through each layer in the zBase storage stack. The following diagram shows the various layers within the zBase stack. Data can get corrupted at any of these layers and, therefore, the checksum of the data is computed and verified at each of these layers.
If at any point the checksums contained in the metadata do not match the checksum computed on the data, an error is returned. In this way, it becomes possible to detect data corruption errors and identify at the layer at which the corruption has occurred.
The original zBase protocol based on the memcached protocol is defined as follows:
Header:set <key> <flags> <expiry> <value_length> [<CAS>] [noreply/returnCAS]
Header: <key> <flags> <value_length>
The checksum for the data have been included as part of the header so the modified zBase looks like the following:
New set request header
set <key> <flags> <expiry> <value_length> <checksum> [<CAS>] [noreply/returnCAS]
New get response header
<key> <flags> <value_length> <checksum>
Data checksums can be enabled by the application by setting a flag in the zBase protocol header.
zBase 2.0 has the ability to dynamically scale the size of the cluster without the need for quiescing the system. To achieve this, zBase partitions the entire key-space into smaller logical units known as vbuckets. Each vbucket can be independently mapped to any server in the zBase deployment pool. zBase internally maps a vbucket to an individual disk. When a disk fails, the list of vbuckets that were mapped to that failed disk need to be made available from another spare disk in the pool. Similarly, a server failure is treated a multiple disk failure and the vbuckets belonging to those failed disks are brought online elsewhere in the system.
Multiple replicas of a vBucket can exist in the system; However, at any given time, there will only be one active copy of a particular vBucket. Client read and writes requests are served from this active vbucket. Similar to the original membase design, vbuckets in the system can be replicated, transferred or brought offline. In zBase 2.0, we have re-architected the design of the cluster orchestrator with the intent of allowing the zBase cluster to scale beyond hundreds of nodes. zBase 2.0 cluster achieves this goal by
- Limiting the frequency of control messages exchanged by between individual components in the cluster
- Delegation of keeping the cluster consistent. While the cluster orchestrator maintains the overall state of the system, each zBase node in the cluster is responsible for ensuring that active vbuckets are replicated to their corresponding replicas.
zBase 2.0 cluster consists of the following components:
VbucketServer (VBS): The vbucketServer is responsible for publishing the mapping between vbuckets and servers known as the vbucketMap and making it available to the vbucket-aware clients in the deployment pool. The vBucketMap is updated whenever there is a cluster reconfiguration due to a disk, node failure or cluster resize. vBucket aware clients maintain persistent connections to the VBS, which publishes a new vbucketMap each time there is a cluster reconfiguration.
Vbucket-aware client . A vbucket aware client such as moxi is needed to pull the vbucketMap from the vbucketServer and redirect client requests to the appropriate server in the cluster. In this model, clients who initiate memcached requests are not aware of the server locations and only communicate with Moxi, which in turn contacts the appropriate zBase server on the behalf of the client to service the request.
vBucketMigratorAgent : This component is installed on each of the zBase servers. The vBucketMigratorAgent manages vbucket migrator connections to replicate the vbuckets across the cluster. Every vbucket mapped to a disk on a server is replicated to another disk on a different server within the pool. The vBucketMigratorAgent is the component that will manage these replicated vbuckets.
vBucket Enabled Ep-engine. Ep-engine component of zBase will be vbucket aware. Ep-engine will be responsible for mapping vbuckets configured on a single server to the individual disks contained within the server.
zBase, since it forked from membase, has undergone a massive transformation in terms of both stability of the product and performance. The primary objective of the storage services team was to reduce operational costs by decreasing the total number of zBase nodes required by a game deployment and also reduce the manual effort and pain associated with resharding process. A reshard process required the game team to manually bring up a new pool and transfer all the keys from the old pool to the new. Once all the keys were transferred to the new pool, user connections are switched to hit the new pool.
The improvements such as LRU eviction policy and multi-KV stores allowed zBase to dramatically increase the amount of data that could be stored and served from a single zBase node. With the introduction of zBase Incremental backup and restore (IBR), we were able to reduce the bandwidth and storage space requirements for Zynga’s Business Continuity Policy (BCP) by a factor of 80%. Finally, with zBase 2.0 we have re-architected the entire clustering design to ensure that pool resizing can be carried out painlessly and cluster sizes can scale up to thousands of nodes.
Additional Links and Resources
1. zBase source code . http://github.com/zbase
2. zBase documentation . https://github.com/zbase/documentation
3. JeMalloc: http://www.canonware.com/jemalloc/