2008年5月16日 星期五

MySQL Camp Google Notes

After speaking at Yahoo earlier, I drove a few miles down the street to Google for MySQL Camp. I caught the last session of the day, by Googlers saying how they used MySQL internal to Google. (I assume for the Adwords application.) Here are the stream-of-talking notes I took. The most fascinating bits I took out of it is how they take a partitioning/sharding strategy similar (but notably different in some ways) to WordPress.com and that they use DNS to manage all load balancing, high availability, datacenter failover, etc. DNS is a pretty powerful building block.

Application level partitioning of databases. Put the identical schema on multiple databases, the application knows how to find the data across different databases. Split by account ID.

Primary, commodity RAID subsystem, maybe a battery cache

Replicas are on identical hardware.

Slim replicas are on cheap whitebox hardware, cheapest they can get. Horizontially scaled in shards.

Really simple extensions to SQL implemented in Python. Looks like MySQL command line, but it executes on all shards. No two-tier replication, everyone talks to the master.

Try to avoid multi-megabyte inserts and such. Funny things with truncated binlogs. (Hundreds a day.) Also bandwidth limitations. Some data is on all shards, but typically data is unique to the shards.

High availability and failover - require application to respect DNS. Use DNS to steer the load of the database systems. Given application can tolerate hard failures. Traffic light - red, yellow, green. Red, take everything off. Yellow, existing stuff stays, no new stuff. Python script that runs on cron. Balancing betwene read slaves. All hosts have all data.

Replicas are within 5-10 seconds in regular operations. Discourage all reads from master. Largely IO bound, nothing in memory. Usually innodb sizes are above 15 times the memory.

DB - shard number. Physical name, logical name. Example: db accounting, shards 0-5.

Read/write ratio. Order of magnitude, 10x more reads than writes, easily more than that.

Failover for masters. Speed up the planned failover. DB3 points at an IP, want to switch to a new IP. They have a way to do that with about 15 seconds of downtime, they modify the DNS TTL. Use priviledges as a barrier to write access. Create a copy, it’s writing a binlog, it’s read-only. Make the change, let DNS propogate, kill everyone without REPL, turn off writes on the old master, when everything is clean switch to the new master. Usually <>

Automated way to bring slave databases up. They all talk to the same master.

Unplanned failover, list of replicas in DNS. Don’t need a sync, just move the DNS and push permissions that allow people to write to the new master.

Ongoing production operations: complete backup 24 hours of all primaries. They use ibbackup. Taking the backup, restore it automatically to another place as a snapshot. It’s slow. It’s one command to set up a new replica. We’d have to lose a large number of machines for it to touch our data.

MySQL guy: They’re releasing to MyISAM++ will do everything InnoDB does, but way better. Readers don’t block writes, etc etc. Different from Falcon. Both will be out next year.

Schema changes. Fairly formal schema change process, goes through a committee that looks at the schema changes. They get checked into special source control data. Schema changes cause problems. Planned downtime once a month. That’s when they do something that will lock a big table for an hour. Commonly: add keys, add fields, creating tables, creating indicies. MySQL guy: adding index, add/del column live in future version.

Query snipers, they’re supposed to be killing queries that violate policy in one way or another. 3 names of CNAMES going from normal name, logical name, and physical name. Thousands and thousands of machines.

Production monitoring, PROCESSLIST, INNODB STATUS, etc. Small piece of glue that extract that data and put it into company-wide monitoring systems. Access controls, lots of users, lots of tables, very percise permissions. Modify MySQL tables directly, do FLUSH PRIVLEDGES. Something about Fedora Core 3 statically links the glibc fragments memory per connection, causes bad out of memory errors. Fixed in MySQL.com binaries. Roles in MySQL?

System growth and scaling. Adding new partitions is easy, set up new partitions and put new data in it. What’s not easy is migrating existing data. Create new shards relatively rarely. Shards are about the size that fit on 2U commodity hardware, I would estimate a few hundred gigs. They shard different types of data differently and separately, and scale each independently. They add more replicas to the mix, frequently. Always moving in and out of datacenters, constant churn.

Track problem queries. What takes the most resources, who are they, why are they using? Also track number of queries. Individual tables all grow at different rates. Historical data can be moved to different types of systems. Capacity planning and trending. How different machines perform vs each other. Everything grows at “google pace”. On-call rotation of 6 guys.

Data dumper
News reporter

Trying to decentralize support process, scale slaves etc, a single machine running cron jobs can get stuck. Bad clients that leave queries running. If a client disconnects, MySQL keeps the query running.

Diferent guy, talking about InnoDB: Mark Callaghan. Was at Oracle for 8 years. InnoDB is a joy, some of the best database internals code he has seen. In 5.1, page-level compression is moving along. OLTP? He’s IO bound, but he can easily trade-off performance for data size etc. Shrink DB by 2x. One feature he’d like, 5 has a lot of enhancements for managing integrity between X and binlog.

A few issues, InnoDB was written for single disk and single CPU systems. Was written in mid-90s. Doesn’t know about SMP problems, because they’re IO bound. IO is not much work to fix either, InnoDB uses background threads to do IO because it doesn’t trust Linux async IO. You get four threads. Each thread is doing sync IO. OLTP, you’re bound flushing dirty pages to disk. You want one or more write threads per disk. Concurrent sequential scans, innodb can issue prefetch requests, the buffering that the OS cache doesn’t help much. You want more than one pending IO at a time. How can you get more? It looks like all you have to do is modify one file, make some simple changes. Someone started, but never finished.

Simple perf benfits, was able to saturate PCI bus, 120MB/s with more write threads, in default config could only get 60 MB/s. Random IO benefit isn’t as much. Linux 2.6 has 4 IO schedulers, you can utilize the system a lot more. Noticed a specific scheduler that works the best? Uses software RAID on multi-disk (1-10 disks), deadline and participory are about the same. CFQ and O-op is much worse. RAID 0, 1 meg stripe size.