Home
| Calendar
| Mail Lists
| List Archives
| Desktop SIG
| Hardware Hacking SIG
Wiki | Flickr | PicasaWeb | Video | Maps & Directions | Installfests | Keysignings Linux Cafe | Meeting Notes | Linux Links | Bling | About BLU |
On 12/07/2010 07:29 AM, Dan Ritter wrote: > On Tue, Dec 07, 2010 at 12:07:13AM -0500, Mark Woodward wrote: > >> The most important aspect of these scalability discussions, and one >> which I frequently find lacking in the "NoSQL" camp is a critique of >> just how many full scale transactions a SQL database can have. >> > ... > > >> average head positioning time. This gives us a worst case average of >> 77 arbitrary write procedures per second. >> >> The actual iops is higher based on seek time these days, but lets use >> the worst case scenario. (We could use a RAID system and multiply >> performance) >> >> In an ACID database configuration, assume 1/2 maximum, i.e 36 writes per >> second on the database. That's 36 transactions (read/write) per second. >> Assuming 1 transaction per page view, that amounts to about 90 million >> page views a month (on average) as a sustainable number. >> >> So, a good SQL database with no scaling tricks on a bog stock modern PC >> based server will serve a web site as busy as all but the very most >> popular sites on the web. Someone, please tell me, what are the NoSQL >> guys going on about with regard to scalability? >> > There are basically no sites that use a database and only want one record > view/update per page view. > For the sake of argument, "one transaction" is short hand for one write/update operation. I'm assuming a databases page cache and/or memcache (or something like it) will cope with repeat session based data. But, that's a design detail, but your point is taken. > Let's say we have, oh, a dating site. The login page may only need one DB > query. How about the second page, where they will helpfully show you the > names, photos and blurbs of ten or so eligible people, plus a check on > how many messages you've received, plus another on how many you've sent, > and... you can get into 50 queries pretty quickly. Add some billing and > logging records and you could have a couple of updates in there, too. > Again, not to be pedantic, common queries where nothing is modified are a lot less expensive. Obviously Facebook is a completely different animal, i mean, jeez, they practically have the population of the earth on-line. That's not what I'm talking about, a "dating site" is an interesting case. Lets break it down, how many members? How many members on-line at any time? What are the usage patters? Let's assume you use PHP with some sort of distributed session cache mechanism. I'm using a system I designed for DMN a few years back as the basis of my response. A login, yes one transaction two disk I/O operations, retrieval of data and storage of change of state. The login also brings the user's information into the page cache of the database, so most of the subsequent select operations that deal with deal directly with user info will be hitting the page cache and avoiding the disk. The first page view will always be the hardest, that brings the most data into the various caching subsystems (php session cache, any memcache object definitions, and, of course, database disk page buffers.) Subsequent page views as the user uses the system, tend to hit the page cache and only alterations of data and retrieval of data not currently in the page cache force I/O operations. The efficacy of the page cache is directly proportional to the number of members in total, and the number of members on-line and active at any one time. You aren't going to have a thousand users synchronizing their watches and pressing a button at the exact same time. So, unless your load forces the database to read disk pages for every single select operation because the variability of the requests exceeds your page cache, user actions (multiple disk I/Os per page) can be averaged out over a number of page views. > Now, a clever system will aggregate the most often requested data into > a single row, when it can... that helps with the second page, but not > with profile pages or a mail-analogue system. And a clever system will > cache as much frequently-requested data as possible. > Many databases do this transparently, but yes, systems like memcached can be very helpfull. > But then there's Facebook, and anything even vaguely like it. The obvious > first thing to do is to optimize time at the cost of storage: store a > copy of the new message with each of the recipients, so that when they > read their update page, that takes one transaction instead of as many > as they have friends. And pretty quickly you run out of disk space. So > you decide that you have a read-mostly environment, and have many, many > read-only slave databases to handle lookups, and most of those lookups > aren't needing anything near the power of SQL -- really, you could get > away with a key-value lookup, if only it were FAST -- and before you > know it, you've re-invented MongoDB. > Facebook, like the video I posted, Facebook *has* to implement these things and for good reason, the scale at which they operate is HUGE. Unbelievably huge. But, it took them several years to get where they are and, like I was saying, all but the most busy web sites would well be handled by a standard database. Yes, if you are facebook, you have a different set of problems. Hell, if you are L.L. Bean you have a different set of problems. However, 99% of the web sites out there will never have those problems. As the saying goes, "Let's hope we have to address these problems later." > By way of contrast, I know there are sites which do real work on > data people care about in sophisticated ways -- and they > generally use full SQL databases, because that's what they need. > > SQL and ACID solve some tough problems. Key/value systems solve > some speed issues, as do caches. There are lots of solutions > because there are lots of problems. > My issue with the NoSQL mentality is the idea of "no." You can't have a conversation when it starts with "no." Then there is always the refrain: "Use the right tool for the job" which is a euphemism for limiting your options and usually the defense of the indefensible. Well, more often than not you can't ever really know what the right tool is or the level of sophistication required at the beginning of a project, especially with AGILE type development. So, you start down the road myopically focusing on one aspect of the project "scaling" whilst completely ignoring vast parts of the project, like, what are you going to do with the data once you get it? What MIGHT you WANT to do with the data? Sure, you're covered for the next 5 years of (wishful thinking) scaling at the expense of accessing your data. Most all of the NoSQL solutions are dreadfully lacking in any sort of capability to allow you to capitalize on your data beyond mere regurgitation to a web page. Your data will never become "information" and information is where the money comes from. My general rule of thumb is, "it is better to have it and not need it, than to need it and not have it." So, I would say, the question shouldn't be "Do I need a SQL database" it should be "Are there areas where a SQL database doesn't work?" Mark.
BLU is a member of BostonUserGroups | |
We also thank MIT for the use of their facilities. |