Boston Linux & Unix (BLU) Home | Calendar | Mail Lists | List Archives | Desktop SIG | Hardware Hacking SIG
Wiki | Flickr | PicasaWeb | Video | Maps & Directions | Installfests | Keysignings
Linux Cafe | Meeting Notes | Blog | Linux Links | Bling | About BLU

BLU Discuss list archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On 12/07/2010 07:29 AM, Dan Ritter wrote:
> On Tue, Dec 07, 2010 at 12:07:13AM -0500, Mark Woodward wrote:
>> The most important aspect of these scalability discussions, and one
>> which I frequently find lacking in the "NoSQL" camp is a critique of
>> just how many full scale transactions a SQL database can have.
> ...
>> average head positioning time.  This gives us a worst case average of
>> 77  arbitrary write procedures per second.
>> The actual iops is higher based on seek time these days, but lets use
>> the worst case scenario. (We could use a RAID system and multiply
>> performance)
>> In an ACID database configuration, assume 1/2 maximum, i.e 36 writes per
>> second on the database. That's 36 transactions (read/write) per second.
>> Assuming 1 transaction per page view, that amounts to about 90 million
>> page views a month (on average) as a sustainable number.
>> So, a good SQL database with no scaling tricks on a bog stock modern PC
>> based server will serve a web site as busy as all but the very most
>> popular sites on the web. Someone, please tell me, what are the NoSQL
>> guys going on about with regard to scalability?
> There are basically no sites that use a database and only want one record
> view/update per page view.
For the sake of argument, "one transaction" is short hand for one 
write/update operation. I'm assuming a databases page cache and/or 
memcache (or something like it) will cope with repeat session based 
data. But, that's a design detail, but your point is taken.
> Let's say we have, oh, a dating site. The login page may only need one DB
> query. How about the second page, where they will helpfully show you the
> names, photos and blurbs of ten or so eligible people, plus a check on
> how many messages you've received, plus another on how many you've sent,
> and... you can get into 50 queries pretty quickly. Add some billing and
> logging records and you could have a couple of updates in there, too.
Again, not to be pedantic, common queries where nothing is modified are 
a lot less expensive. Obviously Facebook is a completely different 
animal, i mean, jeez, they practically have the population of the earth 
on-line. That's not what I'm talking about, a "dating site" is an 
interesting case.

Lets break it down, how many members? How many members on-line at any 
time? What are the usage patters? Let's assume you use PHP with some 
sort of distributed session cache mechanism. I'm using a system I 
designed for DMN a few years back as the basis of my response.

A login, yes one transaction two disk I/O operations, retrieval of data 
and storage of change of state. The login also brings the user's 
information into the page cache of the database, so most of the 
subsequent select operations that deal with deal directly with user info 
will be hitting the page cache and avoiding the disk. The first page 
view will always be the hardest, that brings the most data into the 
various caching subsystems (php session cache, any memcache object 
definitions, and, of course, database disk page buffers.) Subsequent 
page views as the user uses the system, tend to hit the page cache and 
only alterations of data and retrieval of data not currently in the page 
cache force I/O operations.

The efficacy of the page cache is directly proportional to the number of 
members in total, and the number of members on-line and active at any 
one time. You aren't going to have a thousand users synchronizing their 
watches and pressing a button at the exact same time. So, unless your 
load forces the database to read disk pages for every single select 
operation because the variability of the requests exceeds your page 
cache, user actions (multiple disk I/Os per page) can be averaged out 
over a number of page views.
> Now, a clever system will aggregate the most often requested data into
> a single row, when it can... that helps with the second page, but not
> with profile pages or a mail-analogue system. And a clever system will
> cache as much frequently-requested data as possible.
Many databases do this transparently, but yes, systems like memcached 
can be very helpfull.
> But then there's Facebook, and anything even vaguely like it.  The obvious
> first thing to do is to optimize time at the cost of storage: store a
> copy of the new message with each of the recipients, so that when they
> read their update page, that takes one transaction instead of as many
> as they have friends. And pretty quickly you run out of disk space. So
> you decide that you have a read-mostly environment, and have many, many
> read-only slave databases to handle lookups, and most of those lookups
> aren't needing anything near the power of SQL -- really, you could get
> away with a key-value lookup, if only it were FAST -- and before you
> know it, you've re-invented MongoDB.
Facebook, like the video I posted, Facebook *has* to implement these 
things and for good reason, the scale at which they operate is HUGE. 
Unbelievably huge. But, it took them several years to get where they are 
and, like I was saying, all but the most busy web sites would well be 
handled by a standard database.

Yes, if you are facebook, you have a different set of problems. Hell, if 
you are L.L. Bean you have a different set of problems. However, 99% of 
the web sites out there will never have those problems. As the saying 
goes, "Let's hope we have to address these problems later."

> By way of contrast, I know there are sites which do real work on
> data people care about in sophisticated ways -- and they
> generally use full SQL databases, because that's what they need.
> SQL and ACID solve some tough problems. Key/value systems solve
> some speed issues, as do caches. There are lots of solutions
> because there are lots of problems.

My issue with the NoSQL mentality is the idea of "no." You can't have a 
conversation when it starts with "no." Then there is always the refrain: 
"Use the right tool for the job" which is a euphemism for limiting your 
options and usually the defense of the indefensible.  Well, more often 
than not you can't ever really know what the right tool is or the level 
of sophistication required at the beginning of a project, especially 
with AGILE type development. So, you start down the road myopically 
focusing on one aspect of the project "scaling" whilst completely 
ignoring vast parts of the project, like, what are you going to do with 
the data once you get it? What MIGHT you WANT to do with the data? Sure, 
you're covered for the next 5 years of (wishful thinking) scaling at the 
expense of accessing your data.

Most all of the NoSQL solutions are dreadfully lacking in any sort of 
capability to allow you to capitalize on your data beyond mere 
regurgitation to a web page. Your data will never become "information" 
and information is where the money comes from.

My general rule of thumb is, "it is better to have it and not need it, 
than to need it and not have it." So, I would say, the question 
shouldn't be "Do I need a SQL database" it should be "Are there areas 
where a SQL database doesn't work?"


BLU is a member of BostonUserGroups
BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.

Valid HTML 4.01! Valid CSS!

Boston Linux & Unix /