I'm forwarding a portion of Moshe Bar's most recent newsletter, because I thought some people might find this interesting. If you don't know who Moshe is, you might like to check out his web site, maybe subscribe to his newsletter. He's someone who's very deep into Linux (he just wrote a book about Linux kernel internals) who likes to write a lot. He also seems to thrive without sleep. I don't always know what he's talking about, but I pick up a little here and there. Anyway, enough of my blather...

<snip>

Moshe Bar's Opinion - http://www.moelabs.com

Dear Readers,

There are not many joys left to us these days. Becoming a parent, passing an important exam, getting married, and getting a much sought-after job are among them. One more is holding your newly published book in your hands for the first time. That, namely, is what happened to me yesterday. After countless hours, days, and months of hard work, of reading, re-reading, checking, re-checking, and modifying word after word (two hundred thousand of them, mind you), you think it is all a hopeless mass of words with no relevant significance to anyone but the author. Therefore, the author never believes it will ever see print. But, boy, did it see print! Big time!

For the newcomers to this list, I am talking about my newly published book "Linux Internals" from McGraw-Hill, available in all fine bookstores and in all the important online book catalogues. The book talks about the Linux kernel and how it works. It explains what an operating system does and why Linux does it better than most. It shows what a program is and how the OS makes many of them run at the same time without ever getting confused. It explains how virtual memory works in a real OS like UNIX (of which Linux is a derivative) and how data is stored on disks and retrieved efficiently and securely. All this may sound very superficial. Suffice it to say that in one of the chapters, for instance, I take an interrupt apart as it enters the Linux kernel and show step by step exactly what happens in the serving of it. Also, in each chapter I show how to make modifications to your favorite kernel, such as giving it CPU affinity or implementing very advanced virtual paging algorithms. If you always wanted to know how an OS, and Linux in particular, works, but were too afraid to ask, then go grab this book and read it. You'll understand after.

But why am I telling you all this? Because I am one to keep promises. If you remember, I promised I would get you, the savvy readers of this mailing list, a way to order my book (and possibly the next one, coming this fall) at a big discount. The fine folks at my publisher, McGraw-Hill, agreed to my proposal and furthermore went on to create a web site for you guys and gals to order my book from at a discount of... yes... ladies and gentlemen, 20%. That is 20 as in twice 10% off a book being sold in the bookstore at US$ 49.99. So, for a very reasonable investment of US$ 39.99 you will get access to everything there is to know about the Linux kernel, including the upcoming version 2.4. The foreword is by Ingo Molnar, number 3 in the Linux developers' food chain. Enough said; if you want to get my book at a good discount, you can do so from this special site at McGraw-Hill: http://www.mhorder.com/linuxinternals.html

Let's now get back to the usual issues of this mailing list. These days I am deeply infatuated with Linux clusters. Out of necessity, some might argue. After all, I am writing a book on the subject.
But it is more than that. There is definitely something fascinating about creating a supercomputer out of a bunch of cheap PCs. You can create a supercomputer with clustered Linux boxes, each costing around two to three thousand US ducats.

A MOSIX cluster of Linux machines is what is called a Single System Image (SSI) cluster. I have variously said and written that you don't have a real cluster until you have an SSI cluster. The first SSI clusters were the IBM SysPlex cluster and the DEC cluster. In a DEC cluster you could telnet to cluster.address and you would end up on one of the nodes of the cluster. You, the user, would not care on which node you ran, and any program you launched would be executed on the node which could best serve the program's needs. Well, MOSIX is exactly the same. Only, it runs on our favorite OS, Linux.

MOSIX is an extension of the kernel. Therefore, to install MOSIX you have to launch an appropriate installation script that applies all the changes to the kernel sources (you do have your kernel's sources on your disk, don't you?). The changes make up about 3% of all the source code of the kernel, so it's really not that much. Once you have re-compiled your kernel and booted with it, you have a working MOSIX node of a cluster. Replicating this work among all the Linux machines in your network will create a MOSIX cluster. Notice that MOSIX is available for all 2.2.x kernels as well as for the preliminary versions of 2.4.

There is one configuration file, /etc/mosix, that has to be configured to let the local host see all the other MOSIX nodes in the cluster. That is very easy to do, much like adding hosts to the /etc/hosts file. Once you have done that, the nodes in the MOSIX cluster start talking to each other. The way they do that is that every node sends its current load status to a random list of other nodes, i.e. it tells some other nodes how busy it currently is. Why does it not tell everybody in the MOSIX cluster? Good question, eh? Because MOSIX tries to schedule newly launched programs on the node with the lowest current load. If the machine with the lowest load level announced itself to ALL the nodes in the cluster, then everybody would try to migrate newly launched jobs to that node, and soon all the jobs in the cluster would be thrown at it. (A small sketch of this randomized load dissemination follows below.)

The nice thing about MOSIX is that you just launch a program and the cluster decides (unless told otherwise) where to run it. The inventors of MOSIX call it "fork-and-forget". And here it shows where MOSIX spent a good part of its life: in the air force. MOSIX, developed by Prof. Amnon Barak (no apparent link to Israel's prime minister Ehud Barak) at the Hebrew University of Jerusalem, was in part financed by a DARPA grant and was used in the eighties by the US Air Force to cluster its PDP-11/45 computers.
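To make that load-dissemination idea a bit more concrete, here is a minimal, purely illustrative sketch in Python. Every name and number in it (the cluster size, the gossip fan-out, the load figures) is invented for the example; the real MOSIX code lives inside the kernel and differs in every detail. The point is only that each node keeps a partial, randomly refreshed view of the cluster's load and places new jobs on the least-loaded node it happens to know about, which is what avoids the herd effect described above.

# Illustrative simulation only; not MOSIX code. All values are made up.
import random

NODES = 8          # hypothetical cluster size
GOSSIP_FANOUT = 3  # each node tells only a few random peers about its load

# Each node's current load (arbitrary units, invented for the example).
load = {n: random.uniform(0.0, 1.0) for n in range(NODES)}

# Each node's partial view of the cluster: it always knows its own load.
view = {n: {n: load[n]} for n in range(NODES)}

def gossip_round():
    """Every node sends its current load to a few randomly chosen peers."""
    for sender in range(NODES):
        peers = random.sample([n for n in range(NODES) if n != sender],
                              GOSSIP_FANOUT)
        for peer in peers:
            view[peer][sender] = load[sender]

def place_job(home_node):
    """'Fork-and-forget': run the new job on the least-loaded node the
    home node currently knows about, then account for the added load."""
    target = min(view[home_node], key=view[home_node].get)
    load[target] += 0.1                  # the new job adds some load
    view[target][target] = load[target]  # the target updates its own view
    return target

gossip_round()
print("node 0 places a job on node", place_job(0))

Because each node only ever sees a random slice of the cluster, different home nodes make different placement decisions, and no single lightly loaded machine gets buried under everyone's jobs at once.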
Lately, MOSIX was given a new algorithm to select on which node a given program should run. The mathematical model for this scheduling algorithm comes from the field of economics research. Determining the optimal location for a job is a complicated problem. The most important complication is that the resources available on a cluster of Linux computers are heterogeneous. In effect, the costs for memory, CPU, process communication and so forth are incomparable. They are not even measured in the same units: communication resources are measured in terms of bandwidth, memory in terms of space, and CPU in terms of cycles. The natural greedy strategy, balancing each resource across all of the machines, is not even well defined.

The new algorithm employed by MOSIX is very interesting because it tries to reconcile these differences (and maybe it could be applied to non-cluster schedulers as well) based on economic principles and competitive analysis. The key idea of this strategy is to convert the total usage of several heterogeneous resources, such as memory and CPU, into a single homogeneous "cost". Jobs are then assigned to the machine where they have the lowest cost, just like in a market-oriented economy. This economic strategy provides a unified algorithmic framework for the allocation of computation, communication, memory and I/O resources. It allows the development of near-optimal online algorithms for allocating and sharing these resources. The mathematical model is not trivial, but it is certainly coherent. During my long testing period on the Hebrew University MOSIX cluster, I was able to show that this new scheduling algorithm is actually faster than the previous semi-dynamic one, by up to 27%. This is indeed an important invention, and I will try to apply its ideas to the Linux SMP (symmetric multi-processing) scheduler one of these days.
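As an illustration of the unified-cost idea (and only that), here is a toy sketch. The cost function and every number in it are invented; the actual MOSIX algorithm, based on competitive analysis, is considerably more refined. The sketch only shows the shape of the approach: express each resource's usage as a marginal cost in one common currency, add the costs up, and send the job to the cheapest node.

# Toy sketch of a unified "cost" across heterogeneous resources.
# All nodes, numbers and the cost function are invented for illustration.

# Per-node spare capacity, in each resource's own units (hypothetical).
nodes = {
    "node-a": {"cpu_free": 0.9, "mem_free_mb": 512, "net_mbps": 100},
    "node-b": {"cpu_free": 0.2, "mem_free_mb": 2048, "net_mbps": 100},
    "node-c": {"cpu_free": 0.6, "mem_free_mb": 128, "net_mbps": 10},
}

# What the incoming job is expected to need (also hypothetical).
job = {"cpu": 0.5, "mem_mb": 256, "net_mbps": 5}

def marginal_cost(free, needed):
    """Cost grows sharply as a resource nears exhaustion; an
    over-committed resource is effectively infinitely expensive."""
    if needed >= free:
        return float("inf")
    return needed / (free - needed)

def node_cost(res, job):
    # Sum the per-resource marginal costs into one homogeneous number.
    return (marginal_cost(res["cpu_free"], job["cpu"])
            + marginal_cost(res["mem_free_mb"], job["mem_mb"])
            + marginal_cost(res["net_mbps"], job["net_mbps"]))

best = min(nodes, key=lambda n: node_cost(nodes[n], job))
print("cheapest node for this job:", best)

Once everything is expressed in the same currency, "where should this job run?" becomes a simple comparison, which is exactly what makes the economic framing attractive for an online scheduler.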
But that is not all that comes out of the bright minds at Hebrew U. One other major hassle of SSI clusters is that each node has to be able to see the file systems of every other node. Why? Because if I launch a program that opens the file /tmp/moshe for read/write, and this job then gets migrated to another node of the cluster, it has to be able to continue doing I/O to and from this file. There were until now two options for doing so. Either the MOSIX cluster intercepts all I/O of jobs that were migrated to the current host from another node and sends those I/O requests back to the originating node, or you create a global view of the file systems through NFS. The first option is more difficult to develop but easier to maintain in day-to-day operations. The second is easier to implement, but you have the headache of mounting all the file systems in an intelligent manner, allowing every node access to all the others. Additionally, you have to make sure that all user IDs and group IDs are consistent among all the nodes in the cluster, otherwise serious permission problems arise. Until now, MOSIX supported both options. But now it has come out with a new cluster file system for Linux which gives a shared, cluster-wide view of all the file systems.

Prof. Barak and his assistants (some of his students got an M.Sc. degree at the age of 17!) saw that all current solutions for a cluster-wide file system relied on a central file server, but that some new file system technologies were being developed addressing the very needs of a single system image cluster like MOSIX. Among them, probably most prominently, are the Global File System (www.gfs.org) and Frangipani (http://research.Compaq.com/SRC/projects). Taking the best of modern file system research and applying it to MOSIX, Prof. Barak came out with DFSA (Direct File System Access). The DFSA file system was designed to reduce the extra overhead of executing I/O-oriented system calls of a migrated process. This was done by allowing most such system calls to execute locally, in the process's current node. In addition to DFSA, a new algorithm that takes I/O operations into account was added to the MOSIX process distribution (load-balancing) policy.

The outcome of these provisions is that a process that performs a moderate to high volume of I/O is encouraged to migrate to the node in which it does most of its I/O. One obvious advantage is that I/O-bound processes have greater flexibility to migrate from their respective home nodes for better load balancing. So, unlike all existing network file systems (say, NFS), which bring the data from the file server to the client node over the network, the MOSIX cluster attempts to migrate the process to the node in which the file actually resides. During the testing we did at Hebrew U., we saw that DFSA with the MOSIX load-balancing algorithm always outperforms NFS as the clustering file system, by a minimum of 14% and up to a maximum of 80%. The tests were conducted in a variety of environments, with both small and large block sizes.

Lessons
----------

If you use Linux servers and you need to significantly improve the power of a server without spending money on an expensive 4-way or 8-way machine, MOSIX can certainly help you. It is here today. It is VERY easy to install and configure. It is proven technology, used by industries, government agencies and commercial companies the world over. Forget the various nonsense cluster products for Linux on the market. Some of them either do not really work or are basically just a bunch of scripts held together by a somewhat appealing GUI. As far as I can see, MOSIX is the only true SSI cluster available for Linux today. And it works.

</snip>

--
Ron Peterson
Systems Manager
Wallace Floyd Design Group
273 Summer Street
Boston, MA 02210
617.350.7400 tel
617.350.0051 fax
rpeterson at wallacefloyd.com