I'm forwarding a portion of Moshe Bar's most recent newsletter, because I thought some people might find it interesting. If you don't know who Moshe is, you might like to check out his web site and maybe subscribe to his newsletter. He's someone who's very deep into Linux (he just wrote a book about Linux kernel internals) and who likes to write a lot. He also seems to thrive without sleep. I don't always know what he's talking about, but I pick up a little here and there. Anyway, enough of my blather...

<snip>

Moshe Bar's Opinion - http://www.moelabs.com

Dear Readers,

There are not many joys left to us these days. Becoming a parent, passing an important exam, getting married, and getting a much sought-after job are among them. One more is holding your newly published book in your hands for the first time. That is what happened to me yesterday. After countless hours, days, and months of hard work, of reading, re-reading, checking and re-checking, and modifying word after word (two hundred thousand of them, mind you), you think it is all a hopeless mass of words with no significance to anyone but the author. Therefore, the author never believes it will ever see print. But boy, did it see print! Big time!

For the newcomers to this list, I am talking about my newly published book "Linux Internals" by McGraw-Hill, available in all fine bookstores and in all the important online book catalogues. The book talks about the Linux kernel and how it works. It explains what an operating system does and why Linux does it better than most. It shows what a program is and how the OS makes many of them run at the same time without ever getting confused. It explains how virtual memory works in a real OS like UNIX (of which Linux is a derivative) and how data is stored on disks and retrieved efficiently and securely. All this may sound very superficial.
Suffice it to say that in one of the chapters, for instance, I take an interrupt apart as it enters the Linux kernel and show step by step exactly what happens in the servicing of it. Also, in each chapter I show how to make modifications to your favorite kernel, such as giving it CPU affinity or implementing very advanced virtual paging algorithms. If you always wanted to know how an OS, and Linux in particular, works but were too afraid to ask, then go grab this book and read it. You'll understand afterwards.

But why am I telling you all this? Because I am one to keep promises. If you remember, I promised I would get you, the savvy readers of this mailing list, a way to order my book (and possibly the next one, coming this fall) with a big discount. The fine folks at my publisher, McGraw-Hill, agreed to my proposal and furthermore went on to create a web site for you guys and gals to order my book from at a discount of .... yes.... ladies and gentlemen, 20%. That is 20 as in twice 10% off a book being sold in the book store at US$ 49.99. So, for a very reasonable investment of US$ 39.99 you will get access to everything there is to know about the Linux kernel, including the upcoming version 2.4. The foreword is by Ingo Molnar, the number 3 in the Linux developers' food chain. Enough said. If you want to get my book at a good discount, you can do so from this special site at McGraw-Hill: http://www.mhorder.com/linuxinternals.html

Let's now get back to the usual issues of this mailing list. These days I am deeply infatuated with Linux clusters. Out of necessity, some might argue. After all, I am writing a book on the subject. But it is more than that. There is definitely something fascinating about creating a supercomputer out of a bunch of cheap PCs. You can create a supercomputer with clustered Linux boxes each costing around two to three thousand US ducats. A MOSIX cluster of Linux machines is what is called a Single System Image (SSI) cluster.
I have variously said and written that you don't have a real cluster until you have an SSI cluster. The first SSI clusters were the IBM SysPlex cluster and the DEC cluster. In a DEC cluster you could telnet to cluster.address and you would end up on one of the nodes of the cluster. You, the user, would not care which node you ran on, and any program you launched would be executed on the node that could best serve the program's needs. Well, Mosix is exactly the same. Only, it runs on our favorite OS, Linux.

Mosix is an extension of the kernel. Therefore, to install Mosix, you have to launch an installation script that applies all the changes to the kernel sources (you do have your kernel's sources on your disk, don't you?). The changes make up about 3% of the kernel's source code, so it's really not that much. Once you have recompiled your kernel and booted with it, you have a working Mosix node. Replicating this work on all the Linux machines in your network will create a Mosix cluster. Notice that Mosix is available for all 2.2.x kernels as well as for the preliminary versions of 2.4.

There is one configuration file, /etc/mosix, that has to be configured to let the local host see all the other Mosix nodes in the cluster. That is very easy to do, much like adding hosts to the /etc/hosts file. Once you have done that, the nodes in the Mosix cluster start talking to each other. The way they do that is that every node sends its current load status to a random list of other nodes, i.e. it tells some other nodes how busy it currently is. Why does it not tell everybody in the Mosix cluster? Good question, eh? Because Mosix tries to schedule newly launched programs on the node with the lowest current load. If the machine with the lowest load level announced itself to ALL the nodes in the cluster, then everybody would try to migrate newly launched jobs to that node.
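
The random load exchange described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mosix code: the function names, the `fanout` parameter and the sample load figures are all made up for the example, and the real kernel extension certainly does much more.

```python
import random


def pick_gossip_targets(node_id, all_nodes, fanout, rng=random):
    """Return a random subset of peers that will receive this node's load report."""
    peers = [n for n in all_nodes if n != node_id]
    return rng.sample(peers, min(fanout, len(peers)))


def gossip_round(loads, fanout, rng=random):
    """One exchange round: every node reports its load to `fanout` random peers.

    Returns {receiver: {sender: load}}, i.e. the partial view each node ends up with.
    """
    views = {node: {} for node in loads}
    for sender, load in loads.items():
        for receiver in pick_gossip_targets(sender, list(loads), fanout, rng):
            views[receiver][sender] = load
    return views


def choose_target(node_id, own_load, view):
    """Pick the least-loaded node this node knows about, itself included."""
    candidates = dict(view)
    candidates[node_id] = own_load
    return min(candidates, key=candidates.get)


# Hypothetical four-node cluster; loads are arbitrary numbers in [0, 1].
loads = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}
views = gossip_round(loads, fanout=2, rng=random.Random(0))
for node, view in views.items():
    print(node, "would start a new job on", choose_target(node, loads[node], view))
```

Because each node only sees a partial, random view, different nodes tend to pick different targets, so the least-loaded machine is not hit by everybody at once.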
A node that announced its low load to everyone would soon be swamped with all the jobs in the cluster.

The nice thing about Mosix is that you just launch a program and the cluster decides (unless told otherwise) where to run it. The inventors of Mosix call it "fork-and-forget". And here it shows where Mosix spent a good part of its life: in the air force. Mosix, developed by Prof. Amnon Barak (no apparent link to Israel's prime minister Ehud Barak) at the Hebrew University of Jerusalem, was financed in part by a DARPA grant and was used in the eighties by the US Air Force to cluster its PDP11/45 computers.

Lately, Mosix was given a new algorithm to select the node on which a given program should run. The mathematical model for this scheduling algorithm comes from the field of economics research. Determining the optimal location for a job is a complicated problem. The most important complication is that the resources available on a cluster of Linux computers are heterogeneous. In effect, the costs for memory, CPU, process communication and so forth are incomparable. They are not even measured in the same units. Communication resources are measured in terms of bandwidth, memory in terms of space, and CPU in terms of cycles. The natural greedy strategy, balancing the resources across all of the machines, is not even well defined. The new algorithm employed by MOSIX is very interesting because it tries to reconcile these differences (and maybe it could be applied to non-cluster schedulers as well) based on economic principles and competitive analysis.

The key idea of this strategy is to convert the total usage of several heterogeneous resources, such as memory and CPU, into a single homogeneous "cost". Jobs are then assigned to the machine where they have the lowest cost, just like in a market-oriented economy. This economic strategy provides a unified algorithmic framework for the allocation of computation, communication, memory and I/O resources.
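
To make the "single cost" idea concrete, here is a minimal Python sketch. The newsletter does not give the actual formula, so the cost function below (an exponential in each resource's utilization, with an arbitrary `base`) is an assumption chosen for illustration, as are the function names and the sample machines. It only shows how usage measured in different units can be folded into one comparable number and how a job then goes to the cheapest machine.

```python
def machine_cost(used, capacity, base=2.0):
    """Collapse per-resource usage into one unitless cost.

    Each resource contributes base ** utilization, where utilization is the
    used fraction of that resource's capacity. Dividing by capacity removes
    the units, so CPU share and megabytes of memory become comparable.
    The exponential form is an assumption made for this sketch.
    """
    return sum(base ** (used[r] / capacity[r]) for r in capacity)


def marginal_cost(job, used, capacity, base=2.0):
    """Increase in a machine's cost if `job` were placed on it."""
    after = {r: used[r] + job.get(r, 0.0) for r in capacity}
    return machine_cost(after, capacity, base) - machine_cost(used, capacity, base)


def place_job(job, machines, base=2.0):
    """Return the name of the machine where the job has the lowest marginal cost."""
    return min(
        machines,
        key=lambda name: marginal_cost(
            job, machines[name]["used"], machines[name]["capacity"], base
        ),
    )


# Hypothetical two-node cluster: CPU as a fraction of one processor, memory in MB.
machines = {
    "node1": {"used": {"cpu": 0.5, "mem": 128}, "capacity": {"cpu": 1.0, "mem": 256}},
    "node2": {"used": {"cpu": 0.2, "mem": 230}, "capacity": {"cpu": 1.0, "mem": 256}},
}
job = {"cpu": 0.1, "mem": 64}
print(place_job(job, machines))
```

In this example node2 has the idler CPU, but the job would push its memory past capacity, so the combined cost favors node1. A scheduler that balanced CPU load alone would have made the opposite choice.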
The strategy also allows the development of near-optimal online algorithms for allocating and sharing these resources. The mathematical model is not trivial, but it is certainly coherent. During my long testing period on the Hebrew University Mosix cluster, I was able to prove that this new scheduling algorithm is actually faster than the previous semi-dynamic one, by up to 27%. This is indeed an important invention and I will try to apply its ideas to the Linux SMP (symmetric multiprocessing) scheduler one of these days.

But that is not all that comes out of the bright minds at Hebrew U. One other major hassle of SSI clusters is that each node has to be able to see the file systems of every other node. Why? Because if I launch a program that opens the file /tmp/moshe for read/write, and this job then gets migrated to another node of the cluster, it has to be able to continue doing I/O to and from this file. Until now there were two options for doing so. Either the MOSIX cluster intercepts all I/O of jobs that were migrated to the current host from another node and sends those I/O requests back to the originating node, or you create a global view of the file system through NFS. The first option is more difficult to develop but easier to maintain in day-to-day operations. The second is easier to implement, but you have the headache of mounting all the file systems in an intelligent manner, giving every node access to all the nodes. Additionally, you have to make sure that all user IDs and group IDs are consistent among all the nodes in the cluster; otherwise serious permission problems will arise.

Until now, MOSIX supported both options. But now it has come out with a new Cluster File System for Linux which gives a shared cluster-wide view of all the file systems. Prof. Barak and his assistants (some of his students got an M.Sc. degree at the age of 17!)
saw that all current solutions for a cluster-wide file system relied on a central file server, but that some new file system technologies were being developed to address the very needs of a single system image (SSI) cluster like Mosix. The most prominent among them are probably the Global File System (www.gfs.org) and Frangipani (http://research.Compaq.com/SRC/projects). Taking the best of modern file system research and applying it to Mosix, Prof. Barak came out with DFSA (Direct File System Access). The DFSA file system was designed to reduce the extra overhead of executing I/O-oriented system calls of a migrated process. This was done by allowing most such system calls to execute locally, on the process's current node. In addition to DFSA, a new algorithm that takes I/O operations into account was added to the Mosix process distribution (load-balancing) policy. The outcome of these provisions is that a process that performs a moderate to high volume of I/O is encouraged to migrate to the node on which it does most of its I/O. One obvious advantage is that I/O-bound processes have greater flexibility to migrate from their respective home nodes for better load balancing. So, unlike existing network file systems (say, NFS), which bring the data from the file server to the client node over the network, the Mosix cluster attempts to migrate the process to the node on which the file actually resides.

During the testing we did at Hebrew U., we saw that DFSA together with the Mosix load-balancing algorithm always outperformed NFS as the clustering file system, by a minimum of 14% and a maximum of 80%. The tests were conducted in a variety of environments with small and large block sizes.

Lessons
----------

If you use Linux servers and you need to significantly improve the power of the server without spending money on an expensive 4-way or 8-way server, Mosix can certainly help you. It is here today. It is VERY easy to install and configure.
It is proven technology, used by industries, government agencies and commercial companies the world over. Forget the various nonsense cluster products for Linux on the market. Some of them are either not really working or are basically just a bunch of scripts held together by a somewhat appealing GUI. As far as I can see, Mosix is the only true SSI cluster available for Linux today. And it works.

</snip>

--
Ron Peterson
Systems Manager
Wallace Floyd Design Group
273 Summer Street
Boston, MA 02210
617.350.7400 tel
617.350.0051 fax
rpeterson at wallacefloyd.com

-
Subscription/unsubscription/info requests: send e-mail with "subscribe", "unsubscribe", or "info" on the first line of the message body to discuss-request at blu.org (Subject line is ignored).