I'm Not Going to Pay a Lot for This Super Computer
[An article by Jim Hill, Michael Warren and Patrick Goda in the
December '97 issue of Linux Journal.]
Figure 1. Parallel Linux Cluster Logo
Los Alamos National Laboratory and Caltech obtain gigaflops
performance on parallel Linux machines running free software and built
from commodity parts costing less than $55,000 each (in September
1996). Now, you can probably build a similar machine for about
$25,000.
As much as we might like to own a supercomputer, high cost is still a
deterrent. In a market with almost no economy of scale, buyers find
themselves relying on the vendors for specialty hardware, specialty
software and expensive support contracts while hoping that the vendors
don't join the long list of bankrupt former supercomputer vendors. The
limited number of sales opportunities forces vendors to try to satisfy
all customers, with the usual result that no one is really happy.
There is simply no way to provide highly specialized software (such as
a parallelizing compiler) and simultaneously keep costs out of the
stratosphere.
On the other end of the market, however, sits the generic buyer. More
correctly, tens of millions of generic buyers, all spending vast sums
for fundamentally simple machines with fundamentally simple parts.
What the vendors lose in profit margin, they make up for in volume.
The result? Commodity computer components are increasingly faster,
cheaper and smaller. It is now possible to take these off-the-shelf
parts and assemble machines which run neck-and-neck with the "big
boys" of supercomputing, and in some instances, surpass them.
Why Mass-Market Components?
Intel's x86 series of processors, especially the Pentium and Pentium
Pro, offer excellent floating-point performance at ever-increasing
clock speeds. The recently released Pentium II has a peak clock speed
of 300 MHz, while Digital's best Alpha processors compute merrily
along at 500 MHz and higher.
The PCI bus allows the processors to communicate with peripherals at
rates in excess of 100MB/sec. Because it is a processor-independent
bus, undertaking processor upgrades (e.g., from the Pentium Pros to
500MHz DEC Alphas) requires replacing only the processors and
motherboards. Further, parts replaced by an upgrade can be expected to
have a significant resale value.
The development of Fast Ethernet technology makes possible
point-to-point communication in excess of 10MB/sec. Switches which
allow multiple machines to use this bandwidth in full are readily
available, giving a Beowulf-class machine (see below) bandwidth and
latency that rival those of the larger IBM SP-2 and the Thinking
Machines CM-5. While the Beowulf machines don't yet scale
easily to hundreds of processors, their performance in smaller
networks of 16 or 32 processors is outstanding.
Free (and well-written) Software
The Linux operating system is robust, largely POSIX-compliant and
available to varying degrees of completeness for Intel x86, DEC Alpha
and PowerPC microprocessors. Thanks to the untiring efforts of its
legions of hackers, auxiliary hardware (network and disk drivers) is
supported almost as soon as it becomes available, and the occasional
bug is corrected when found, often the same day. GNU's compilers and
debuggers coupled with free message-passing implementations make it
possible to use Linux boxes for parallel programming and execution
without spending money on software.
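For the flavor of what this costs (nothing) and requires (a C compiler
and an MPI library), here is a minimal MPI program. This is an
illustrative sketch, assuming one of the free implementations such as
MPICH or LAM is installed; both provide the mpicc compiler wrapper and
the mpirun launcher used below.

/* hello_mpi.c -- compile with: mpicc hello_mpi.c -o hello_mpi
 * run across four nodes with: mpirun -np 4 ./hello_mpi          */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* start the MPI runtime  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* which node am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* how many nodes total?  */
    printf("Hello from node %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}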
Beowulf
The Beowulf Project studies the advantages of using interconnected PCs
built from mass-market components and running free software. Rather
than raw computational power, the quantities of interest derive from
the use of these mass-market components: performance/price,
performance/processor and so on. They provide an informal
"nonstandard" by loosely defining a "Beowulf-class" machine. Minimal
requirements are:
* 16 motherboards with Intel x86 processors or equivalent
* 256MB of DRAM, 16MB per processor board
* 16 hard disk drives and controllers, one per processor board
* 2 Ethernets (10baseT or 10base2) and controllers, 2 per processor
* 2 high resolution monitors with video controllers and 1 keyboard
The Beowulf-class idea is not so much to define a specific system as
to provide a rough guideline by which component improvement and
cross-platform Linux ports can be compared. Several Beowulf-class
machines are in use throughout the United States, including Loki in
the Los Alamos National Laboratory's Theoretical Astrophysics group
and Hyglac at Caltech's Center for Advanced Computing Research.
Figure 2. View of Loki from the Front
Figure 3. View of Loki from the Back
Loki at LANL
Loki is a 16-node parallel machine with 2GB RAM and 50GB disk space.
Most of the components were obtained from Atipa International
(www.atipa.com). Each node is essentially a Pentium Pro computer
optimized for number crunching and communication:
* (1) Intel Pentium Pro 200MHz CPU with 256K integrated L2 cache
* (1) Intel VS440FX (Venus) motherboard, 82440FX (Natoma) chip set
* (4) 8x36 60ns parity SIMMS (128MB per node)
* (1) Quantum Fireball 3240MB IDE Hard Drive
* (1) Cogent EM400 TX PCI Quartet Fast Ethernet Adapter
* (1) SMC EtherPower 10/100 Fast Ethernet PCI Network Card
* (1) S3 Trio-64 1MB PCI Video Card
The list purchase price of Loki's parts in September 1996 was just
over $51,000.
The nodes are connected to one another through the four-port Quartet
adapters into a fourth-degree hypercube. Each node is also connected
via the SMC adapter to one of two 3Com SuperStack II Switch 3000 TX
eight-port Fast Ethernet switches, which serve the dual purpose of
bypassing multi-hop routes and providing a direct connection to the
system's front end, a dream machine with the following components:
* (2) Intel Pentium Pro 200MHz CPU with 256K integrated L2 cache
* (1) ASUS P/I-P65UP5 dual CPU motherboard, Natoma chip set
* (8) 16x36 60ns parity SIMMS (512MB)
* (6) Quantum Atlas 4.3GB UltraSCSI Hard Drive
* (1) Adaptec 2940UW PCI Fast Wide SCSI Controller
* (1) Cogent EM400 TX PCI Quartet Fast Ethernet Adapter
* (1) SMC EtherPower 10/100 Fast Ethernet PCI Network Card
* (1) Matrox Millennium 4MB PCI Video Card
* (1) 21 inch Nokia 445X Monitor
* (1) Keyboard, Mouse, Floppy Drive
* (1) Toshiba 8x IDE CD-ROM
* (1) HP C1533A DAT DDS-2 4GB Tape Drive
* (1) Quantum DLT 2000XT 15GB Tape Drive
It is also possible for the nodes to communicate exclusively through
their SMC-SuperStack connections as a fast, switched array topology.
At Supercomputing '96, Loki was connected to Caltech's Hyglac and the
two were run as a single fast switched machine.
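The hypercube wiring is straightforward to express in code: number the
nodes 0 through 15 and connect each pair whose numbers differ in
exactly one bit, so that every node has four neighbors, one per
Quartet port. The sketch below (an illustration, not code taken from
Loki) prints the resulting neighbor table.

/* hypercube.c -- neighbor table for a fourth-degree hypercube */
#include <stdio.h>

#define DEGREE 4    /* fourth-degree hypercube: 2^4 = 16 nodes */

int main(void)
{
    int node, d;

    for (node = 0; node < (1 << DEGREE); node++) {
        printf("node %2d:", node);
        /* flipping bit d of a node number gives its neighbor on port d */
        for (d = 0; d < DEGREE; d++)
            printf(" %2d", node ^ (1 << d));
        printf("\n");
    }
    return 0;
}

Routing between any two nodes amounts to fixing the differing bits one
at a time, so no message needs more than four hops; the SuperStack
switches serve precisely to bypass those multi-hop routes.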
Hyglac at Caltech
Like Loki, Hyglac is a 16-node Pentium Pro computer with 2GB RAM. At
the time of its construction, it had 40GB disk space, though that has
since been doubled by adding a second hard drive of the type listed
below to each node.
* (1) Intel Pentium Pro 200 MHz CPU with 256K integrated L2 cache
* (1) Intel VS440FX (Venus) motherboard, 82440FX (Natoma) chip set
* (4) 8x32 60ns EDO SIMMS (128MB per node)
* (1) Western Digital 2.52GB IDE Hard Drive
* (1) D-Link DFE-500 TX 100Mbps Fast Ethernet PCI Card
* (1) VGS-16 512K ISA Video Card
Each node is connected to a 16-way Bay Networks 28115 Fast Ethernet
Switch in a fast switched topology. Video output is directed to a
single monitor through switches; the node which is directly connected
to the monitor also supports a second Ethernet card and a floppy
drive. The list purchase price of Hyglac in September 1996 was just
over $48,500. Most of the components have since decreased in price by
about 50%, and the highest single-cost item (a 16-port Fast Ethernet
Switch) can now be obtained for less than $2500!
Software on Loki and Hyglac
Both Loki and Hyglac run Red Hat Linux on all nodes, with GNU's gcc
2.7.2 as the compiler. The 200MHz Pentium Pros that drive both systems
supply a real-time clock with a 5 nanosecond tick, providing precise
timing for message passing. More advanced timing and counting routines
are available as well, so profiling data such as cache hits and misses
can be collected directly. A relatively simple interface to the CPU's
hardware performance-monitoring counters, called perfmon, has been
developed at LANL; it is available at the Loki URL listed in the
Resources.
Internode communication is accomplished via the Message Passing
Interface (MPI). While multiple implementations of MPI are freely
available, none was specifically written to take advantage of a Fast
Ethernet-based system, and, as usual, maximum portability leads to
decidedly less than maximum efficiency. Accordingly, a minimal
implementation incorporating the 20 or so most common and basic MPI
functions was written from scratch. This specialized MPI library runs
the treecode discussed in the next section as well as the Version 2
NAS parallel benchmarks, while nearly doubling the message bandwidth
obtained from the LAM (Ohio State's version of MPI) and MPICH (from
Argonne National Laboratory and Mississippi State University)
implementations.
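As a rough illustration of how such bandwidth figures are measured,
the sketch below bounces a 1MB message back and forth between two
nodes and reports the sustained rate. It is a generic ping-pong test
built from standard MPI calls, not a piece of the specialized library
itself, and will compile against any MPI implementation.

/* pingpong.c -- measure message bandwidth between nodes 0 and 1 */
#include <stdio.h>
#include <mpi.h>

#define NBYTES (1 << 20)  /* 1MB per message */
#define REPS   100

int main(int argc, char **argv)
{
    static char buf[NBYTES];
    int rank, i;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {          /* node 0 sends, then awaits the echo */
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {   /* node 1 echoes everything back */
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)  /* each round trip moves 2 x NBYTES */
        printf("sustained bandwidth: %.2f MB/sec\n",
               2.0 * NBYTES * REPS / ((t1 - t0) * 1e6));

    MPI_Finalize();
    return 0;
}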
Treecode
Because of its use in astrophysics, Loki has been used to compute
results for an N-body, gravitational-interaction problem using a
parallelized hashed oct-tree library. (An oct-tree is a
three-dimensional tree data structure in which each cubical cell is
recursively divided into eight daughter cells; a treecode is a
numerical algorithm which uses tree data structures to increase the
efficiency of N-body simulations. For details on the treecode, see the
URL listed in Resources.) The code is not machine-specific, so
comparing the performance of the commodity machines to traditional
supercomputers is free of porting issues (with the exception that the
Intel i860 and the Thinking Machines CM-5 have an inner loop coded in
assembly).
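The "hash" in the hashed oct-tree comes from mapping each particle's
position to an integer key: quantize the three coordinates and
interleave their bits, so that each successive group of three bits
picks one of the eight daughter cells at the next level down. The
function below is a generic Morton-style key, shown only for
illustration; the production treecode's key construction differs in
its details.

/* Map a position in the unit cube (coordinates in [0,1)) to a 30-bit
 * oct-tree key: 10 bits per coordinate, interleaved from the coarsest
 * tree level to the finest.                                          */
unsigned long octree_key(double x, double y, double z)
{
    unsigned long ix = (unsigned long)(x * 1024.0); /* 10-bit quantization */
    unsigned long iy = (unsigned long)(y * 1024.0);
    unsigned long iz = (unsigned long)(z * 1024.0);
    unsigned long key = 0;
    int b;

    for (b = 9; b >= 0; b--)        /* most significant bits first */
        key = (key << 3)
            | (((ix >> b) & 1) << 2)
            | (((iy >> b) & 1) << 1)
            |  ((iz >> b) & 1);
    return key;
}

Particles that are close in space tend to have nearby keys, so sorting
on the key both organizes the tree levels and gives a natural way to
split the particles across processors.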
At Supercomputing '96, Loki and Hyglac were connected (via $3,000 of
additional Ethernet cards and cables) to perform as a single 32-node
machine with a purchase cost of just over $100,000. Running the N-body
benchmark calculation with 10 million particles, Loki+Hyglac achieved
2.19 billion floating-point operations per second (GFLOPS), more than
doubling the per-processor performance of a Cray T3D and almost
matching that of an IBM SP-2 (see Table 1).
Table 1. Treecode Performance on Selected Architectures
Machine           Processors   Time (s)   GFLOPS   MFLOPS/proc
ASCI Red                1408      27.82    96.53          68.5
TMC CM-5                 512      140.7    14.06          27.5
Intel Paragon            512      144.4    13.70          26.8
TMC CM-5E                256      171.0    11.57          45.2
Intel Delta              512      199.3    10.02          19.6
IBM SP-2                 128      281.9     9.52          74.4
Cray T3D                 256      338.0     7.94          31.0
SGI Origin 2000           24      394.2     5.02         209
Loki+Hyglac               32     1218       2.19          68.4
Loki                      16     2102       1.28          80.0
Figure 5. Intermediate stage of a gravitational N-body simulation of
galaxy formation using 9.75 million particles. It took about three
days to compute on Loki at a sustained rate of about 1 GFLOPS.
As a stand-alone machine at LANL, Loki has performed an N-body
calculation with just over 9.75 million particles. This calculation
was "real work" and not "proof-of-principle", so it was tuned to
optimize scientific results rather than machine performance. Even with
that condition, the performance and results are striking. The total
simulation required 10 days (less a few hours) to step through 750
time steps, performed 6.6x10^14 floating-point operations to compute
1.97x10^13 particle interactions and produced just over 10GB of output
data.
For the entire simulation, Loki achieved an average of 879MFLOPS,
yielding a price/performance figure of $58/MFLOP. Contemporary
machines such as SGI's Origin are capable of price/performance in this
range, but scaling an Origin to the memory and disk necessary to
perform a calculation of this magnitude quickly becomes prohibitive;
at list price, 2GB of Origin's memory alone costs more than the entire
Loki assembly.
The nature of the treecode is such that later time steps have greater
overhead in spanning the tree than in performing floating-point
arithmetic, so the average flop rate steadily decreases the longer the
code is run. When the first 30 time steps of the simulation are taken
into consideration, 1.15x10^12 particle interactions in 10.25 hours
provide a throughput of 1.19 GFLOPS. This figure actually is a better
estimate of the amount of useful work than that given for the total
simulation, since the treecode's purpose is to avoid floating-point
calculations whenever possible.
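These rates follow directly from the raw counts above. The sketch
below is a back-of-the-envelope check (our arithmetic, using only the
figures quoted in the text, not additional measurements).

/* rates.c -- sanity-check the quoted performance figures */
#include <stdio.h>

int main(void)
{
    /* price/performance: parts cost divided by the sustained rate */
    printf("price/performance: $%.0f per MFLOP\n",
           51000.0 / 879.0);                     /* ~ $58/MFLOP */

    /* implied cost of one particle interaction over the whole run */
    printf("flops/interaction: %.1f\n",
           6.6e14 / 1.97e13);                    /* ~ 33.5 */

    /* early-phase rate from the interaction count alone, using that
     * whole-run average cost; it lands within ~15% of the quoted
     * 1.19 GFLOPS, the per-interaction flop cost evidently being
     * somewhat higher in the early steps                            */
    printf("first 30 steps:    %.2f GFLOPS\n",
           1.15e12 * (6.6e14 / 1.97e13) / (10.25 * 3600.0) / 1e9);
    return 0;
}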
Vortex-Particle
Loki has also been used to simulate the fusion of two vortex rings.
The simulation began with 57,000 vortex particles in two discrete
smoke rings, though re-meshing caused the simulation to be tracking
360,000 particles by the final time step. Each processor sustained
just over 65 MFLOPS during the simulation for a total system
performance of 950 MFLOPS.
Photo-realistic Rendering
Hyglac has been used to perform photo-realistic rendering using a
Monte Carlo implementation of the rendering equation. Some of the
rendered images are available at
http://www.cacr.caltech.edu/research/beowulf/rendering.html. In a
direct comparison, Hyglac completed the renderings anywhere from 12%
to 20% faster than an IBM SP-2, a machine with a price tag twenty
times that of Hyglac.
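The Monte Carlo approach estimates the rendering equation's integral
by averaging random samples, which is why it spreads so well across a
cluster: each node can trace its share of samples independently, and
only the final averages need to be gathered. The toy sketch below
shows the estimator at the heart of the method, integrating a simple
function rather than light transport.

/* mc.c -- Monte Carlo integration of f(x) = x*x over [0,1];
 * the exact answer is 1/3. A renderer applies the same averaging
 * idea to random light paths for each pixel.                     */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long i, n = 1000000;
    double x, sum = 0.0;

    srand(1);                 /* fixed seed for a repeatable run */
    for (i = 0; i < n; i++) {
        x = rand() / (double)RAND_MAX;   /* uniform sample in [0,1] */
        sum += x * x;
    }
    printf("estimate: %f  (exact: %f)\n", sum / n, 1.0 / 3.0);
    return 0;
}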
System Reliability
Even the most blazingly fast system is useless if it can't perform
without crashing. System reliability is therefore crucial, especially
in the case of a machine like Loki which may need several days without
interruption to complete a large-scale calculation. During the burn-in
period, a bad SIMM and a handful of bad hard drives were replaced
under their warranty terms. The warranties on commodity parts make
these commodity supercomputers particularly appealing. Warranties on
specialty machines like the Origin tend to be 90 days or less, whereas
readily available parts such as Loki's innards generally have
warranties ranging from a year to life. In September 1997, most of the
Loki nodes had uptimes of over 4 months without a reboot. The only
hardware problems encountered have been three ATX power supply fans
that failed, resulting in node shutdowns due to overheating. Each
affected node was easily swapped with a spare, and the fans were
replaced in a few minutes.
Price-to-Performance
In Table 2, we summarize the price/performance of several machines
capable of running the NAS (Numerical Aerospace Simulation Facility at
NASA Ames Research Center) Class B benchmarks: Loki, the SGI Origin
2000, the IBM SP-2 P2SC and the DEC AlphaServer 8400/440.
Table 2. Price/Performance Summary
                     Loki   Origin    SP-2   AlphaServer
Price Nov. 96 (K$)     55      960    3520           580
Processors             16       26      64             8
NPB 2.2B Time (s)    5049      957     304             -
NPB Price/Perf        1.0      3.3     3.8             -
SPECint Price/Perf    1.0      7.0    16.0           9.5
SPECfp Price/Perf     1.0      2.6     4.2           5.8
Stream Price/Perf     1.0      5.3     1.8          12.0
Conclusions
A gravitational N-body simulation won LANL's Michael Warren and
Caltech's John Salmon a Gordon Bell Performance Prize in 1992. A scant
five years later, that same calculation can be run on a $50,000
machine. Technology continues to advance (Warren and Salmon recently
achieved 170 sustained GFLOPS while running the N-body code with over
320 million particles on half of the nearly 10,000 processors of the
Teraflops "ASCI Red" machine at Sandia National Laboratories), but the
cost of the ever-improving "high-end" supercomputers keeps them beyond
the reach of all but a lucky few. Even those lucky few must compete
with one another for processor time in the never-ending game of
large-scale computation. Commodity parts provide an opportunity for a
handful of users to have a significant share of processor cycles on a
machine which is capable of solving enormous computational problems in
a reasonable time. Linux and the free software movement provide the
software to take full advantage of the hardware's capabilities.
Jim Hill is a graduate research assistant at Los Alamos National
Laboratory who's thinking about renaming one of his two Linux boxes a
zeroth-degree hypercube.
Michael S. Warren (mswarren@lanl.gov) received a B.S. in Physics and
in Engineering and Applied Science from the California Institute of
Technology in 1988 and a Ph.D. in Physics from the University of
California, Santa Barbara in 1994. He is currently developing treecode
algorithms for problems in cosmology and hydrodynamics as a technical
staff member in the Theoretical Astrophysics group at Los Alamos
National Laboratory. He has been involved with parallel computing
since 1986 and with Linux since 1993.
Patrick Goda (pgoda@lanl.gov) is currently employed at Los Alamos
National Lab in the Theoretical Astrophysics group. When he's not
building clusters or hacking Linux software, he actually uses Linux to
do real research in Meteoritics (like smashing simulated asteroids
into a simulated Earth and seeing how grim the situation would be).
Resources
* Beowulf Project: CESDIS, NASA Goddard Space Flight Center,
http://cesdis.gsfc.nasa.gov/beowulf/
* Loki: Los Alamos National Laboratory,
http://loki-www.lanl.gov/index.html
* Hyglac and Naegling: Center for Advanced Computing Research,
California Institute of Technology,
http://www.cacr.caltech.edu/research/beowulf/
* The Warren-Salmon Treecode: http://qso.lanl.gov/papers/Papers.html
* Grendel: Department of Electrical and Computer Engineering,
Clemson University, http://ece.clemson.edu/parl/grendel.htm
* Hrothgar: Drexel University,
http://einstein.drexel.edu/beowulf/Beowulf.html