IBM 610 workstation computer 3436

You miss out on the sheer number of servers in use in commercial installations. A 4-way SMP (non-redundant CPU-wise, but with everything else except memory hot-swappable) with 6-8 disks fits in around 2-3 inches of rack space. It is normal for a business doing this well enough to hire a separate crew to have 2-5 racks full of them. In total this would cost around half of what a well-endowed TOPS-10 system cost in its heyday. In terms of user base they are huge. I have been closely involved in around 10 such installations, mostly delivering the connectivity.

Douglas Hofstadter writes that to understand (and change or improve) a system you must step out of the system. Hegel and Marx are onto the same thing with their thesis-antithesis stuff, but have problematised it beyond belief. Such a move happened to computing from 1987-2002, with perhaps 1995-1997 being the most important years.

So step back for a minute and ponder what this means. Such a computing farm is really a supercomputing centre, even by today's measures: 100 servers with 2-15 BIPS each. This is way too big for a single backplane. The backplane is also the source of all the show-stopping failures. So we stepped back and did something about the backplane. The main backplane is not the CPU backplane anymore. Sun won that fight: "The network IS the computer". We leave the CPU-based systems alone, let them work as well as they can, and move the failover complexity into the networks. They do this a lot better.

A system like the one I described is usually held together with a handful of switch fabrics. These do not switch memory accesses, but packets. These packets are either network payloads or disk data. Disk data is not really much different from network packets as seen from a switch. For technical reasons all the parallel interfaces are gone; even disks now have serial connections. It is a lot easier to make one very fast wire than lots of synchronised, somewhat fast wires.
But this has the side effect of making them very easy to switch. There are probably 2-4 such "switched backplanes", built from gigabit Ethernet switches, Fibre Channel muxes, or other SAN/NAS switch technologies. Probably two for the networks and two for the storage, although they may be one and the same.

If you want to provide a login service for the entire US population, this is perfectly doable on such a setup. Each time you logged in you would physically end up on a different CPU, but with your familiar home directory and environment just as normal. Gateway switches send you to a suitable server each time. These switches have failovers with a spare on hot standby. They are also important for CPU failovers, as they identify which hosts have high loads or have problems. The storage for that environment would be inside a humongous storage farm, using switches to bring the bits to the "user CPUs". This is just like having your code actually execute on a different CPU each time you ran a program, but accessing the familiar place on disk, attached to a disk controller all (or at least many) CPUs can reach. Except in this case the "controller" is switched to a thousand CPUs, and it knows about the file system details as well; and the local CPUs have some local stuff as well.

As a user, you just don't care about what provides the service, as long as it works. Web, news, mail etc. are in principle done the same way, but these have a huge advantage in that they carry their own state information, so the interactions look a lot more like CICS sessions than terminal logons. They come, execute a transaction and go away in milliseconds, and are happy to talk to a new server the next time, just as if each command ended up running on a different CPU.

Just think of what "the System" is. Hint: it is not just one computer anymore.
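
To make the gateway-switch idea concrete, here is a rough sketch in Python of how such a switch might pick a "suitable server" for each login. Everything in it is an assumption for illustration: the `Server` record, the `pick_server` and `login` functions, the 0.9 load cutoff and the `/home/<user>` path are made up, and a real switch does this in hardware or firmware with its own health checks.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    load: float           # 0.0 (idle) to 1.0 (saturated), as seen by the gateway
    healthy: bool = True  # False once the gateway has noticed a problem


def pick_server(servers, max_load=0.9):
    """Return the least-loaded healthy server, or None if none qualifies."""
    candidates = [s for s in servers if s.healthy and s.load < max_load]
    return min(candidates, key=lambda s: s.load, default=None)


def login(user, servers):
    """Place a login on whichever CPU looks suitable right now."""
    server = pick_server(servers)
    if server is None:
        raise RuntimeError("no server available")
    # The home directory lives on the shared storage farm, so the path is
    # the same whichever server was picked (the path itself is hypothetical).
    return server.name, f"/home/{user}"


farm = [
    Server("cpu-a", 0.70),
    Server("cpu-b", 0.20),
    Server("cpu-c", 0.05, healthy=False),  # lowest load, but marked as broken
]
print(login("alice", farm))  # ('cpu-b', '/home/alice')
```

The point of the sketch is only that the user-visible result (the home directory) does not depend on which box was chosen, while the choice itself depends on load and health.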
In this "network IS the computer" version you tell a switch that "box x is not to provide services", wait for stuff to migrate out of that CPU, which may take from milliseconds to hours, and can then replace that box at leisure. The really beautiful thing is that you can do the same thing with the switches themselves. Just set their cost as the highest in the network, and wait for stateful caches to time out. In this scenario you move the traffic away from the parts you want to service, and no one will notice. The box is also completely freed up; you can disconnect power and move it to a different location if you like.

With the web you can build it completely stateless and use crude failover. An occasional, literally one in a million, transaction will fail, but the reload will work. People accept small glitches, like having to reload an occasional web page, or press ^l in emacs. With higher quality services you must have more servers and a little more elaborate failover solutions, but nothing exceptionally difficult.

For the really important core network gear a simple master-slave setup has proven the most resilient. The slave can also snoop on what the master does at all times, and have a ready-to-roll set of states to take over with. This is what a protocol like CARP does.

For all the service boxes, parallelism is used by having lots and lots of powerful, but small (and cheap) 1-CPU boxes. The impact of any one of them failing is small.
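
The "box x is not to provide services" procedure from the start of this part can be sketched roughly as follows. This is a toy model, not any real switch's API: the `Switch` class and its `dispatch`, `finish`, `drain` and `safe_to_remove` methods are names invented for the example, and session counts stand in for whatever state actually has to migrate.

```python
class Switch:
    """Toy stand-in for a gateway switch; not a real switch interface."""

    def __init__(self, boxes):
        self.accepting = {b: True for b in boxes}  # may new work go to this box?
        self.sessions = {b: 0 for b in boxes}      # work currently on each box

    def dispatch(self):
        """Send one new session to the least busy box that still accepts work."""
        open_boxes = [b for b in self.accepting if self.accepting[b]]
        box = min(open_boxes, key=lambda b: self.sessions[b])
        self.sessions[box] += 1
        return box

    def finish(self, box):
        """A session on this box has ended or migrated away."""
        self.sessions[box] -= 1

    def drain(self, box):
        """'Box x is not to provide services': stop sending it new work."""
        self.accepting[box] = False

    def safe_to_remove(self, box):
        """True once the box is drained and nothing is left running on it."""
        return not self.accepting[box] and self.sessions[box] == 0


sw = Switch(["x", "y"])
sw.dispatch()                 # lands on "x"
sw.dispatch()                 # lands on "y"
sw.drain("x")                 # from now on, new sessions avoid "x"
print(sw.dispatch())          # y
print(sw.safe_to_remove("x"))  # False: one old session is still there
sw.finish("x")
print(sw.safe_to_remove("x"))  # True: the box can be pulled at leisure
```

The waiting step is the loop you would put around `safe_to_remove`; how long it takes depends on how long the existing work lives, which is the "milliseconds to hours" part.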
These service boxes are where the "user-mode" code executes, and in terms of CPU cycles this type of box dwarfs all the others by several orders of magnitude. The core servers also normally run as a simple master-slave configuration, with the slave keeping a replicated set of data. This master-slave setup is the only one that has proven sufficiently resilient against failures.

So this is why your insisting that SMP systems must be redundant has become an anachronism. The "system" is not the CPU itself anymore. The price to pay for making that SMP system able to withstand persistent hardware failures is not worth paying, whether measured in complexity, overhead, money, development or anything else. Rather, concentrate on making the SMP resilient against software, itself and users of all kinds, and accept that a call to panic() is allowed when the hardware turns sour. Other machines will take over without much disruption to the customers.

-- mrr
IBM 610 workstation computer, Alt Folklore Computers