PLEX86  x86- Virtual Machine (VM) Program

Crash detection by OS



Del Cecchi processor crash

for ha-cmp

besides distributed lock manager, we had to do cluster membership management ... removal of member on possible failure as well as re-integrating a member back into the cluster.

we were able to draw on decades of experience with mainframe clusters ... as well as some of the vax-cluster history. for the distributed lock manager, we collected some information from some of the DBMS vendors that had done vax-cluster based implementations ... and had their list of things done wrong in vms. we had the advantage of a clean slate and starting from scratch.


for cluster membership management there was a heartbeat timer (another name for watchdog) mechanism. a problem then becomes shooting down a cluster member that is believed to have failed. the long-time model has been mainframe loosely-coupled (cluster by any other name) where different processors had shared concurrent access to the same disk drives (dating back to the 60s).
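
As a rough illustration of the heartbeat (watchdog) idea, here is a minimal sketch in Python. It is not taken from ha-cmp; the class name, the member names and the 5 second timeout are all made up for the example. Each member reports in periodically, and any member that has been silent for longer than the timeout is only *suspected* of having failed.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each cluster member.

    Hypothetical illustration only; names and timeout are assumptions.
    """

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a member is suspected
        self.last_seen = {}      # member name -> time of its last heartbeat

    def heartbeat(self, member, now):
        # Record that `member` was alive at time `now`.
        self.last_seen[member] = now

    def suspected_failed(self, now):
        # Members silent for longer than the timeout, in sorted order.
        return sorted(
            m for m, t in self.last_seen.items() if now - t > self.timeout
        )


monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("cpu-a", now=0.0)
monitor.heartbeat("cpu-b", now=0.0)
monitor.heartbeat("cpu-a", now=4.0)   # cpu-b stays silent from here on
suspects = monitor.suspected_failed(now=6.0)
print(suspects)
```

Note that the monitor cannot tell a dead processor from one that is merely stopped; it only knows that heartbeats have not arrived.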

the age-old scenario that you try to handle ... is somebody pushes the processor "STOP" button on the front-panel when the operating system is one instruction in front of a disk start i-o operation instruction. the rest of the cluster eventually decides the affected processor has failed ... goes into recovery and take-over ... and removes the "failed" processor from the active cluster membership ... and possibly re-assigns responsibility for resuming processes that were in progress on the failed complex.

the scenario then has the "START" button pressed for the "failed" processor and it proceeds to the disk start i-o operation. allowing the successful end of such a disk start i-o operation may obliterate valid information that has gone on while the processor was in the stop state. For this scenario ... the cluster mechanism needs a methodology that fences off processors that are believed to have failed (i.e. removed from standard cluster membership).
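
The STOP/START scenario can be sketched as a toy model in Python. This is an assumption-laden illustration, not the actual ha-cmp or mainframe mechanism: it assumes the shared storage side can itself refuse i-o from a fenced member, and `SharedDisk`, `fence`, `unfence` and the member names are hypothetical.

```python
class SharedDisk:
    """Toy shared disk that rejects writes from fenced-off members.

    Hypothetical illustration; assumes the storage side enforces the fence.
    """

    def __init__(self):
        self.blocks = {}     # block number -> data
        self.fenced = set()  # members removed from cluster membership

    def fence(self, member):
        # Called when the cluster decides `member` has failed.
        self.fenced.add(member)

    def unfence(self, member):
        # Called when `member` is re-integrated into the cluster.
        self.fenced.discard(member)

    def write(self, member, block, data):
        # Refuse i-o from any member that has been fenced off.
        if member in self.fenced:
            return False
        self.blocks[block] = data
        return True


disk = SharedDisk()
disk.write("cpu-a", 7, "old state")

# cpu-a is stopped one instruction before its next start i-o;
# the rest of the cluster gives up on it and takes over.
disk.fence("cpu-a")
disk.write("cpu-b", 7, "recovered state")

# cpu-a is started again and issues its stale write.
stale_ok = disk.write("cpu-a", 7, "stale state")
print(stale_ok, disk.blocks[7])
```

The point of the sketch is that the fence is checked where the i-o lands, so it still holds even though the restarted processor has no idea it was removed from the cluster.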

misc. old ha-cmp reference:

some number of past posts on distributed lock manager (or DLM ... which is unrelated to recent posts concerning DRM): "outer cylinders still faster than inner cylinders?"; "Public Key Exchange without Digital Certificates"; "relational data (was: Re: CAS and LL-SC (was Re: High Level Assembler for MVS & VM & VSE))"





Alt Folklore Computers from Newsgroups
