As per Relevance of the word distribute, we have this rfc below:






K. P. Birman (Cornell
Network Working Group T. A. Joseph (Cornell
Request for Comments: 992 November 1986



On Communication Support for Fault Tolerant Process

K. P. Birman and T. A.
Dept. of Computer Science, Cornell
Ithaca, N.Y. 14853
607-255-9199


1. Status of this Memo

This memo describes a collection of multicast communication primi
tives integrated with a mechanism for handling process failure
recovery. These primitives facilitate the implementation of fault
tolerant process groups, which can be used to provide
services in an environment subject to non-malicious crash failures
Unlike other process group approaches, such as Cheriton's "
groups" (RFC's 966, 988, [Cheriton]), our approach provides
guarantees about the behavior of the communication subsystem
process group membership is changing dynamically, for example due
process or site failures, recoveries, or migration of a process
one site to another. Our approach also addresses delivery
issues that arise when multiple clients communicate with a
group concurrently, or a single client transmits multiple
messages to a group without pausing to wait until each is received
Moreover, the cost of the approach is low. An implementation is be
ing undertaken at Cornell as part of the ISIS project

Here, we argue that the form of "best effort" reliability provided
host groups may not address the requirements of those researchers
are building fault tolerant software. Our basic premise is that re
liable handling of failures, recoveries, and dynamic process migra
tion are important aspects of programming in distributed environ
ments, and that communication support that provides
behavior in the presence of such events places an unacceptable
of complexity on higher level application software. This
does not arise when using the fault-tolerant process group alterna
tive

This memo summarizes our approach and briefly contrasts it with
process group approaches. For a detailed discussion, together
figures that clarify the details of the approach, readers are re
ferred to the papers cited below

Distribution of this memo is unlimited




Birman & Joseph [Page 1]

RFC 992 November 1986


2.

This memo was adopted from a paper presented at the Asilomar
on fault-tolerant distributed computing, March 1986, and
material from a technical report that was issued by Cornell Universi
ty, Dept. of Computer Science, in August 1985, which will appear
ACM Transactions on Computer Systems in February 1987 [Birman-b].
Copies of these paper, and other relevant papers, are available
request from the author: Dept. of Computer Science, Cornell Universi
ty, Ithaca, New York 14853. (birman@gvax.cs.cornell.edu). The
project also maintains a mailing list. To be added to this list
contact M. Schmizzi (schiz@gvax.cs.cornell.edu).

This work was supported by the Defense Advanced Research
Agency (DoD) under ARPA order 5378, Contract MDA903-85-C-0124, and
the National Science Foundation under grant DCR-8412582. The views
opinions and findings contained in this report are those of the au
thors and should not be construed as an official Department of De
fense position, policy, or decision

3.

At Cornell, we recently completed a prototype of the ISIS system
which transforms abstract type specifications into fault-
distributed implementations, while insulating users from the mechan
isms by which fault-tolerance is achieved. This version of ISIS, re
ported in [Birman-a], supports transactional resilient objects as
basic programming abstraction. Our current work undertakes to pro
vide a much broader range of fault-tolerant programming mechanisms
including fault-tolerant distributed bulletin boards [Birman-c]
fault-tolerant remote procedure calls on process groups [Birman-b].
The approach to communication that we report here arose as part
this new version of the ISIS system

Unreliable communication primitives, such as the multicast group com
munication primitives proposed in RFC's 966 and 988 and in [Cheri
ton], leave some uncertainty in the delivery status of a message
failures and other exceptional events occur during communication
Instead, a form of "best effort" delivery is provided, but with
possibility that some member of a group of processes did not
the message if the group membership was changing just as communica
tion took place. When we tried to use this sort of primitive in
original work on ISIS, which must behave reliably in the presence
such events, we had to address this aspect at an application level
The resulting software was complex, difficult to reason about,
filled with obscure bugs, and we were eventually forced to
the entire approach as infeasible

A wide range of reliable communication primitives have been
in the literature, and we became convinced that by using them,
complexity of our software could be greatly reduced. These



Birman & Joseph [Page 2]

RFC 992 November 1986


from reliable and atomic broadcast [Chang] [Cristian] [Schneider]
Byzantine agreement [Strong]. For several reasons, however, the ex
isting work does not solve the problem at hand. The most obvious
that they do not provide a mechanism for sending a message to all
members of a group when the membership is changing dynamically (
"group addressing" problem). In addition, one can identify
ordering issues and questions regarding the detection of communica
tion failures that should be handled within the broadcast mechanism
These motivate a careful reexamination of the entire reliable broad
cast problem

The multicast primitives we report here are designed to
several sorts of ordering constraints, and have cost and latency
varies depending on the nature of the constraint required [Birman-b
[Joseph-a] [Joseph-b]. Failure and recovery are integrated into
communication subsystem by treating these events as a special sort
multicast issued on behalf of a process that has failed or recovered
The primitives are presented in the context of fault tolerant
groups: groups of processes that cooperate to implement some distri
buted algorithm or service, and which need to see consistent order
ings of system events in order to achieve mutually
behavior. Such groups are similar to the host groups of the V
and the ones described in RFC's 966 and 988, but provide
of consistency in just the situations where a host group provides
"best effort" delivery which may sometimes be erroneous

It is helpful to think of our primitives as providing a logical
"virtual" form of reliability: rather than addressing
delivery issues, they ensure that a client will never observe a sys
tem state "inconsistent" with the assumption that reliable
has occurred. Readers familiar with serializability theory may
to think of this as a weaker analog: in serializability, one
interleaved executions of operations provided that the resulting sys
tem state is consistent with the assumption that execution
sequential. Similarly, reliable communication primitives permit de
viations from the reliable delivery abstraction provided that
resulting system state is indistinguishable from one in which reli
able delivery actually did occur

Using our primitives, the ISIS system achieved both high levels
concurrency and suprisingly good performance. Equally important,
structure was made suprisingly simple, making it feasible to
about the correctness of the algorithms that are needed to
high availability even when failures, recoveries, or process migra
tion occurs. More recently, we have applied the same approach to
variety of other problems in distributed computing, and even
a consistent, fault tolerant, distributed bulletin board data struc
ture (a generalized version of the blackboards used in artificial in
telligence programs), with equally good results [Birman-c]. Thus,
feel that the approach has been shown to work in a variety of set
tings where unreliable primitives simply could not be used



Birman & Joseph [Page 3]

RFC 992 November 1986


In the remainder of this memo we summarize the issues and alterna
tives that the designer of a distributed system is presented with
focusing on two styles of support for fault-tolerant computing: re
mote procedure calls coupled with a transactional execution facility
such as is used in the ARGUS system [Liskov], and the fault-
process group mechanism mentioned above. We argue that
interactions are too restrictive to support the sort of
needed, and then show how our primitives can be used to provide
a mechanism. We conclude by speculating on future directions
which this work might be taken

4. Issues in fault-

The difficulty of constructing fault-tolerant distributed
can be traced to a number of interrelated issues. The list that fol
lows is not exhaustive, but attempts to touch on the principal con
siderations that must be addressed in any such system

[1]Synchronization. Distributed systems offer the potential
large amounts of concurrency, and it is usually desirable
operate at as high a level of concurrency as possible. However
when we move from a sequential execution environment to a con
current one, it becomes necessary to synchronize actions that
conflict in their access to shared data or entail
with overlapping sets of processes. Thus, a mechanism is
for ordering conflicting events. Additional problems that
arise in this context include deadlock avoidance or detection
livelock avoidance, etc

[2]Failure detection. It is usually necessary for a fault
tolerant application to have a consistent picture of which com
ponents fail, and in what order. Timeout, the most common mechan
ism for detecting failure, is unsatisfactory, because there
many situations in which a healthy component can timeout
respect to one component without this being detected by
another. Failure detection under more rigorous
requires an agreement protocol that is related to Byzantine agree
ment [Strong] [Hadzilacos]. Regardless of how this problem
solved, some sort of reliable failure detection mechanism will
needed in any fault-tolerant distributed system

[3] Consistency. When a group of processes cooperate in a distri
buted system, it is necessary to ensure that the
processes have consistent views of the state of the group as
whole. For example, if process p believes that some property
holds, and on the basis of this interacts with process q,
state of q should not contradict the fact that p believes A to
true. This problem is closely related to notions of knowledge
consistency in distributed systems [Halpern] [Lamport]. In
context, A will often be the assertion that a multicast has
received by q, or that q saw some sequence of events occur in



Birman & Joseph [Page 4]

RFC 992 November 1986


same order as did p. Thus, it is necessary to be able to
the precise consistency constraints on a distributed software sys
tem, and system support should be available to facilitate
attainment of these constraints

[4] Serializability. Many distributed systems are
into data manager processes, which implement shared variables,
transaction manager processes, which issue requests to
managers [Bernstein]. If transaction managers can execute con
currently, it is desirable to ensure that transactions
serializable outcomes [Eswaren] [Papadimitrou].
is increasingly viewed as an important property in "object
oriented" distributed systems that package services as
objects with which clients communicate by remote procedure
(RPC). On the other hand, there are systems for which serializa
bility is either too strong a constraint, or simply inappropriate
Thus, one needs a way to achieve serializability in
where it will be needed, without imposing system-wide
that would prevent the design of software subsystems for
serializability is not needed

Jointly, these problems render the design of fault-tolerant distri
buted software daunting in the absence of adequate support.
correctness of any proposed design and of its implementation
serious, if not insurmountable, concerns. In Sec. 7, we will
how the primitives of Sec. 6 provide simple ways to overcome all
these issues

5. Existing

If one rules out "unreliable" communication mechanisms, there
basically two fault-tolerant alternatives that can be pursued

The first approach is to provide mechanisms for
interactions between processes that communicate using remote pro
cedure calls [Birrell]. This has lead to work on nested
(due to nested RPC's) [Moss], support for transactions at
language level [Liskov], transactions within an operating
kernel [Spector] [Allchin] [Popek] [Lazowska], and
access to higher-level replicated services, such as resilient
in ISIS or relations in database systems. The primitives in a tran
sactional system provide mechanisms for distributing the request
initiates the transaction, accessing data (which may be replicated),
performing concurrency control, and implementing commit or abort
Additional mechanisms are normally needed for orphan termination
deadlock detection, etc. The issue then arises of how these mechan
isms should themselves be implemented

Our work in ISIS leads us to believe that whereas transactions
easily implemented on top of fault-tolerant process groups -- we
done so -- the converse is much harder. Moreover,



Birman & Joseph [Page 5]

RFC 992 November 1986


represent a relatively heavy-weight solution to the problems
in the previous section, and might impose an unacceptable overhead
subsystems that need to run non-transactionally, for example
a pair of concurrent processes needs to interact on a frequent basis
(We are not saying that "transactional" mechanisms such as
and toplevel actions can't solve this problem, but just that
yield a solution that is awkward and costly). This sort of
has lead us to focus on non-transactional interaction mechanisms,
to treat transactions as a special class of mechanisms used
processes that have been designed to employ a transactional
interact

The second approach involves the provision of a communication primi
tive, such as atomic broadcast, which can be used as the framework
which higher level algorithms are designed. Such a primitive
to deliver messages reliably to some set of destinations, despite
possibility that failures might occur during the execution of
protocol. Above, we termed this the fault tolerant process
approach, since it lends itself to the organization of
processes into groups, as described in the introduction.
groups are an extremely flexible abstraction, and have been
in the V Kernel [Cheriton] and in UNIX, and more recently in the
system. A proposal to provide Internet support for host groups
raised in RFC's 966 and 988. However, the idea of adapting the pro
cess group approach to work reliably in an environment subject to
sorts of exception events and concurrency cited in the previous sec
tion seems to be new

As noted earlier, existing reliable communication protocols do
address the requirements of fault-tolerant process groups. For exam
ple, in [Schneider], an implementation of a reliable multicast primi
tive is described. Such a primitive ensures that a designated mes
sage will be transmitted from one site to all other operational
in a system; if a failure occurs but any site has received the mes
sage, all will eventually do so. [Chang] and [Cristian]
implementations for atomic broadcast, which is a reliable
(sent to all sites in a system) with the additional property
messages are delivered in the same order at all overlapping destina
tions, and this order preserves the transmission order if
originate in a single site

Atomic broadcast is a powerful abstraction, and essentially the
behavior is provided by one of the multicast primitives we discuss
the next section. However, it has several drawbacks which made
hesitant to adopt it as the only primitive in the system. Most seri
ous is the latency that is incurred in order to satisfy the
ordering property. Without delving deeply into the implementations
which are based on a token scheme in [Chang] and an
protocol in [Schneider], we observe that the delaying of certain mes
sages is fundamental to the establishment of a unique global
ordering; indeed, it is easy to prove on knowledge theoretic



Birman & Joseph [Page 6]

RFC 992 November 1986


that this must always be the case. In [Chang] a primary goal is
minimize the number of messages sent, and the protocol given
extremely well in this regard. However, a delay occurs while
for tokens to arrive and the delivery latency that results may
high. [Cristian] assumes that clocks are closely synchronized
that message transit times are bounded by well-known constants,
uses this to derive atomic broadcast protocols tolerant of increas
ingly severe classes of failures. The protocols explicitly
delivery to achieve the desired global ordering on multicasts.
reasons discussed below, this tends to result in high latency in typ
ical local networking environments. An additional drawback of
atomic broadcast protocols is that no mechanism is provided
ensuring that all processes observe the same sequence of failures
recoveries, or for ensuring that failures and recoveries are
relative to ongoing multicasts. Since this problem arises in
setting where one process monitors another, we felt it should
addressed at the same level as the communication protocol. Finally
one wants a group oriented multicast protocol, not a site
broadcast, and this issue must be resolved too

6. Our multicast

We now describe three multicast protocols - GBCAST, ABCAST,
CBCAST - for transmitting a message reliably from a sender process
some set of destination processes. Details of the protocols
their correctness proofs can be found in [Birman-b]. The
ensure "all or nothing" behavior: if any destination receives a mes
sage, then unless it fails, all destinations will receive it.
addressing is discussed in Sec. 6.5.

The failure model that one adopts has a considerable impact on
structure of the resulting system. We adopted the model of fail-
processors [Schneider]: when failures occur, a processor simply
(crashes), as do all the processes executing on it. We also
that individual processes can crash, and that this is detected
it occurs by a monitoring mechanism present at each site.
assumptions are sometimes made about the availability of
realtime clocks. Here, we adopt the position that although reason
ably accurate elapsed-time clocks may be available, closely synchron
ized clocks probably will not be. For example, the 60Hz "line
clocks commonly used on current workstations are only accurate
16ms. On the other hand, 4-8ms inter-site message transit times
common and 1-2ms are reported increasingly often. Thus, it is impos
sible to synchronize clocks to better than 32-48ms, enough time for
pair of sites to exchange between 4 and 50 messages. Even
advancing technology, it seems safe to assume that clock skew
remain "large" when compared to inter-site message
speed. In particular, this argues against time-based protocols
as the one used in [Cristian





Birman & Joseph [Page 7]

RFC 992 November 1986


6.1 The GBCAST

GBCAST (group multicast) is the most constrained, and costly,
the three primitives. It is used to transmit information
failures and recoveries to members of a process group. A recov
ering member uses GBCAST to inform the operational ones that
has become available. Additionally, when a member fails,
system arranges for a GBCAST to be issued to group members on
behalf, informing them of its failure. Arguments to GBCAST are
message and a process group identifier, which is translated
a set of destinations as described below (Sec. 6.5).

Our GBCAST protocol ensures that if any process receives a multi
cast B before receiving a GBCAST G, then all overlapping destina
tions will receive B before G <1> This is true regardless of
type of multicast involved. Moreover, when a failure occurs,
corresponding GBCAST message is delivered after any other multi
casts from the failed process. Each member can therefore main
tain a VIEW listing the membership of the process group,
it when a GBCAST is received. Although VIEW's are not
simultaneously in real time, all members observe the
sequence of VIEW changes. Since, GBCAST's are ordered
to all other multicasts, all members receiving a given
will have the same value of VIEW when they receive it

Notice that GBCAST also provides a convenient way to change
global properties of a group "atomically". In our work, we
used GBCAST to dynamically change a ranking on the members of
group, to request that group members establish checkpoints
use if recovery is needed after all failure, and to
process migration. In each case, the ordering of GBCAST
to other events that makes it possible to perform the
action without running any additional protocol. Other uses
GBCAST will no doubt emerge as our research continues

Members of a process group can also use the value of VIEW to
a strategy for processing an incoming request, or to react
failure or recovery without having to run any special
first. Since the GBCAST ordering is the same everywhere,
actions will all be consistent. Notice that when all the
of a process group may have failed, GBCAST also provides an inex
pensive way to determine the last site that failed: process
members simply log each value of VIEW that becomes defined
stable storage before using it; a simplified version of the algo
rithm in [Skeen-a] can then be executed when recovering
failure








Birman & Joseph [Page 8]

RFC 992 November 1986


6.2 The ABCAST

The GBCAST primitive is too costly to be used for general commun
ication between process group members. This motivates the intro
duction of weaker (less ordered) primitives, which might be
in situations where a total order on multicast messages is
necessary. Our second primitive, ABCAST (atomic multicast),
satisfies such a weaker constraint. Specifically, it is
desired that if two multicasts are received in some order at
common destination site, they be received in that order at
other common destinations, even if this order was not predeter
mined. For example, if a process group is being used to
a replicated queue and ABCAST is used to transmit queue opera
tions to all copies, the operations will be done in the
order everywhere, hence the copies of the queue will remain mutu
ally consistent. The primitive ABCAST(msg, label, dests) pro
vides this behavior. Two ABCAST's having the same label
delivered in the same order at all common destinations

6.3 The CBCAST

Our third primitive, CBCAST (causal multicast), is weakest in
sense that it involves less distributed synchronization
GBCAST or ABCAST. CBCAST(msg, dests) atomically delivers msg
each operational dest. The CBCAST protocol ensures that if
multicasts are potentially causally dependent on another,
the former is delivered after the latter at all overlapping des
tinations. A multicast B' is potentially causally dependent on
multicast B if both multicasts originate from the same process
and B' is sent after B, or if there exists a chain of
transmissions and receptions or local events by which
could have been transferred from the process that issued B to
process that issued B' [Lamport]. For causally independent mul
ticasts, the delivery ordering is not constrained

CBCAST is valuable in systems like ISIS, where concurrency con
trol algorithms are used to synchronize concurrent computations
In these systems, if two processes communicate concurrently
the same process the messages are almost always independent
that can be processed in any order: otherwise, concurrency con
trol would have caused one to pause until the other was finished
On the other hand, order is clearly important within a
linked series of multicasts, and it is precisely this sort
order that CBCAST respects

6.4 Other multicast

A weaker multicast primitive is reliable multicast, which pro
vides all-or-nothing delivery, but no ordering properties.
formulation of CBCAST in [Birman-b] actually includes a
for performing multicasts of this sort, hence no



Birman & Joseph [Page 9]

RFC 992 November 1986


primitive is needed for the purpose. Additionally, there may
situations in which ABCAST protocols that also satisfy a
ordering property would be valuable. Our ABCAST primitive
be changed to respect such a rule, and we made use of a
primitive that is simultaneously causal and atomic in our work
consistent shared bulletin boards ([Birman-c]). For simplicity
the presentation here assumes that ABCAST is completely orthogo
nal to CBCAST, but a simple way to build an efficient "
atomic" multicast is described in our full-length paper.
cost of this protocol is only slightly higher than that
ABCAST

6.5 Group addressing

Since group membership can change dynamically, it may be diffi
cult for a process to compute a list of destinations to which
message should be sent, for example, as is needed to perform
GBCAST. In [Birman-b] we report on a protocol for ensuring
a given multicast will be delivered to all members of a
group in the same view. This view is either the view that
operative when the message transmission was initiated, or a
that was defined subsequently. The algorithm is a simple itera
tive one that costs nothing unless the group membership changes
and permits the caching of possibly inaccurate membership infor
mation near processes that might want to communicate with
group. Using the protocol, a flexible message addressing
can readily be supported

Iterative addressing is only required when the process transmit
ting a message has an inaccurate copy of the process group view
In the implementation we are now building, this would rarely
the case, and iteration is never needed if the view is known
be accurate. Thus, iterated delivery should be very infrequent

6.6 Synchronous versus asynchronous multicast

Many systems employ RPC internally, as a lowest level
for interaction between processes. It should be evident that
of our multicast primitives can be used to implement
remote procedure calls [Cooper]: the caller would simply
until replies have been received from all the
(observation of a failure constitutes a reply in this case).
term such a use of the primitives synchronous, to distinguish
from from an asynchronous multicast in which no replies, or
one reply, suffices

In our work on ISIS, GBCAST and ABCAST are normally invoked syn
chronously, to implement a remote procedure call by one member
an object on all the members of its process group. However
CBCAST, which is the most frequently used overall, is
never invoked synchronously. Asynchronous CBCAST's are



Birman & Joseph [Page 10]

RFC 992 November 1986


primary source of concurrency in ISIS: although the delivery ord
ering is assured, transmission can be delayed to enable a
to be piggybacked on another, or to schedule IO within the
as a whole. While the system cannot defer an asynchronous multi
cast indefinitely, the ability to defer it a little,
delaying some computation by doing so, permits load to
smoothed. Since CBCAST respects the delivery orderings on
a computation might depend, and is ordered with respect
failures, the concurrency introduced does not complicate
level algorithms. Moreover, the protocol itself is
cheap

A problem is introduced by our decision to allow
multicasts: the atomic reception property must now be extended
address causally related sequences of asynchronous messages.
a failure were to result in some multicasts being delivered
all their destinations but others that precede them not
delivered anywhere, inconsistency might result even if the desti
nations do not overlap. We therefore extend the atomicity pro
perty as follows. If process t receives a message m from
s, and s subsequently fails, then unless t fails as well,
messages m' that s received prior to its failure must
delivered to their remaining operational destinations. This
because the state of t may now depend on the contents of any
m', hence the system state could become inconsistent if
delivery of m' were not completed. The costs of the
are not affected by this change

A second problem arises when the user-level implications of
atomicity rule are considered. In the event of a failure,
suffix of a sequence of aysnchronous multicasts could be lost
the system state would still be internally consistent. A
that is about to take some action that may leave an
visible side-effect will need a way to pause until it
guaranteed that such multicasts have actually been delivered
For this purpose, a flush primitive is provided.
calls to flush do not eliminate the benefit of using CBCAST asyn
chronously. Unless the system has built up a considerable back
log of undelivered multicast messages, which should be rare
flush will only pause while transmission of the last few multi
casts complete

7. Using the

The reliable communication primitives described above lead to
solutions for the problems cited in Sec. 4:

[1] Synchronization. Many synchronization problems are
into the primitives themselves. For example, consider the use
GBCAST to implement recovery. A recovering process would issue
GBCAST to the process group members, requesting that



Birman & Joseph [Page 11]

RFC 992 November 1986


information be transferred to it. In addition to sending
current state of the group to the recovering process,
members update the process group view at this time.
messages to the group will be delivered to the recovered process
with all necessary synchronization being provided by the
properties of GBCAST. In situations where other forms of syn
chronization are needed, ABCAST provides a simple way to
that several processes take actions in the same order, and
form of low-level synchronization simplifies a number of higher
level synchronization problems. For example, if ABCAST is
to do P() and V() operations on a distributed semaphore,
order of operations on the semaphore is set by the ABCAST,
all the managers of the semaphore see these operations in a
order

[2] Failure detection. Consistent failure (and recovery) detec
tion are trivial using our primitives: a process simply waits
the appropriate process group view to change. This
the implementation of algorithms in which one processes
the status of another process. A process that acts on the
of a process group view change does so with the assurance
other group members will (eventually) observe the same event
will take consistent actions

[3] Consistency. We believe that consistency is
expressible as a set of atomicity and ordering constraints
message delivery, particularly causal ones of the sort
by CBCAST. Our primitives permit a process to specify the com
munication properties needed to achieve a desired form of con
sistency. Continued research will be needed to understand pre
cisely how to pick the weakest primitive in a designated situa
tion

[4] Serializability. To achieve serializability, one
a concurrency control algorithm and then forces computations
respect the serialization order that this algorithm choses.
ABCAST primitive, as observed above, is a powerful tool
establishing an order between concurrent events, e.g. by
acquisition. Having established such an order, CBCAST can
used to distribute information about the computation and also
termination (commit or abort). Any process that observes
commit or abort of a computation will only be able to
with data managers that have received messages preceding the com
mit or abort, hence a highly asynchronous transactional
results. If a process running a computation fails, this
detected when a failure GBCAST is received instead of the commit
Thus, executions are simple and quite deterministic

If commit is conditional, CBCAST would be used to first interro
gate participants to learn if they are prepared to commit,
then to transmit the commit or abort decision (the usual two



Birman & Joseph [Page 12]

RFC 992 November 1986


phase commit). On the other hand, conditional commits can
be avoided using our approach. A method for building transac
tions that will roll-forward after failure after failure is dis
cussed in more detail in [Birman-a] [Joseph-a] [Joseph-b].
forms of concurrency control, such as timestamp generation,
similarly be implemented using ABCAST and CBCAST. We view tran
sactional data storage as an application-level concern, which
be handled using a version stack approach or a multi-
store, or any other appropriate mechanism

8.

The communication primitives can be built in layers, starting with
bare network providing unreliable Internet datagrams. The
structure is, however, less mature and more complex than the one sug
gested in RFC's 966 and 988. For example, at this stage of
research we do not understand how to optimize our protocols to
same extent as for the unreliable host multicast approach
in those RFC's. Thus, the implementation we describe here should
understood to be a prototype. A particularly intriguing question
which we are investigating actively, concerns the use of a "
effort" ethernet or Internet multicast as a tool to optimize
implementation of our protocols

Our basic approach is to view large area networks as a set of clus
ters of sites interconnected by high speed LAN devices and intercon
nected by slower long-haul links. We first provide protocols for
within clusters, and then extend them to run between clusters too
Network partitioning can be tolerated at all levels of the
in the sense that no incorrect actions can result after network par
titioning, although our approach will sometimes block until the par
tition is repaired. Our protocols also tend to block within a clus
ter while the list of operational sites for that cluster is
changed. In normal LAN's, this happens infrequently (during
failure or recovery), and would not pose a problem. (In
intensive applications, alternative protocols might be needed
address this issue).

The lowest level of our software uses a site-to-site
protocol to convert the unreliable packet transport this into
sequenced, error-free message abstraction, using timeouts to
apparent failures. TCP can also be used for this purpose,
that a "filter" is placed on the incoming message stream and
types of messages are handled specially. An agreement protocol
then used to order the site-failures and recoveries consistently.
timeouts cause a failure to be detected erroneously, the
forces the affected site to undergo recovery

Built on this is a layer that supports the primitives themselves
CBCAST has a very light-weight implementation, based on the idea
flooding the system with copies of a message: Each process



Birman & Joseph [Page 13]

RFC 992 November 1986


copies of any messages needed to ensure the consistency of its
of the system. If message m is delivered to process p, and m
potentially causally dependent on a message m prime, then a copy of
prime is sent to p as well (duplicates are discarded). A
collector deletes superfluous copies after a message has reached
its destinations. By using extensive piggybacking and a
scheduling algorithm to control message transmission, the cost of
CBCAST is kept low -- often, less than one packet per destination
ABCAST employs a two-phase protocol based on one suggested to us
Skeen [Skeen-b]. This protocol has higher latency than
because delivery can only occur during the second phase; ABCAST
thus inherently synchronous. In ISIS, however, ABCAST is
rarely; we believe that this would be the case in other systems
well. GBCAST is implemented using a two-phase protocol similar
the one for ABCAST, but with an additional mechanism that
messages from a failed process before delivering the GBCAST announc
ing the failure. Although GBCAST is slower than ABCAST or CBCAST,
is used rarely enough so that performance is probably less of
issue here -- and in any case, even GBCAST could be tuned to
very high throughput. Preliminary performance figures appear
[Birman-b].

Although satisfactory performance should be possible using an imple
mentation that sits on top of a conventional Internet mechanism,
should be noted that to achieve really high rates of
the layers of software described above must reside in the kernel
because they run on behalf of large numbers of clients, run fre
quently, and tend to execute for very brief periods before doing I/
and pausing. A non-kernel implementation will thus incur
scheduling and context switching overhead. Additionally, it is
at all clear how to use ethernet style broadcast mechanisms to optim
ize the performance of this sort of protocol, although it should
possible. We view this as an interesting area for research

A forthcoming paper will describe higher level software that we
building on top of the basic fault-tolerant process group
described above

9.

The experience of implementing a substantial fault-tolerant
left us with insights into the properties to be desired from a com
munication subsystem. In particular, we became convinced that
build a reliable distributed system, one must start with a
communication subsystem. The multicast primitives described in
memo present a simple interface, achieve a high level of concurrency
can be used in both local and wide area networks, and are
to software ranging from distributed database systems to the fault
tolerant objects and bulletin boards provided by ISIS. Because
are integrated with failure handling mechanisms and respect
event orderings, they introduce a desirable form of determinism



Birman & Joseph [Page 14]

RFC 992 November 1986


distributed computation without compromising efficiency. A conse
quence is that high-level algorithms are greatly simplified,
the probability of error. We believe that this is a very
and practical approach to building large fault-tolerant
systems, and it is the only one we know of that leads to a
form of confidence in the resulting software

NOTES

<1> A problem arises if a process p fails without receiving some mes
sage after that message has already been delivered to some other pro
cess q: q's VIEW when it received the message would show p to
operational; hence, q will assume that p received the message
although p is physically incapable of doing so. However, the
of the system is now equivalent to one in which p did receive
message, but failed before acting on it. In effect, there exists
interpretation of the actual system state that is consistent with q'
assumption. Thus, GBCAST satisfies the sort of logical delivery pro
perty cited in the introduction



































Birman & Joseph [Page 15]

RFC 992 November 1986


10.

[RFC966] Deering, S. and Cheriton, D. Host groups: A multicast exten
sion to the internet protocol. Stanford University,
1985.

[RFC988] Deering, S. Host extensions for IP multicasting.
University, July 1986.

[Allchin] Allchin, J., McKendry, M. Synchronization and recovery
actions. Proc. 2nd ACM SIGACT/SIGOPS Principles of
Computing, Montreal, Canada, 1983.

[Babaoglu] Babaoglu, O., Drummond, R. The streets of Byzantium:
architectures for fast reliable multicast. IEEE Trans.
Software Engineering TSE-11, 6 (June 1985).

[Bernstein] Bernstein, P., Goodman, N. Concurrency control
for replicated database systems. ACM Computing Surveys 13, 2
(June 1981), 185-222.

[Birman-a] Birman, K. Replication and fault-tolerance in the ISIS sys
tem. Proc. 10th ACM SIGOPS Symposium on Operating Systems Princi
ples. Orcas Island, Washington, Dec. 1985, 79-86.

[Birman-b] Birman, K., Joseph, T. Reliable communication in the pres
ence of failures. Dept. of Computer Science, Cornell Univ.,
85-694, Aug. 1985. To appear in ACM TOCS (Feb. 1987).

[Birman-c] Birman, K., Joseph, T., Stephenson, P. Programming
fault tolerant bulletin boards in asynchronous distributed sys
tems. Dept. of Computer Science, Cornell Univ., TR 85-788, Aug
1986.

[Birrell] Birrell, A., Nelson, B. Implementing remote procedure calls
ACM Transactions on Computer Systems 2, 1 (Feb. 1984), 39-59.

[Chang] Chang, J., Maxemchuck, M. Reliable multicast protocols.
TOCS 2, 3 (Aug. 1984), 251-273.

[Cheriton] Cheriton, D. The V Kernel: A software base for
systems. IEEE Software 1 12, (1984), 19-43.

[Cooper] Cooper, E. Replicated procedure call. Proc. 3rd ACM
on Principles of Distributed Computing., August 1984, 220-232.
(May 1985).

[Cristian] Cristian, F. et al Atomic multicast: From simple diffusion
Byzantine agreement. IBM Technical Report RJ 4540 (48668), Oct
1984.




Birman & Joseph [Page 16]

RFC 992 November 1986


[Eswaren] Eswaren, K.P., et al The notion of consistency and
locks in a database system. Comm. ACM 19, 11 (Nov. 1976), 624-
633.

[Hadzilacos] Hadzilacos, V. Byzantine agreement under restricted
of failures (not telling the truth is different from telling
lies). Tech. ARep. TR-19-83, Aiken Comp. Lab., Harvard
(June 1983).

[Halpern] Halpern, J., and Moses, Y. Knowledge and common knowledge
a distributed environment. Tech. Report RJ-4421, IBM San
Research Laboratory, 1984.

[Joseph-a] Joseph, T. Low cost management of replicated data. Ph.D
dissertation, Dept. of Computer Science, Cornell Univ.,
(Dec. 1985).

[Joseph-b] Joseph, T., Birman, K. Low cost management of
data in fault-tolerant distributed systems. ACM TOCS 4, 1 (
1986), 54-70.

[Lamport] Lamport, L. Time, clocks, and the ordering of events in
distributed system. CACM 21, 7, July 1978, 558-565.

[Lazowska] Lazowska, E. et al The architecture of the EDEN system
Proc. 8th Symposium on Operating Systems Principles, Dec. 1981,
148-159.

[Liskov] Liskov, B., Scheifler, R. Guardians and actions:
support for robust, distributed programs. ACM TOPLAS 5, 3 (
1983), 381-404.

[Moss] Moss, E. Nested transactions: An approach to reliable, distri
buted computing. Ph.D. thesis, MIT Dept of EECS, TR 260,
1981.

[Papadimitrou] Papadimitrou, C. The serializability of concurrent data
base updates. JACM 26, 4 (Oct. 1979), 631-653.

[Popek] Popek, G. et al. Locus: A network transparent, high
distributed system. Proc. 8th Symposium on Operating
Principles, Dec. 1981, 169-177.

[Schlicting] Schlicting, R, Schneider, F. Fail-stop processors:
approach to designing fault-tolerant distributed computing sys
tems. ACM TOCS 1, 3, August 1983, 222-238.

[Schneider] Schneider, F., Gries, D., Schlicting, R. Reliable
protocols. Science of computer programming 3, 2 (March 1984).

[Skeen-a] Skeen, D. Determining the last process to fail. ACM TOCS 3,



Birman & Joseph [Page 17]

RFC 992 November 1986


1, Feb. 1985, 15-30.

[Skeen-b] Skeen, D. A reliable multicast protocol. Unpublished

[Spector] Spector, A., et al Distributed transactions for reliable sys
tems. Proc. 10th ACM SIGOPS Symposium on Operating Systems Prin
ciples, Dec. 1985, 127-146.

[Strong] Strong, H.R., Dolev, D. Byzantine agreement. Digest of papers
Spring Compcon 83, San Francisco, CA, March 1983, 77-81.












































Birman & Joseph [Page 18]








if you see any problems within the linking, don't worry be happy,
this is version 0.1 of the Relevance System and you gotta expect some crappy subroutines sometimes,
just be content we did not write this in Java, which would have made this "bigger and better" HAHAHHA.




RFC documents can be found at I.E.T.F.



Relevance System Copyright © 2002 Spectrum WorldResearch
other technical nosh by ServerMasters Corporation
collaboration of BobX







Spectrum