RFC: 816

FAULT ISOLATION AND RECOVERY

David D. Clark
MIT Laboratory for Computer Science
Computer Systems and Communications Group
July, 1982


1. Introduction

Occasionally, a network or a gateway will go down, and the sequence
of hops which the packet takes from source to destination must change.
Fault isolation is that action which hosts and gateways collectively
take to determine that something is wrong; fault recovery is the
identification and selection of an alternative route which will serve
to reconnect the source to the destination. In fact, the gateways
perform most of the functions of fault isolation and recovery. There
are, however, a few actions which hosts must take if they wish to
provide a reasonable level of service. This document describes the
portion of fault isolation and recovery which is the responsibility of
the host.


2. What Gateways Do

Gateways collectively implement an algorithm which identifies the
best route between all pairs of networks. They do this by exchanging
packets which contain each gateway's latest opinion about the
operational status of its neighbor networks and gateways. Assuming
that this algorithm is operating properly, one can expect the gateways
to go through a period of confusion immediately after some network or
gateway has failed, but one can assume that once a period of
negotiation has passed, the gateways are equipped with a consistent
and correct model of the connectivity of the internet. At present this
period of negotiation may actually take several minutes, and many TCP
implementations time out within that period, but it is a design goal
of the eventual algorithm that the gateway should be able to
reconstruct the topology quickly enough that a TCP connection should
be able to survive a failure of the route.


3. Host Algorithm for Fault Recovery

Since the gateways always attempt to have a consistent and correct
model of the internetwork topology, the host strategy for fault
recovery is very simple. Whenever the host feels that something is
wrong, it asks the gateway for advice, and, assuming the advice is
forthcoming, it believes the advice completely. The advice will be
wrong only during the transient period of negotiation, which
immediately follows an outage, but will otherwise be reliably correct.


In fact, it is never necessary for a host to explicitly ask a
gateway for advice, because the gateway will provide it as
appropriate. When a host sends a datagram to some distant net, the
host should be prepared to receive back either of two advisory
messages which the gateway may send. The ICMP "redirect" message
indicates that the gateway to which the host sent the datagram is no
longer the best gateway to reach the net in question. The gateway will
have forwarded the datagram, but the host should revise its routing
table to have a different immediate address for this net. The ICMP
"destination unreachable" message indicates that as a result of an
outage, it is currently impossible to reach the addressed net or host
in any manner. On receipt of this message, a host can either abandon
the connection immediately without any further retransmission, or
resend slowly to see if the fault is corrected in reasonable time.
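
The handling of the two advisory messages can be sketched as follows.
This is a minimal illustration, not part of the original text: the
class RoutingTable, its method names, and the gateway and net labels
are all hypothetical, and real ICMP parsing is omitted.

```python
class RoutingTable:
    """Maps a destination net to the first-hop gateway the host uses."""

    def __init__(self, default_gateway):
        self.default_gateway = default_gateway
        self.first_hop = {}        # net -> gateway address
        self.unreachable = set()   # nets recently reported unreachable

    def gateway_for(self, net):
        return self.first_hop.get(net, self.default_gateway)

    def on_redirect(self, net, better_gateway):
        # The gateway has already forwarded the datagram, so the host
        # only needs to revise its table for future datagrams.
        self.first_hop[net] = better_gateway
        self.unreachable.discard(net)

    def on_unreachable(self, net):
        # Advisory only: the caller decides whether to abandon the
        # connection or to resend slowly.
        self.unreachable.add(net)


table = RoutingTable(default_gateway="gw-a")
table.on_redirect("net-18", "gw-b")
table.on_unreachable("net-10")
```

After these calls, datagrams for "net-18" go to "gw-b", while any net
without an entry still uses the default gateway.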


If a host could assume that these two ICMP messages would always
arrive when something was amiss in the network, then no other action
on the part of the host would be required in order to maintain its
tables in an optimal condition. Unfortunately, there are two
circumstances under which the messages will not arrive properly.
First, during the transient following a failure, error messages may
arrive that do not correctly represent the state of the world. Thus,
hosts must take an isolated error message with some scepticism. (This
transient period is discussed more fully below.) Second, if the host
has been sending datagrams to a particular gateway, and that gateway
itself crashes, then all the other gateways in the internet will
reconstruct the topology, but the gateway in question will still be
down, and therefore cannot provide any advice back to the host. As
long as the host continues to direct datagrams at this dead gateway,
the datagrams will simply vanish off the face of the earth, and
nothing will come back in return. Hosts must detect this failure.


If some gateway many hops away fails, this is not of concern to the
host, for then the discovery of the failure is the responsibility of
the immediate neighbor gateways, which will perform this action in a
manner invisible to the host. The problem only arises if the very
first gateway, the one to which the host is immediately sending the
datagrams, fails. We thus identify one single task which the host must
perform as its part of fault isolation in the internet: the host must
use some strategy to detect that a gateway to which it is sending
datagrams is dead.


Let us assume for the moment that the host implements some
algorithm to detect failed gateways; we will return later to discuss
what this algorithm might be. First, let us consider what the host
should do when it has determined that a gateway is down. In fact, with
the exception of one small problem, the action the host should take is
extremely simple. The host should select some other gateway, and try
sending the datagram to it. Assuming that gateway is up, this will
either produce correct results, or some ICMP advice. Since we assume
that, ignoring temporary periods immediately following an outage, any
gateway is capable of giving correct advice, once the host has
received advice from any gateway, that host is in as good a condition
as it can hope to be.


There is always the unpleasant possibility that when the host tries
a different gateway, that gateway too will be down. Therefore,
whatever algorithm the host uses to detect a dead gateway must
continuously be applied, as the host tries every gateway in turn that
it knows about.
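
A minimal sketch of trying each known gateway in turn, assuming the
dead-gateway detector is supplied as a function. The names
send_with_failover, is_dead, and send are hypothetical stand-ins for
whatever the host actually implements.

```python
def send_with_failover(gateways, is_dead, send):
    """Try each known gateway in order until one is not judged dead.

    gateways: ordered list of first-hop gateway addresses.
    is_dead:  callable standing in for the host's dead-gateway
              detection strategy.
    send:     callable that transmits the datagram via a gateway.
    Returns the gateway used, or None if every gateway appears dead.
    """
    for gw in gateways:
        if is_dead(gw):
            continue      # the detector is applied to every candidate
        send(gw)
        return gw
    return None


sent = []
dead = {"gw-a", "gw-b"}
chosen = send_with_failover(["gw-a", "gw-b", "gw-c"],
                            is_dead=lambda gw: gw in dead,
                            send=sent.append)
```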


The only difficult part of this algorithm is to specify the means
by which the host maintains the table of all of the gateways to which
it has immediate access. Currently, the specification of the internet
protocol does not architect any message by which a host can ask to be
supplied with such a table. The reason is that different networks may
provide very different mechanisms by which this table can be filled
in. For example, if the net is a broadcast net, such as an ethernet or
a ringnet, every gateway may simply broadcast such a table from time
to time, and the host need do nothing but listen to obtain the
required information. Alternatively, the network may provide the
mechanism of logical addressing, by which a whole set of machines can
be provided with a single group address, to which a request can be
sent for assistance. Failing those two schemes, the host can build up
its table of neighbor gateways by remembering all the gateways from
which it has ever received a message. Finally, in certain cases, it
may be necessary for this table, or at least the initial entries in
the table, to be constructed manually by a manager or operator at the
site. In cases where the network in question provides absolutely no
support for this kind of host query, at least some manual intervention
will be required to get started, so that the host can find out about
at least one gateway.
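
As an illustration of the last two schemes, the table can start from
manually configured entries and grow as messages arrive from new
gateways. This is only a sketch; the class GatewayTable and its
methods are hypothetical.

```python
class GatewayTable:
    """Ordered list of neighbor gateways known to the host."""

    def __init__(self, manual_entries=()):
        # Initial entries supplied by an operator; duplicates dropped
        # while preserving order.
        self._gateways = list(dict.fromkeys(manual_entries))

    def learn(self, address):
        # Remember any gateway from which a message has been received.
        if address not in self._gateways:
            self._gateways.append(address)

    def known(self):
        return list(self._gateways)


gateways = GatewayTable(manual_entries=["gw-a"])
gateways.learn("gw-b")    # e.g. the source of a redirect or broadcast
gateways.learn("gw-a")    # already known, so nothing changes
```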


4. Host Algorithms for Fault Isolation

We now return to the question raised above. What strategy should
the host use to detect that it is talking to a dead gateway, so that
it can know to switch to some other gateway in the list? In fact,
there are several algorithms which can be used. All are reasonably
simple to implement, but they have very different implications for the
overhead on the host, the gateway, and the network. Thus, to a certain
extent, the algorithm picked must depend on the details of the network
and of the host.




1. NETWORK LEVEL DETECTION

Many networks, particularly the Arpanet, perform precisely the
required function internal to the network. If a host sends a datagram
to a dead gateway on the Arpanet, the network will return a "host
dead" message, which is precisely the information the host needs to
know in order to switch to another gateway. Some early implementations
of Internet on the Arpanet threw these messages away. That is an
exceedingly poor idea.


2. CONTINUOUS POLLING

The ICMP protocol provides an echo mechanism by which a host may
solicit a response from a gateway. A host could simply send this
message at a reasonable rate, to assure itself continuously that the
gateway was still up. This works, but, since the message must be sent
fairly often to detect a fault in a reasonable time, it can imply an
unbearable overhead on the host itself, the network, and the gateway.
This strategy is prohibited except where a specific analysis has
indicated that the overhead is tolerable.
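
The trade-off between overhead and detection time can be estimated
with simple arithmetic. The sketch below is illustrative only: the
function polling_load, the host count, the intervals, and the rule of
declaring a gateway dead after three missed replies are assumptions,
not figures from this document.

```python
def polling_load(hosts, interval_s, misses_before_dead=3):
    """Return (echo requests per second arriving at one gateway,
    approximate seconds needed to declare that gateway dead)."""
    requests_per_s = hosts / interval_s
    detect_s = interval_s * misses_before_dead
    return requests_per_s, detect_s


fast = polling_load(hosts=200, interval_s=2)    # quick detection
slow = polling_load(hosts=200, interval_s=60)   # light load
```

With these assumed numbers, polling every 2 seconds detects a failure
in about 6 seconds but costs the gateway 100 echo requests per second;
polling every 60 seconds cuts the load to a few requests per second
but takes about 3 minutes to notice the failure.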


3. TRIGGERED POLLING

If the use of polling could be restricted to only those times when
something seemed to be wrong, then the overhead would be bearable.
Provided that one can get the proper advice from one's higher level
protocols, it is possible to implement such a strategy. For example,
one could program the TCP level so that whenever it retransmitted a
segment more than once, it sent a hint down to the IP layer which
triggered polling. This strategy does not have excessive overhead, but
does have the problem that the host may be somewhat slow to respond to
an error, since only after polling has started will the host be able
to confirm that something has gone wrong, and by then the TCP above
may have already timed out.
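
A minimal sketch of triggered polling, assuming the transport layer
passes its retransmission count down as the hint. The class
TriggeredPoller and the echo callable are hypothetical; echo stands in
for an ICMP echo exchange and returns True if the gateway answers.

```python
class TriggeredPoller:
    """Polls a gateway only after a hint from the transport layer."""

    def __init__(self, echo, retransmit_threshold=2):
        self.echo = echo                          # gateway -> bool
        self.retransmit_threshold = retransmit_threshold
        self.polls_sent = 0

    def on_retransmit(self, gateway, retransmit_count):
        """Return True if the gateway is judged dead."""
        if retransmit_count < self.retransmit_threshold:
            return False          # nothing seems wrong yet; do not poll
        self.polls_sent += 1
        return not self.echo(gateway)


poller = TriggeredPoller(echo=lambda gw: gw != "gw-a")  # gw-a is silent
first = poller.on_retransmit("gw-a", 1)
second = poller.on_retransmit("gw-a", 2)
```

No echo is sent on the first retransmission; the second one triggers a
poll, which in this example goes unanswered.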


Both forms of polling suffer from a minor flaw. Hosts as well as
gateways respond to ICMP echo messages. Thus, polling cannot be used
to detect the error that a foreign address thought to be a gateway is
actually a host. Such a confusion can arise if the physical addresses
of machines are rearranged.


4. TRIGGERED RESELECTION

There is a strategy which makes use of a hint from a higher level,
as did the previous strategy, but which avoids polling altogether.
Whenever a higher level complains that the service seems to be
defective, the Internet layer can pick the next gateway from the list
of available gateways, and switch to it. Assuming that this gateway is
up, no real harm can come of this decision, even if it was wrong, for
the worst that will happen is a redirect message which instructs the
host to return to the gateway originally being used. If, on the other
hand, the original gateway was indeed down, then this immediately
provides a new route, so the period of time until recovery is
shortened. This last strategy seems particularly clever, and is
probably the most generally suitable for those cases where the network
itself does not provide fault isolation. (Regrettably, I have
forgotten who suggested this idea to me. It is not my invention.)
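
Reselection on a hint from above can be sketched in a few lines. The
class GatewaySelector and its method names are hypothetical; the point
is only that a complaint rotates to the next gateway without polling,
and a later redirect can undo a wrong switch.

```python
class GatewaySelector:
    """Chooses the first-hop gateway, switching on complaints."""

    def __init__(self, gateways):
        self.gateways = list(gateways)
        self.index = 0

    @property
    def current(self):
        return self.gateways[self.index]

    def on_complaint(self):
        # A higher level reports trouble: move to the next gateway.
        self.index = (self.index + 1) % len(self.gateways)
        return self.current

    def on_redirect(self, gateway):
        # A live gateway corrects the choice, possibly sending the
        # host back to the gateway it was using originally.
        if gateway not in self.gateways:
            self.gateways.append(gateway)
        self.index = self.gateways.index(gateway)


selector = GatewaySelector(["gw-a", "gw-b", "gw-c"])
after_complaint = selector.on_complaint()
selector.on_redirect("gw-a")
```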



5. Higher Level Fault Detection

The previous discussion has concentrated on fault detection and
recovery at the IP layer. This section considers what the higher
layers such as TCP should do.


TCP has a single fault recovery action; it repeatedly retransmits a
segment until either it gets an acknowledgement or its connection
timer expires. As discussed above, it may use retransmission as an
event to trigger a request for fault recovery to the IP layer. In the
other direction, information may flow up from IP, reporting such
things as ICMP Destination Unreachable or error messages from the
attached network. The only subtle question about TCP and faults is
what TCP should do when such an error message arrives or its
connection timer expires.


The TCP specification discusses the timer. In the description of
the open call, the timeout is described as an optional value that the
client of TCP may specify; if any segment remains unacknowledged for
this period, TCP should abort the connection. The default for the
timeout is 30 seconds. Early TCPs were often implemented with a fixed
timeout interval, but this did not work well in practice, as the
following discussion may suggest.


Clients of TCP can be divided into two classes: those running on
immediate behalf of a human, such as Telnet, and those supporting a
program, such as a mail sender. Humans require a sophisticated
response to errors. Depending on exactly what went wrong, they may
want to abandon the connection at once, or wait for a long time to see
if things get better. Programs do not have this human impatience, but
also lack the power to make complex decisions based on details of the
exact error condition. For them, a simple timeout is reasonable.


Based on these considerations, at least two modes of operation are
needed in TCP. One, for programs, abandons the connection without
exception if the TCP timer expires. The other mode, suitable for
people, never abandons the connection on its own initiative, but
reports to the layer above when the timer expires. Thus, the human
user can see error messages coming from all the relevant layers, TCP
and ICMP, and can request TCP to abort as appropriate. This second
mode requires that TCP be able to send an asynchronous message up to
its client to report the timeout, and it requires that error messages
arriving at lower layers similarly flow up through TCP.
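
The two modes can be sketched as follows. This is an illustration
under stated assumptions, not an interface from the TCP
specification: TimeoutMode, Connection, and the notify callback are
hypothetical names, and notifying the client in both modes is a choice
made here for simplicity.

```python
from enum import Enum


class TimeoutMode(Enum):
    ABORT = "abort"     # for programs: give up when the timer expires
    REPORT = "report"   # for people: tell the client and carry on


class Connection:
    def __init__(self, mode, notify):
        self.mode = mode
        self.notify = notify      # asynchronous message to the client
        self.is_open = True

    def on_timer_expired(self):
        if self.mode is TimeoutMode.ABORT:
            self.is_open = False
        self.notify("timeout")

    def on_lower_layer_error(self, message):
        # Errors from lower layers flow up; they do not close the
        # connection by themselves.
        self.notify(message)

    def abort(self):
        # Requested explicitly by the client, e.g. a human user.
        self.is_open = False


events = []
batch = Connection(TimeoutMode.ABORT, events.append)
interactive = Connection(TimeoutMode.REPORT, events.append)
batch.on_timer_expired()
interactive.on_lower_layer_error("destination unreachable")
interactive.on_timer_expired()
```

Here the batch connection is closed by the timer, while the
interactive one stays open until its client calls abort().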


At levels above TCP, fault detection is also required. Either of
the following can happen. First, the foreign client of TCP can fail,
even though TCP is still running, so data is still acknowledged and
the timer never expires. Alternatively, the communication path can
fail, without the TCP timer going off, because the local client has no
data to send. Both of these have caused trouble.


Sending mail provides an example of the first case. When sending
mail using SMTP, there is an SMTP level acknowledgement that is
returned when a piece of mail is successfully delivered. Several early
mail receiving programs would crash just at the point where they had
received all of the mail text (so TCP did not detect a timeout due to
outstanding unacknowledged data) but before the mail was acknowledged
at the SMTP level. This failure would cause early mail senders to wait
forever for the SMTP level acknowledgement. The obvious cure was to
set a timer at the SMTP level, but the first attempt to do this did
not work, for there was no simple way to select the timer interval. If
the interval selected was short, it expired in normal operation when
sending a large file to a slow host. An interval of many minutes was
needed to prevent false timeouts, but that meant that failures were
detected only very slowly. The current solution in several mailers is
to pick a timeout interval proportional to the size of the message.
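
A size-proportional timeout might be computed as in the sketch below.
The function name, the rate of half a second per kilobyte, and the
60 second floor for small messages are illustrative assumptions and do
not come from any particular mailer.

```python
def smtp_ack_timeout(message_bytes, seconds_per_kb=0.5, floor_s=60.0):
    """Seconds to wait for the SMTP level acknowledgement.

    The wait grows in proportion to the message size; the floor (an
    added assumption) keeps tiny messages from timing out at once.
    """
    proportional = (message_bytes / 1024) * seconds_per_kb
    return max(floor_s, proportional)


small = smtp_ack_timeout(2 * 1024)       # 2 KB message
large = smtp_ack_timeout(1024 * 1024)    # 1 MB message
```

Under these assumptions the 2 KB message waits the 60 second floor,
and the 1 MB message waits 512 seconds.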


Server telnet provides an example of the other kind of failure. It
can easily happen that the communications link can fail while there is
no traffic flowing, perhaps because the user is thinking. Eventually,
the user will attempt to type something, at which time he will
discover that the connection is dead and abort it. But the host end of
the connection, having nothing to send, will not discover anything
wrong, and will remain waiting forever. In some systems there is no
way for a user in a different process to destroy or take over such a
hanging process, so there is no way to recover.


One solution to this would be to have the host server telnet query
the user end now and then, to see if it is still up. (Telnet does not
have an explicit query feature, but the host could negotiate some
unimportant option, which should produce either agreement or
disagreement in return.) The only problem with this is that a
reasonable sample interval, if applied to every user on a large
system, can generate an unacceptable amount of traffic and system
overhead. A smart server telnet would use this query only when
something seems wrong, perhaps when there had been no user activity
for some time.
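
The decision of when to query can be sketched as a small predicate.
The function should_probe, the 10 minute idle threshold, and the
2 minute gap between probes are hypothetical values chosen only for
illustration; times are in seconds.

```python
def should_probe(now, last_activity, last_probe,
                 idle_s=600, probe_gap_s=120):
    """Return True if the server should query the user end now.

    A probe is sent only when the user has been idle for idle_s
    seconds and no probe was sent within the last probe_gap_s.
    """
    idle = (now - last_activity) >= idle_s
    recently_probed = (last_probe is not None
                       and (now - last_probe) < probe_gap_s)
    return idle and not recently_probed


active_user = should_probe(now=1000, last_activity=900, last_probe=None)
idle_user = should_probe(now=1000, last_activity=300, last_probe=None)
just_probed = should_probe(now=1000, last_activity=300, last_probe=950)
```

Only the idle user who has not been probed recently is queried, so
active sessions generate no extra traffic.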


In both these cases, the general conclusion is that client level
error detection is needed, and that the details of the mechanism are
very dependent on the application. Application programmers must be
made aware of the problem of failures, and must understand that error
detection at the TCP or lower level cannot solve the whole problem for
them.


6. Knowing When to Give Up

It is not obvious, when error messages such as ICMP Destination
Unreachable arrive, whether TCP should abandon the connection. The
reason that error messages are difficult to interpret is that, as
discussed above, after a failure of a gateway or network, there is a
transient period during which the gateways may have incorrect
information, so that irrelevant or incorrect error messages may
sometimes return. An isolated ICMP Destination Unreachable may arrive
at a host, for example, if a packet is sent during the period when the
gateways are trying to find a new route. To abandon a TCP connection
based on such a message arriving would be to ignore the valuable
feature of the Internet that for many internal failures it
reconstructs its function without any disruption of the end points.


But if failure messages do not imply a failure, what are they for?
In fact, error messages serve several important purposes. First, if
they arrive in response to opening a new connection, they probably are
caused by opening the connection improperly (e.g., to a non-existent
address) rather than by a transient network failure. Second, they
provide valuable information, after the TCP timeout has occurred, as
to the probable cause of the failure. Finally, certain messages, such
as ICMP Parameter Problem, imply a possible implementation problem. In
general, error messages give valuable information about what went
wrong, but are not to be taken as absolutely reliable. A general
alerting mechanism, such as the TCP timeout discussed above, provides
a good indication that whatever is wrong is a serious condition, but
without the advisory messages to augment the timer, there is no way
for the client to know how to respond to the error. The combination of
the timer and the advice from the error messages provides a reasonable
set of facts for the client layer to have. It is important that error
messages from all layers be passed up to the client module in a useful
and consistent way.
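
One way a client layer might combine the timer with the advisory
messages is sketched below. The function diagnose, its arguments, and
the returned strings are hypothetical; the sketch only mirrors the
reasoning above and is not a prescribed policy.

```python
def diagnose(timer_expired, error_messages, opening):
    """Combine the timeout with collected error messages.

    timer_expired:  True once the TCP timeout has occurred.
    error_messages: advisory messages received so far, oldest first.
    opening:        True while the connection is still being opened.
    """
    if opening and error_messages:
        # Errors on open usually mean the connection was opened
        # improperly, e.g. to a non-existent address.
        return "open failed: " + error_messages[-1]
    if not timer_expired:
        # An isolated error may be transient; do not give up yet.
        return "keep waiting"
    if error_messages:
        return "failed, probable cause: " + error_messages[-1]
    return "failed, cause unknown"


transient = diagnose(False, ["destination unreachable"], opening=False)
serious = diagnose(True, ["destination unreachable"], opening=False)
```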

