Re: SCTP failover

From: Randall Stewart <rrs@cisco.com>
Date: Fri Feb 11 2005 - 14:18:40 EST
Brian:

Good point.. this could be an issue with a default
route or non-subnet routes in place.. much like
what happens the first few days of the inter-op when
folks forget to set up explicit network routes for
the two different sub-nets..

Only problem with that theory is that he would not
get any traffic ever through unless the interface
failed and the route was actually removed. As long as an
interface was good .. then the route would have stayed
in place and no traffic would have gotten there.. but
it is a possibility.. depending on what type of failure is
involved..
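
To make the routing concern concrete, here is a minimal sketch of longest-prefix route selection, assuming a much simplified model of the kernel's lookup. The subnets (10.0.1.0/24 and 10.0.2.0/24), the peer addresses, the interface names (eth0, eth1) and the helper egress_interface() are all hypothetical, chosen only for illustration. It shows that with only a default route both peer addresses leave through the same interface, so a retransmission to the alternate address would still go out the failed link, while explicit per-subnet routes keep the two paths separate.

```python
import ipaddress


def egress_interface(routes, destination):
    """Return the interface of the most specific route matching destination.

    routes is a list of (prefix, interface) pairs. This is a simplified
    longest-prefix match, not the real kernel lookup (no metrics, no
    source-address selection, no policy routing).
    """
    dest = ipaddress.ip_address(destination)
    best = None  # (network, interface) of the best match so far
    for prefix, iface in routes:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None


# Hypothetical routing tables for a dual-homed sender.
default_only = [("0.0.0.0/0", "eth0")]
per_subnet = [("10.0.1.0/24", "eth0"), ("10.0.2.0/24", "eth1")]

# Hypothetical peer addresses: primary on the first LAN, alternate on the second.
for peer in ("10.0.1.20", "10.0.2.20"):
    print(peer,
          "default-only ->", egress_interface(default_only, peer),
          "| per-subnet ->", egress_interface(per_subnet, peer))
```

With the default-only table both peers map to eth0; with the per-subnet table the alternate peer maps to eth1, which is the setup suggested for diagnosis.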

R

Brian F. G. Bidulock wrote:
> Anatoly,
>
> One thing that can impact failover performance is the source
> address and/or transmitting interface on retransmitted messages.
>
> If the transmitting interface does not change for retransmitted
> messages (because there exists a route from each interface to
> each peer transport address, or due to an error), then
> retransmitted messages will never appear. This sounds possibly
> like what is happening. You can play with the routing table a
> bit to make sure that you have two completely independent
> subnets described for diagnosis.
>
> If the retransmitting interface changes, but the source address
> placed in the retransmitted packets does not change, then the
> receiver is likely attempting to send SACKs on the failed path
> (if following the rule that SACKs are sent to the source address
> of the data they are acknowledging). So, if the data is one-way
> and the sender does not change the source address to match the
> actual outgoing interface on retransmitted messages, SACKs go to
> the wrong place and it takes multiple retransmissions for the
> receiver to detect duplicates and suspect that its SACKs are not
> getting through. If the receiver does not send SACKs to a
> different destination after receiving duplicates, or gets
> otherwise confused by the source address in the DATA packets
> then failover will be slowed. But this doesn't sound like what
> is happening because you said that you received no data at the
> receiver.
>
> Your description makes me believe that the former is the case:
> that, although it is retransmitting to an alternate destination
> address, the source address and originating interface are not
> changing. Set up your routing tables so that each destination
> address can only be reached by one interface and see if the
> problem disappears. Also watch out for default routes. You can
> try removing default routes for the purposes of testing. If the
> problem disappears it is a routing problem on retransmissions.
>
> Aside from that, slap Ethereal on both interfaces on both hosts
> and you should see what is happening.
>
> A better solution than cranking down RTO.min for increasing
> failover performance is CMT (Concurrent Multipath Transfer).
> At moderate data rates, failover is smooth and immediate, and
> queuing delays are impacted only by milliseconds on fast
> connections. Only a proportion of data is
> affected by interface/single-path failures and the data on the
> other paths remains completely unaffected.
>
> --brian
>
> On Fri, 11 Feb 2005, Randall Stewart wrote:
>
>
>>Anatoly:
>>
>>Shridhar can probably shed more light on this than I .. but
>>I have a couple of ideas.
>>
>>In theory the sctp stack SHOULD send retransmissions to the
>>alternate address.. I would think this is NOT happening.. Even
>>if it does happen for some time you will still have some
>>delays in messages... Think of it this way
>>
>>--msg-1--->
>>--msg---->
>>--msg-n-->
>>T.O (1sec)
>>--resend msg1-n to alternate
>>--msg-n+1-->
>>--msg-n+..->
>>--msg-n+m-->
>>T.O (2sec)
>>--resend msg(n-m) to alternate
>>--msg+m-1-->
>>
>>etc..
>>
>>until you get
>>
>>TO's
>>1sec
>>2sec
>>4sec
>>8sec
>>16sec
>>32sec
>>---
>>or about 63 seconds after failure the primary will be declared dead
>>and new transmissions will go to the alternate (which is now
>>the primary).
>>
>>In between the break and the 6 timeouts, messages will be delayed
>>anywhere from 1 - 32 seconds .. but they should still get through.
>>
>>The fact that you don't see any messages for 60 or so seconds indicates
>>to me that maybe the retransmit to alternate is not working in lk-sctp,
>>or maybe there is an option to turn it on??
>>
>>In any event the only way to keep the network failover time
>>down is to set RTO.Max to a lower value.. that would make things
>>faster... To have a 1 second failover I would imagine that
>>Ulticom's stack is setting both RTO.Min and RTO.Max to a
>>lower value... aka that adds up to a total of 1 second..
>>I.e. something like 50ms RTO.Min and 400ms RTO.Max
>>
>>Now, as discussed on the tsv, when you do this you need to make
>>sure the receiver is also cranking down its delayed sack timer
>>to be smaller than RTO.Min .. otherwise you are going to get
>>T3-Timeouts on normal sack delay when only one TSN is sent
>>and there is nothing else to send..
>>
>>Hope that helps..
>>
>>R
>>
>>Anatoly Khusid wrote:
>>
>>>Hello,
>>>
>>> I am using the Linux SCTP implementation (LKSCTP) on the SLES9 (2.6.5-7.111.19-smp)
>>>distribution.
>>> I have a client application that is sending data to a server on a remote
>>> machine. The machines are connected over two private LANs.
>>> When I disconnect a primary interface, I expect SCTP to start using an
>>> alternative LAN as soon as possible. Well, it takes about one minute for
>>> LKSCTP to detect that LAN is down, before it starts to transmit data
>>> messages on another LAN. I don’t have any data messages lost, but I have a
>>> 1-minute delay during which time no data is received by a server, after
>>> about one minute, the data transmission resumes through the alternative
>>> interface.
>>> I am curious why it takes so long to detect a LAN failure? I am using all
>>> the defaults for SCTP provisioning. (In fact I used getsockopt() to verify
>>> that the defaults match SCTP specs).
>>> Based on this section of the SCTP RFC I would expect the switchover to take a
>>> matter of seconds at most. I am using Ulticom’s SCTP implementation and
>>> the switchover only takes about 1 second. Could anyone please shed some
>>> light on this?
>>>
>>> Section 6.4 of SCTP RFC 2960:
>>> Furthermore, when its peer is multi-homed, an endpoint SHOULD try to
>>> retransmit a chunk to an active destination transport address that is
>>> different from the last destination address to which the DATA chunk was
>>> sent.
>>>
>>> Thanks,
>>>
>>> Anatoly Khusid
>>> Ulticom Inc.
>>> Senior Software Engineer
>>>
>>
>>
>>--
>>Randall Stewart
>>ITD
>>803-345-0369 <or> 815-342-5222
>
>
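
For reference, the backoff arithmetic in the quoted message above (1 + 2 + 4 + 8 + 16 + 32 = about 63 seconds) can be checked with a short sketch. It assumes the RTO has already collapsed to RTO.Min on a low-latency LAN, that the RTO doubles on each T3 timeout up to RTO.Max, and that the path is declared dead on the sixth consecutive timeout (Path.Max.Retrans = 5, the RFC 2960 default). The function names are hypothetical, and the 200 ms delayed-SACK value is only an illustrative assumption, not a measured setting of any stack.

```python
def t3_timeout_schedule_ms(rto_min_ms, rto_max_ms, path_max_retrans=5):
    """Return the successive T3 timeout durations (ms) until path failure.

    Simplified model: the first timeout equals RTO.Min, each later one
    doubles, capped at RTO.Max. The path is considered failed once the
    error count exceeds path_max_retrans, i.e. after
    path_max_retrans + 1 consecutive timeouts.
    """
    timeouts = []
    rto = rto_min_ms
    for _ in range(path_max_retrans + 1):
        timeouts.append(rto)
        rto = min(rto * 2, rto_max_ms)
    return timeouts


def sack_delay_causes_spurious_t3(sack_delay_ms, rto_min_ms):
    """True if a lone DATA chunk could hit T3 before the delayed SACK is sent."""
    return sack_delay_ms >= rto_min_ms


# RFC 2960 defaults: RTO.Min = 1 s, RTO.Max = 60 s.
defaults = t3_timeout_schedule_ms(1000, 60000)
# The tuned values suggested in the thread: RTO.Min = 50 ms, RTO.Max = 400 ms.
tuned = t3_timeout_schedule_ms(50, 400)

print("defaults:", defaults, "total ms:", sum(defaults))
print("tuned:   ", tuned, "total ms:", sum(tuned))
# Assumed 200 ms delayed-SACK timer at the receiver against RTO.Min = 50 ms.
print("spurious T3 risk:", sack_delay_causes_spurious_t3(200, 50))
```

Under this model the defaults give 63000 ms before the primary is declared dead, and the 50 ms / 400 ms settings give 1550 ms, which is in the same ballpark as the 1 second failover mentioned. The last check illustrates why the receiver's delayed SACK timer has to be brought below RTO.Min when RTO.Min is lowered.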

-- 
Randall Stewart
ITD
803-345-0369 <or> 815-342-5222
Received on Fri Feb 11 14:22:19 2005
