Re: SCTP failover

From: Randall Stewart <rrs@cisco.com>
Date: Fri Feb 11 2005 - 13:08:06 EST
Anatoly:

Shridhar can probably shed more light on this than I .. but
I have a couple of ideas.

In theory the SCTP stack SHOULD send retransmissions to the
alternate address.. I suspect this is NOT happening.. Even
if it does happen, for some time you will still have some
delays in messages... Think of it this way:

--msg-1--->
--msg-..-->
--msg-n--->
T.O (1sec)
--resend msg 1..n to alternate
--msg-n+1-->
--msg-n+..->
--msg-n+m-->
T.O (2sec)
--resend msg n+1..n+m to alternate
--msg-n+m+1-->

etc..

until you get

TO's
1sec
2sec
4sec
8sec
16sec
32sec

---
or about 63 seconds after failure the primary will be declared dead
and new transmissions will go to the alternate (which is now
the primary).

In between the break and the 6 timeouts, messages will be delayed
anywhere from 1 - 32 seconds .. but they should still get through.
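
The arithmetic above can be checked with a small sketch. It assumes, as in the timeline, that the first timeout equals RTO.Min (1 second by default), that each timeout doubles the RTO up to RTO.Max, and that six consecutive timeouts (one more than the default Path.Max.Retrans of 5) mark the path dead. The function name failover_delay_ms is made up for illustration; it is not part of any SCTP stack.

```c
#include <assert.h>

/*
 * Hypothetical helper: total time spent in back-to-back T3 timeouts
 * before a path is declared dead. Simplified model: the RTO starts at
 * rto_start_ms, doubles after every timeout, and is capped at rto_max_ms.
 */
unsigned long failover_delay_ms(unsigned long rto_start_ms,
                                unsigned long rto_max_ms,
                                int timeouts)
{
    unsigned long rto = rto_start_ms;
    unsigned long total = 0;

    assert(timeouts >= 0);
    for (int i = 0; i < timeouts; i++) {
        if (rto > rto_max_ms)
            rto = rto_max_ms;   /* exponential backoff stops at RTO.Max */
        total += rto;           /* wait out this timeout */
        rto *= 2;               /* back off for the next one */
    }
    return total;
}
```

With the RFC 2960 defaults, failover_delay_ms(1000, 60000, 6) comes to 63000 ms, the 63 seconds in the list above. Under the same simple model, a 50 ms start capped at 400 ms comes to 1550 ms, so lowering both bounds is what brings failover down to the order of a second.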

The fact that you don't see any messages for 60 or so seconds indicates
to me that maybe the retransmit to alternate is not working in lksctp,
or maybe there is an option to turn it on??

In any event the only way to keep the network failover time
down is to set RTO.Max to a lower value.. that would make things
faster... To have a 1 second failover I would imagine that
Ulticom's stack is setting both RTO.Min and RTO.Max to
lower values, i.e. ones whose timeouts add up to a total of about 1 second..
Something like 50ms RTO.Min and 400ms RTO.Max.
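
For what it is worth, here is a rough sketch of how an application might lower these bounds itself, using the SCTP_RTOINFO socket option from the standard SCTP sockets API. The wrapper name set_rto_bounds is hypothetical, and I am assuming a Linux box whose kernel headers ship <linux/sctp.h> (lksctp-tools' <netinet/sctp.h> provides the same option). Whether the 2.6.5 kernel mentioned in this thread honors it the same way is not something I have checked.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IPPROTO_SCTP */
#include <linux/sctp.h>   /* SCTP_RTOINFO, struct sctp_rtoinfo */

/*
 * Hypothetical wrapper: set RTO.Initial, RTO.Min and RTO.Max (all in
 * milliseconds) on an SCTP socket. A value of 0 leaves that field
 * unchanged. Returns 0 on success, -1 with errno set on failure.
 */
int set_rto_bounds(int fd, uint32_t initial_ms,
                   uint32_t min_ms, uint32_t max_ms)
{
    struct sctp_rtoinfo rto;

    /* Reject an inverted range before asking the kernel. */
    if (min_ms != 0 && max_ms != 0 && min_ms > max_ms) {
        errno = EINVAL;
        return -1;
    }

    memset(&rto, 0, sizeof(rto));
    rto.srto_assoc_id = 0;          /* ignored on one-to-one sockets */
    rto.srto_initial  = initial_ms;
    rto.srto_min      = min_ms;
    rto.srto_max      = max_ms;

    return setsockopt(fd, IPPROTO_SCTP, SCTP_RTOINFO, &rto, sizeof(rto));
}
```

A call such as set_rto_bounds(fd, 50, 50, 400) on a socket created with socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP) would correspond to the 50ms/400ms example; those numbers are only the illustration from this mail, not a recommendation.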

Now, as discussed on the tsv, when you do this you need to make
sure the receiver is also cranking down its delayed SACK timer
to be smaller than RTO.Min .. otherwise you are going to get
T3 timeouts from the normal SACK delay when only one TSN is sent
and there is nothing else to send..
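
On the receiving side, the matching knob in the current SCTP sockets API is SCTP_DELAYED_SACK. The sketch below is only an illustration: set_sack_delay is a made-up wrapper, the sender's RTO.Min has to be known to the receiver out of band, and I am assuming Linux headers that provide <linux/sctp.h>. It refuses any delay that is not strictly below the sender's RTO.Min, which is the condition described above.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IPPROTO_SCTP */
#include <linux/sctp.h>   /* SCTP_DELAYED_SACK, struct sctp_sack_info */

/*
 * Hypothetical wrapper: set the delayed SACK timer (milliseconds) on the
 * receiver, but only if it is shorter than the sender's RTO.Min, so a
 * lone TSN is acknowledged before the sender's T3 timer can fire.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_sack_delay(int fd, uint32_t sack_delay_ms, uint32_t peer_rto_min_ms)
{
    struct sctp_sack_info sack;

    if (sack_delay_ms == 0 || sack_delay_ms >= peer_rto_min_ms) {
        errno = EINVAL;             /* would invite spurious T3 timeouts */
        return -1;
    }

    memset(&sack, 0, sizeof(sack));
    sack.sack_assoc_id = 0;         /* ignored on one-to-one sockets */
    sack.sack_delay    = sack_delay_ms;
    sack.sack_freq     = 0;         /* 0 keeps the current SACK frequency */

    return setsockopt(fd, IPPROTO_SCTP, SCTP_DELAYED_SACK,
                      &sack, sizeof(sack));
}
```

For a sender running with a 50 ms RTO.Min, set_sack_delay(fd, 20, 50) would be accepted by the wrapper, while the usual 200 ms delay would be rejected with EINVAL.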

Hope that helps..

R

Anatoly Khusid wrote:
> 
> Hello,
> 
>  I am using Linux SCTP implementation (LKSCTP) SLES9 (2.6.5-7.111.19-smp)
> distribution.
>  I have a client application that is sending data to a server on a remote
>  machine.  The machines are connected over two private LANs.
>  When I disconnect a primary interface, I expect SCTP to start using an
>  alternative LAN as soon as possible.  Well, it takes about one minute for
>  LKSCTP to detect that LAN is down, before it starts to transmit data
>  messages on another LAN.  I don’t have any data messages lost, but I have a
>  1-minute delay during which time no data is received by a server, after
>  about one minute, the data transmission resumes through the alternative
>  interface.
>  I am curious why it takes so long to detect a LAN failure?  I am using all
>  the defaults for SCTP provisioning. (In fact I used getsockopt() to verify
>  that the defaults match SCTP specs).
>  Based on this section in SCTP RFC I would expect the switchover to be in a
>  matter of seconds the most.  I am using Ulticom’s SCTP implementation and
>  the switchover only takes about 1 second.  Could anyone please shed some
>  light on this?
> 
>  Section 6.4 of SCTP RFC 2960:
>  Furthermore, when its peer is multi-homed, an endpoint SHOULD try to
>  retransmit a chunk to an active destination transport address that is
>  different from the last destination address to which the DATA chunk was
>  sent.
> 
>  Thanks,
> 
>  Anatoly Khusid
>  Ulticom Inc.
>   Senior Software Engineer
> 
-- 
Randall Stewart
ITD
803-345-0369 <or> 815-342-5222
Received on Fri Feb 11 13:11:32 2005
