>From oppermann@pipeline.ch Fri Sep 21 00:49:02 2001
Return-Path: <oppermann@pipeline.ch>
Delivered-To: andre@pipeline.ch
Received: (qmail 64937 invoked by uid 1111); 20 Sep 2001 22:49:02 -0000
Delivered-To: info@pipeline.ch
Received: (qmail 64930 invoked from network); 20 Sep 2001 22:49:01 -0000
Received: from unknown (HELO pipeline.ch) ([62.48.21.22]) (envelope-sender <oppermann@pipeline.ch>)
          by mailtoaster1.pipeline.ch (qmail-ldap-1.03) with SMTP
          for <info@pipeline.ch>; 20 Sep 2001 22:49:01 -0000
Message-ID: <3BAA7239.9E89A697@pipeline.ch>
Date: Fri, 21 Sep 2001 00:48:25 +0200
From: Andre Oppermann <oppermann@pipeline.ch>
X-Mailer: Mozilla 4.76 [en] (Windows NT 5.0; U)
X-Accept-Language: en
MIME-Version: 1.0
To: info@pipeline.ch
Subject: Explanation of Service Interruptions on Tuesday
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mozilla-Status2: 00000000
Status: RO

Dear Internet Pipeline Customers

This message explains the Service Interruptions of last Tuesday.

What happened:

 Last Tuesday, 18.9.2001, at approximately 16:00 MET, the Nimda virus
 hit the Internet.

 At approx. 18:00 some Customer Servers in the Internet Pipeline Server
 Housing became infected.

 The virus does two things: 1. it scans neighboring IP ranges and 2.
 it tries to replicate itself further.

 This behavior caused a massive surge of TCP traffic on the Pipeline
 network. The normal routing level is around 500 Packets per Second
 (Server Housing). During the evening of last Tuesday it went beyond
 8'000 Packets per Second (incoming and outgoing), more than 16 times
 the normal level.

 Such a high load overwhelmed the router connecting the Server Housing
 (customer.pipeline.ch). This in turn caused major instabilities
 in the internal OSPF routing and led to packet ping-pong between the
 core1 and core2 routers connecting to our backbone upstreams (Nextra,
 TIX and Colt). The core2 router then also broke down and caused parts
 of the Internet to lose connectivity to Pipeline. Other parts were
 still reachable through core1 and Colt.

 The situation after 18:00 was that Servers in the Server Housing
 became virtually unreachable due to high traffic and router outages.
 Others, such as leased line customers and ADSL customers, could not
 reach many parts of the Internet but were not entirely disconnected.

 At approx. 18:15 we were notified via our SMS alarm service.

 Diagnosing the problem was very difficult because of the number of
 systems affected. At first it looked like a DoS attack launched from
 one of the Servers in the Server Housing. We then tried to identify
 the machine causing it. While we succeeded in finding one of the
 infected machines and taking it offline, the traffic levels did not
 go down significantly. This puzzled us greatly and caused major
 headaches. This guessing and tracking took around 2 hours.

 At approx. 20:30 we got access to the first reports of the Nimda
 virus and its behavior. It then became obvious that we had no
 real chance of tracking down all infected machines and taking them
 offline. Also, much of the traffic consisted of scan probes from
 outside.

 So instead we refocused our efforts on stabilizing the
 network and routers. This involved tuning the routers to handle
 this high number of Packets per Second.

 By approx. 23:00 the network stabilized, and with even more
 information about what was happening on the Internet we
 reached a fully stable state at 23:30.


What we are doing about it:

 On the technical side we decided to replace the customer.pipeline.ch
 router with a more powerful machine and also to upgrade to a newer
 software version capable of rate limiting, traffic shaping and
 fair queuing.

 The rate limiting will limit the number of Packets per Second per
 individual Server in the Server Housing. This value will be set
 quite high so that it will not limit you even under high usage. The
 value will be 1'000 Packets per Second for each server. This way DoS
 attacks or flooding are contained and will cause only minor service
 degradation for the other servers. If this limit is reached we will
 be informed so we can react.
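
 As an illustration of the per-server rate limiting described above,
 here is a minimal Python sketch of a token bucket. It is not the
 router's actual configuration. The class name PacketRateLimiter, the
 burst size and the example addresses are assumptions made for this
 sketch; only the limit of 1'000 Packets per Second comes from the
 plan above.

```python
import time


class PacketRateLimiter:
    """Token bucket per server (hypothetical helper, not router code)."""

    def __init__(self, rate=1000, burst=1000, clock=time.monotonic):
        self.rate = rate      # tokens (packets) refilled per second
        self.burst = burst    # assumed bucket size, not from the plan
        self.clock = clock    # injectable time source, in seconds
        self.buckets = {}     # server address -> (tokens, last update)
        self.dropped = {}     # server address -> packets over the limit

    def allow(self, server):
        """Return True if one packet for `server` is within its limit."""
        now = self.clock()
        tokens, last = self.buckets.get(server, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[server] = (tokens - 1, now)
            return True
        # Over the limit: count the drop so an operator could be alerted.
        self.buckets[server] = (tokens, now)
        self.dropped[server] = self.dropped.get(server, 0) + 1
        return False


if __name__ == "__main__":
    limiter = PacketRateLimiter()
    # Example address from the documentation range, purely illustrative.
    passed = sum(limiter.allow("192.0.2.10") for _ in range(8000))
    print(passed, limiter.dropped.get("192.0.2.10", 0))
```

 Each server has its own bucket, so a flooding server uses up only
 its own allowance while the others are unaffected. The dropped
 counter stands in for the notification mentioned above.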

 Also the core1 and core2 routers will receive a software upgrade
 so that they can handle such high loads without problems.

 On the procedural side we decided to establish a phone number with
 a voice message. In case of a problem we will record a new voice
 message describing the nature of the problem and our estimated
 time to fix it. This message will be updated with new information
 as we learn more about a specific problem.

 In addition, an email notification will be sent (but this obviously
 does not help in case of a major problem).

 Another interesting idea is to send an SMS to the affected parties.
 We are currently researching the optimal software for sending the
 same SMS to many recipients. Of course this software has to work
 over a phone line and not over the Internet.


 The replacements and upgrades will be carried out next week. You
 will receive advance notice.

 On a side note: due to the big success of our ADSL products we will
 upgrade our ADSL LNS router ahead of schedule, by the middle of
 October. The new system is a Cisco 7206VXR with an NSE-1 and 256MB
 of RAM and will replace the current Cisco 2650 (which is not limited
 by CPU power or bandwidth, but whose IOS software does not allow
 that many connections on this platform).


Best Regards
-- 
Andre Oppermann

Internet Pipeline AG