From: Andre Oppermann
Date: Fri, 21 Sep 2001 00:48:25 +0200
To: info@pipeline.ch
Subject: Explanation of Service Interruptions on Tuesday

Dear Internet Pipeline Customers

This message explains the service interruptions of last Tuesday.

What happened:

Last Tuesday, 18.9.2001, at approximately 16:00 MET, the Nimda virus hit the Internet. At approx. 18:00 some customer servers in the Internet Pipeline Server Housing got infected. The virus does two things:

 1. It scans neighboring IP ranges.
 2. It tries to replicate itself further.

This behavior caused a massive surge of TCP traffic on the Pipeline network. The normal routing level is around 500 Packets per Second (Server Housing). During Tuesday evening it went beyond 8'000 Packets per Second (incoming and outgoing). Such a high load overwhelmed the router connecting the Server Housing (customer.pipeline.ch). This in turn caused major instabilities in the internal OSPF routing and led to packet ping-pong between the core1 and core2 routers, which connect us to our backbone upstreams (Nextra, TIX and Colt). The core2 router then also broke down, which caused parts of the Internet to lose connectivity to Pipeline. Other parts were still reachable through core1 and Colt.

After 18:00, servers in the Server Housing became virtually unreachable due to high traffic and router outages. Other customers, such as leased line and ADSL customers, could not reach many parts of the Internet but were not entirely disconnected.

At approx. 18:15 we were notified via our SMS alarm service. Finding the cause was very complex because of the number of systems affected. At first it looked like a DoS attack launched from one of the servers in the Server Housing, so we tried to identify the machine causing it. Although we succeeded in finding one of the infected machines and taking it offline, the traffic levels did not go down significantly. This puzzled us greatly and caused major headaches. The guessing and tracking took around 2 hours.

At approx. 20:30 we got access to the first reports about the Nimda virus and its behavior. It then became obvious that we had no real chance of tracking down all infected machines and taking them offline. In addition, much of the traffic consisted of scan probes from outside. So we refocused our efforts on stabilizing the network and the routers. This involved tuning the routers to handle the high number of Packets per Second. By approx. 23:00 the network had stabilized, and with more information about what was going on in the Internet we reached a fully stable state at 23:30.

What do we do about it:

On the technical side we decided to replace the customer.pipeline.ch router with a more powerful machine and to upgrade it to a newer software version capable of rate limiting, traffic shaping and fair queuing. The rate limiting will limit the number of Packets per Second for each individual server in the Server Housing. The value will be set quite high, at 1'000 Packets per Second per server, so that it does not limit you even under heavy legitimate usage. DoS attacks and flooding are thereby contained and will cause only minor service degradation for the other servers.
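
To illustrate how such a per-server limit behaves, here is a minimal sketch of a token-bucket limiter in Python. It is not the router's actual software or configuration. The class name, the bucket size of one second of traffic, and the evenly spaced packet arrivals are assumptions made purely for illustration; only the limit of 1'000 Packets per Second and the traffic levels of 500 and 8'000 Packets per Second are taken from this message.

```python
class PacketRateLimiter:
    """Token bucket for one server (hypothetical name, illustration only)."""

    def __init__(self, rate_pps=1000.0, burst=1000.0):
        self.rate_pps = rate_pps  # refill rate: the 1'000 pps limit from this message
        self.burst = burst        # bucket size: assumed to be one second of traffic
        self.tokens = burst       # the bucket starts full
        self.last = 0.0           # time of the previous packet, in seconds

    def allow(self, now):
        """Return True if a packet arriving at time `now` may be forwarded."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, but never beyond the bucket size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # each forwarded packet costs one token
            return True
        return False            # bucket empty: the packet is dropped


def forwarded_in_one_second(offered_pps):
    """Count forwarded packets when offered_pps packets arrive evenly
    spread over one second at a fresh limiter (assumed arrival model)."""
    limiter = PacketRateLimiter()
    return sum(limiter.allow(i / offered_pps) for i in range(offered_pps))


if __name__ == "__main__":
    print("normal load, 500 pps offered:", forwarded_in_one_second(500))
    print("flood, 8000 pps offered:", forwarded_in_one_second(8000))
```

In this sketch a server at the normal level of around 500 Packets per Second never notices the limit, while a flood of 8'000 Packets per Second is cut down to roughly the initial bucket plus one second of refill.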

If this limit is reached, we will be informed so that we can react. The core1 and core2 routers will also receive a software upgrade so that they can handle such high loads without problems.

On the procedural side we decided to set up a phone number with a recorded voice message. In case of a problem we will record a new message describing the nature of the problem and our estimated time to fix it. The message will be updated as we learn more about a specific problem. We will also send an email notification (although this obviously does not help in case of a major problem). An interesting idea is to send an SMS to the affected parties. We are currently researching the best software for sending the same SMS to many recipients. Of course this software has to work over a phone line and not over the Internet.

The replacements and upgrades will be done during next week. You will receive advance notice.

On a side note: due to the big success of our ADSL products we will upgrade our ADSL LNS router ahead of schedule, by mid-October. The new system is a Cisco 7206VXR with an NSE-1 and 256MB of RAM. It will replace the current Cisco 2650, which is not limited by CPU power or bandwidth, but whose IOS software does not allow that many connections on this platform.

Best Regards
-- 
Andre Oppermann
Internet Pipeline AG