The e-mail and calendaring service known as Exchange (exchange.vt.edu) is provided by multiple servers, two of which act as access points for user mailboxes. On Friday, April 29, 2005, one of these servers (fangorn.cc.vt.edu) began experiencing high CPU utilization, resulting in reduced responsiveness and long delays in the processing of incoming and outgoing messages. The issue continued into Saturday, April 30 and Sunday, May 1. Mail delivery was delayed and system response was slow, but because traffic volume is lower on weekends, users were still able to access the system.
On Sunday, while administrators were working with Microsoft support to analyze the issue, Fangorn suffered a disk drive failure, which further delayed the investigation into the cause of the high CPU utilization. The disk drive was replaced and the system returned to a functional state later that same day.
Performance of the system degraded substantially on Monday, May 2, as the continuing CPU utilization issues were compounded by the increased load at the beginning of the work week. During continued work with Microsoft support, the focus moved to the anti-virus software that runs on the Exchange server (to prevent viruses from being sent or received by Exchange users). Analysis showed that the majority of the CPU cycles were being consumed by virus scanning activities. On Microsoft's recommendation, the anti-virus software was upgraded to the latest version and patch level, but this did not improve system performance.
On Tuesday, May 3, Microsoft began a more in-depth analysis of the messages that were taking inordinate amounts of time to be processed by the system. This analysis resulted in the identification of a known issue in the Exchange software that causes the system to expend the majority of CPU cycles in virus scanning activities. Microsoft recommended a change to the system configuration to avoid this problem. After this change was applied (at around 11:30 AM on Tuesday, May 3) the system began clearing the backlog of messages in queue and CPU utilization returned to normal levels.
The server was monitored closely for the remainder of the day Tuesday. After the service had operated successfully under the new configuration for several hours, the same configuration change was applied to the other servers that provide the Exchange service, to prevent the same problem from occurring elsewhere. E-mail continued to flow and CPU utilization levels remained in the normal range after all servers were reconfigured.
After the configuration of the Exchange service was changed, the root cause of the problem was discovered. During their analysis, Microsoft support engineers determined that a particular kind of message known as a Delivery System Notification was causing the problem. A Delivery System Notification is sent to a user, for example, when that user attempts to send an e-mail message to an invalid address. The configuration change prescribed by Microsoft prevents excessive use of system resources in sending Delivery System Notifications, and this change effectively solved the problem. One question remained, however: What was causing the system to expend so much effort processing these notifications, when they are routinely sent with no adverse impact to the system?
Shortly after the Exchange service was restored to normal operating levels by the aforementioned configuration change, an unrelated component of the university's e-mail processing systems became overloaded. Systems administrators began investigating this new e-mail problem and soon concluded that what had been causing so much trouble on the Exchange service was a Delivery System Notification caught in a loop between the Exchange service and a misconfigured e-mail handler. In its previously overloaded condition, the Exchange service could not deliver this notification. After the Exchange service was reconfigured, the notification was delivered by means of the university's e-mail processing network, causing overloads all along the path.
Early on the morning of May 4, administrators began reviewing the message queues on these devices and found several large messages (nearly 5 MB in size) that had resulted from the mail loop. These messages exhibited the symptoms described by Microsoft and had as many as 100 embedded Delivery System Notifications. These messages were deleted from the system, and by 8:00 AM on Wednesday all e-mail services had returned to normal.
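As an illustration only, the following Python sketch shows one way a queued message could be checked for the symptom described above, namely a large number of notifications embedded inside one another. It is not the procedure the administrators or Microsoft used. The function names, the threshold of 10, and the example.invalid addresses are all hypothetical, and the sketch assumes the notifications are standard MIME reports whose status part is labeled message/delivery-status.

```python
from email import message_from_string
from email.message import Message


def count_embedded_notifications(message: Message) -> int:
    """Count message/delivery-status parts anywhere in the MIME tree."""
    return sum(
        1
        for part in message.walk()
        if part.get_content_type() == "message/delivery-status"
    )


def looks_like_mail_loop(message: Message, threshold: int = 10) -> bool:
    """Flag a message whose nesting suggests a loop (threshold is an assumption)."""
    return count_embedded_notifications(message) >= threshold


def wrap_in_notification(original: str, level: int) -> str:
    """Build a hypothetical bounce that embeds `original`, for demonstration."""
    boundary = f"dsn-boundary-{level:04d}"
    return (
        "From: postmaster@example.invalid\n"
        "To: sender@example.invalid\n"
        "Subject: Undeliverable\n"
        "MIME-Version: 1.0\n"
        "Content-Type: multipart/report; report-type=delivery-status;"
        f' boundary="{boundary}"\n'
        "\n"
        f"--{boundary}\n"
        "Content-Type: text/plain\n"
        "\n"
        "Delivery failed.\n"
        f"--{boundary}\n"
        "Content-Type: message/delivery-status\n"
        "\n"
        "Reporting-MTA: dns; mail.example.invalid\n"
        "\n"
        "Final-Recipient: rfc822; nobody@example.invalid\n"
        "Action: failed\n"
        "Status: 5.1.1\n"
        f"--{boundary}\n"
        "Content-Type: message/rfc822\n"
        "\n"
        f"{original}\n"
        f"--{boundary}--\n"
    )


if __name__ == "__main__":
    # Simulate a loop: each pass wraps the previous bounce in a new one.
    text = "From: user@example.invalid\nSubject: hello\n\nOriginal body.\n"
    for level in range(12):
        text = wrap_in_notification(text, level)
    parsed = message_from_string(text)
    print(count_embedded_notifications(parsed), looks_like_mail_loop(parsed))
```

A check of this kind would only help in spotting suspect messages in a queue. Whether such messages can safely be deleted remains a decision for the administrators.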