
Exchange problems: what happened and why

The e-mail and calendaring service known as Exchange (exchange.vt.edu) is provided by multiple servers, two of which act as access points for user mailboxes. On Friday, April 29, 2005, one of these servers (fangorn.cc.vt.edu) began experiencing high CPU utilization, resulting in reduced responsiveness and long delays in the processing of incoming and outgoing messages. The issue continued into Saturday, April 30, and Sunday, May 1. Mail delivery was delayed and system response was slow, but because traffic volume is lower on weekends, users were still able to access the system.

On Sunday, while administrators were working with Microsoft support to analyze the issue, Fangorn had a disk drive failure, which further delayed the investigation into the cause of the high CPU utilization. The disk drive was replaced and the system returned to a functional state later that same day.

Performance of the system degraded substantially on Monday, May 2, as the continuing CPU utilization issues combined with the increased load at the beginning of the work week. During continued work with Microsoft support, the focus moved to the anti-virus software that runs on the Exchange server (to prevent viruses from being sent or received by Exchange users). Analysis showed that the majority of the CPU cycles were being consumed by virus scanning activities. On Microsoft's recommendation, the anti-virus software was upgraded to the latest version and patch level, but this did not improve system performance.

On Tuesday, May 3, Microsoft began a more in-depth analysis of the messages that were taking inordinate amounts of time to be processed by the system. This analysis identified a known issue in the Exchange software that causes the system to expend the majority of its CPU cycles in virus scanning activities. Microsoft recommended a change to the system configuration to avoid this problem. After this change was applied at around 11:30 AM that day, the system began clearing the backlog of queued messages and CPU utilization returned to normal levels.

The server was monitored closely for the remainder of the day on Tuesday. After the service had operated successfully under the new configuration for several hours, the same configuration change was applied to the other servers that provide the Exchange service, to prevent the same problem from occurring elsewhere. E-mail continued to flow and CPU utilization levels remained in the normal range after all servers were reconfigured.

After the configuration of the Exchange service was changed, the root cause of the problem was discovered. During their analysis, Microsoft support engineers determined that a particular kind of message known as a Delivery Status Notification was causing the problem. A Delivery Status Notification is sent to you, for example, when you attempt to send an e-mail message to an invalid address. The configuration change prescribed by Microsoft prevents excessive use of system resources in sending Delivery Status Notifications, and this change effectively solved the problem. One question remained, however: what was causing the system to expend so much effort processing these notifications, when they are routinely sent with no adverse impact on the system?

Shortly after the Exchange service was restored to normal operating levels by the aforementioned configuration change, an unrelated component of the university's e-mail processing systems became overloaded. Systems administrators began investigating this new e-mail problem and soon concluded that what had been causing so much trouble on the Exchange service was a Delivery Status Notification caught in a loop between the Exchange service and a misconfigured e-mail handler. In its previously overloaded condition, the Exchange service could not deliver this notification. After the Exchange service was reconfigured, the notification was delivered by means of the university's e-mail processing network, causing overloads all along the path.
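
A common way for a mail handler to notice a loop like this one is to count the Received: headers a message has accumulated, since each relay normally adds one. The following is a minimal Python sketch of that general technique and not a description of what any of the servers involved actually did. The function names and the MAX_HOPS threshold are hypothetical and chosen only for illustration.

```python
from email import message_from_string

# Hypothetical threshold for illustration; not a value taken from this incident.
MAX_HOPS = 25


def hop_count(raw_message: str) -> int:
    """Return the number of Received: headers on the top-level message."""
    msg = message_from_string(raw_message)
    return len(msg.get_all("Received", []))


def looks_like_loop(raw_message: str, max_hops: int = MAX_HOPS) -> bool:
    """Flag a message that has passed through more relays than max_hops."""
    return hop_count(raw_message) > max_hops
```

A handler that applied a check of this kind could reject or quarantine a looping message instead of forwarding it again.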

Early on the morning of Wednesday, May 4, administrators began reviewing the message queues on these devices and found several large messages (nearly 5 MB in size) that had resulted from the mail loop. These messages exhibited the symptoms described by Microsoft and had as many as 100 embedded Delivery Status Notifications. The messages were deleted from the system, and by 8:00 AM on Wednesday all e-mail services had returned to normal.
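
To illustrate what "embedded" means here, the sketch below uses Python's standard email package to measure how many levels of attached messages (message/rfc822 parts) are nested inside one another. It is an illustration under stated assumptions only: the helper names embedded_depth and wrap are hypothetical, the sample messages are synthetic, and the code is not the tooling used to examine the real queues.

```python
from email.message import EmailMessage, Message


def embedded_depth(msg: Message) -> int:
    """Return how many levels of message/rfc822 parts are nested inside msg."""
    deepest = 0
    if msg.is_multipart():
        for part in msg.get_payload():
            depth = embedded_depth(part)
            # A message/rfc822 container adds one level of embedding.
            if msg.get_content_type() == "message/rfc822":
                depth += 1
            deepest = max(deepest, depth)
    return deepest


def wrap(inner: EmailMessage) -> EmailMessage:
    """Build a synthetic notification that carries inner as an attachment."""
    outer = EmailMessage()
    outer["Subject"] = "Undeliverable"
    outer.set_content("Delivery failed.")
    outer.add_attachment(inner)  # attached as message/rfc822
    return outer


original = EmailMessage()
original["Subject"] = "Hello"
original.set_content("Original message body.")

# Simulate three trips around a loop, each wrapping the previous message.
looped = wrap(wrap(wrap(original)))
print(embedded_depth(looped))
```

A scanner or mail server that enforces a limit on this depth can stop processing a message once the limit is exceeded, rather than examining every nested layer.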

Time Outline

  1. Began migration of users from Mac Outlook client to Entourage client on February 1, 2005.
    • Because the manner in which users' files were stored changed (moving from the MS MAPI protocol to the open IMAP protocol), the storage used by both mailboxes and log files grew rapidly and unexpectedly. Some users' mailboxes were moved, and storage was added to accommodate this growth.
    • During the movement of mailboxes, one of the storage groups grew too large and overflowed its file system limits. This caused a temporary disruption of service. The migration is nearly complete, but remains ongoing.
  2. On Friday, April 29th, an unrelated problem surfaced on Fangorn, one of two Exchange e-mail servers, each of which serves its own set of users. Fangorn began experiencing chronic CPU overload, which slowed e-mail delivery and retrieval for the users associated with that server. Exchange services provided by the second server, Rivendell, continued normally for its users.
  3. Various steps were taken to address the Fangorn CPU load problems over the weekend, April 30th and May 1st, but none brought about a full recovery. One of the steps taken to address the system slowdown was to update the anti-virus software on the Exchange server to its latest version. Microsoft software engineering support suggested this because of a known issue with such software and what had been seen in the system dumps provided to them.
  4. During this procedure, one of the hard drives attached to Fangorn failed, causing a hardware-related system panic. This further delayed the analysis and resolution of the performance issue.
  5. At 7 AM on Monday, May 2nd, consultation with Microsoft (MS) continued. Once the hardware problem was resolved, the queues slowly built back up until the system was once again consistently near 100% busy. A variety of issues were investigated, including upgrading the anti-virus software. Performance problems continued, however.
  6. On Tuesday, May 3rd, an MS software engineer arrived in Blacksburg to participate in the resolution effort on-site. Phone support engineers recommended a Registry change affecting the number of embedded messages in a single piece of e-mail. Once the Information Store was halted and restarted around 12 noon, the queues began to empty out and the CPU busy level began to stabilize.
  7. By early afternoon of May 3rd, Exchange e-mail performance began to return to normal from an end-user's perspective and e-mail backlogged over the period began to be delivered. The anti-virus service was reactivated around 2pm.
  8. The messages that precipitated the problem on the Exchange server then flowed through to the Sun ONE (pop.vt.edu) server, where they were intercepted by the virus scanning devices. This caused a backlog on the Sun ONE server as queues formed on the virus scanners while they processed the messages.
  9. Corrective action was taken to address the load problems with the Sun ONE server and as of late evening May 3rd/early morning May 4th, both POP and Exchange e-mail services had returned to normal operation.