Using Systems Monitoring to Improve IT Operations

I was meeting with a potential client when we were interrupted by the help desk manager.  “Email is down again!  Users have been calling telling us they can’t access their inbox.”  While it was clear this meeting was about to end, I knew we could help this company’s IT organization move from reactive to proactive and avoid such issues.  When the help desk is the first line of defense for reporting outages, there is plenty of opportunity for improvement.

When it comes to system monitoring, most organizations fall into one of four categories:

  • End-user monitoring – relying on end users to tell IT when systems aren’t working
  • Needle in a haystack – using a simple tool to “ping” a device, which only tells us whether the device is reachable, not whether it is actually servicing business transactions
  • There’s a tool for that – allowing every IT person or team to pick their own best-of-breed tool to monitor the one thing they care about (so no one sees the big picture)
  • Comprehensive monitoring – enabling an IT organization to see the entire end-to-end state of IT systems, understand the impact of an outage, and most importantly, predict issues before they impact business transactions
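
To make the gap between a “ping” and a service check concrete, here is a minimal sketch in Python.  It only goes one step beyond a ping: it confirms that something is accepting TCP connections on the service port.  The function name is hypothetical, and a fuller check would go on to perform a real transaction, such as sending a test message.

```python
import socket


def tcp_service_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    A device can answer a ping while the service on it is stopped, so this
    checks the service port itself rather than the host alone.
    """
    try:
        # create_connection performs name resolution and the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `tcp_service_reachable("mail.example.com", 25)` would test the standard SMTP port on a placeholder mail host.  The host name is illustrative only.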

Nearly all IT organizations have adopted server virtualization technologies to increase flexibility and reduce costs.  Virtualized environments are easy to scale dynamically, which lets organizations squeeze more savings out of existing hardware and avoid “overbuying” for peak capacity or growth that may never come.

But virtualization can have unintended consequences for reactive IT organizations.  Should additional capacity be required, the necessary “headroom” may no longer be available, and end-user performance may suffer.  In the case of my prospective client, the email server did not crash; it simply shut down because a database volume ran out of disk space.  Had system administrators been alerted to the critical disk space constraint, five minutes of non-intrusive effort would have avoided this downtime.
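
The kind of disk space alert that would have caught this can be sketched in a few lines of Python.  This is an illustration only, not how any particular monitoring product implements it; the function names and the 80 and 90 percent thresholds are assumptions chosen for the example.

```python
import shutil


def classify_disk_usage(used_bytes: int, total_bytes: int,
                        warn_pct: float = 80.0, crit_pct: float = 90.0) -> str:
    """Map raw usage numbers to an alert level (thresholds are assumed values)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct_used = 100.0 * used_bytes / total_bytes
    if pct_used >= crit_pct:
        return "CRITICAL"
    if pct_used >= warn_pct:
        return "WARNING"
    return "OK"


def check_volume(path: str, warn_pct: float = 80.0, crit_pct: float = 90.0) -> str:
    """Read usage for the volume that contains `path` and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_disk_usage(usage.used, usage.total, warn_pct, crit_pct)
```

Run on a schedule against the database volume, a check like this would raise a warning well before the volume fills, leaving time for the five-minute fix.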

And while server virtualization decreases the number of physical servers installed, it also encourages server sprawl.  Now that an IT organization can spin up a new server in minutes without the need for a purchase order, the total number of servers is increasing at a rapid pace.  This can help IT organizations respond to business requests, but it also increases the overall complexity of the environment and makes it harder to understand the total impact of a failure without a post-mortem review of help desk tickets.

Both of these unintended consequences can be addressed by deploying robust systems monitoring tools.  Previously, these tools required sizable investments and dedicated IT staff, but the landscape has changed.  Tools like Microsoft’s System Center Operations Manager (SCOM) allow companies to use a single, extensible platform to monitor all systems within the data center and across the organization (even down to the individual desktop, if desired) at a price point that can fit an organization of any size.

By providing end-to-end visibility, these tools enable IT management teams to define IT services (e.g., business transactions, applications, virtual servers, and network devices) and see the impact should any one of those components fail.  More importantly, they allow IT management to prioritize incident response based on root cause and impact on business operations.  System Center Operations Manager is well suited for this task because its robust integration platform allows dozens of vendors to introduce hundreds of “management packs” that add heterogeneous IT devices to the scope of the monitoring system.  Through partner integrations, SCOM can monitor everything from network bandwidth utilization to storage area networks to VMware to power management devices and more.

Monitoring tools also allow IT organizations to forecast and perform true capacity planning.  Using historical data, IT administrators can see trends in disk utilization, system performance, and other attributes that allow them to increase capacity just before it’s required, not months or years before it’s necessary or weeks after the business has suffered from performance degradation.
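
As a rough illustration of trend-based forecasting, the Python sketch below fits a straight line (ordinary least squares) to historical disk usage samples and estimates how many days remain before the volume is full.  The function name and data layout are hypothetical, and real growth is rarely perfectly linear, so treat the result as an estimate rather than a deadline.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days from the latest sample until usage reaches capacity.

    samples: (day_number, used_gb) pairs taken from historical monitoring data.
    Returns None when usage is flat or shrinking, so no fill date is projected.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)

    slope = sxy / sxx                      # growth in GB per day
    intercept = mean_y - slope * mean_x    # fitted usage at day 0
    if slope <= 0:
        return None

    day_full = (capacity_gb - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, day_full - last_day)
```

An administrator could compare the returned value with the lead time needed to add storage and schedule the expansion for just before that point.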

Monitoring tools can be deployed in a matter of weeks, not months, giving the IT organization a quick return on its investment.  West Monroe Partners has developed a methodology that helps ensure these deployments are successful by using a series of alert and threshold tuning exercises so that real alerts are not lost in the white noise of excessive alerts.

To learn more about systems management and Microsoft System Center Operations Manager, please contact Nate Ulery.