
The Case for Periodic Infrastructure Reviews

Organizations continue to discover that their infrastructure contains vulnerabilities that can take it down for hours or days. When was your network last reviewed? Why wait for a failure?

Terry Slattery

March 3, 2015



All Systems Down
Is your network about to crash? How do you know it isn't? When was the last time that it was reviewed? Think of an infrastructure review as the equivalent of the 120-point automobile inspection.

Read about a major outage in "All Systems Down," which appeared in CIO magazine in 2003. In the article, John Halamka, CIO of Beth Israel Deaconess Medical Center, describes dealing with a four-day network outage. The article is interesting because it goes into great detail about what happened and the steps his team took to get the network running again. In summary, it started with a massive spanning tree forwarding loop that consumed network bandwidth and eventually caused network devices to crash.

What does this 12-year-old article have to do with today's networks? Well, events like it continue to happen. Paul Whimpenny, Senior Officer for IT Architecture in the IT Division of the Food and Agriculture Organization of the United Nations, describes a similar network outage in "Our bullet-proof LAN failed. Here's what we learned." Fortunately, Whimpenny's outage lasted only four hours.

Common to both outages was a spanning tree problem. Spanning tree network design is one of the key network functions that we include in our network assessment. (I use the term "our" in reference to NetCraftsmen, the consulting company that employs me. I created the first version of our network assessment process and draft report template a good number of years ago. Automated tools help streamline the network data collection and analysis process.)

Think about it. When was the last time your network and UC infrastructure were reviewed? A good review is actually a detailed audit of the network and UC infrastructure. It should examine the design, operational data, and operations. The result should identify the things that are working correctly, as well as the areas that need review and remediation.

Why Failures Happen
One of the things we look for in an assessment is whether the spanning tree design is actually making redundant data centers into a larger, single, distributed data center. Problems in one data center can be propagated by the protocol to the other data center. Visually, this looks like a barbell design. Each data center is a weight at one end of the link that connects them. That's probably not what was intended. In fact, it is often the result of the network growing and changing over time.

Another common source of outage is failed redundancy. A network will be designed and built with redundant elements and links. But then a redundant component will fail, and because the system is very resilient, the failure doesn't cause an outage. If network and UC monitoring systems are not in place, not properly configured, or not used on a regular basis, the failure isn't noticed. It is only when the second failure occurs that the first failure is found. It is common to find that the first failure occurred months or weeks before the second failure. There was plenty of time to correct the first failure and avoid an outage, if it had only been discovered in time.
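As a rough illustration of catching that first failure, the sketch below flags any redundancy group that is still carrying traffic but has lost a member. It is only a sketch under stated assumptions: the Member class, the degraded_groups function, and the device names are hypothetical, and in practice the up/down state would come from whatever monitoring system is already in place.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One element of a redundant pair or group (hypothetical model)."""
    name: str
    up: bool


def degraded_groups(groups):
    """Return {group: [failed members]} for groups that still work
    but are running without full redundancy."""
    report = {}
    for group, members in groups.items():
        down = [m.name for m in members if not m.up]
        still_serving = any(m.up for m in members)
        # A silent first failure: something is down, yet service continues.
        if down and still_serving:
            report[group] = down
    return report


# Hypothetical inventory; real state would come from monitoring data.
inventory = {
    "core-uplink": [Member("core1-te1/1", True), Member("core2-te1/1", False)],
    "dc-firewall": [Member("fw-a", True), Member("fw-b", True)],
}

print(degraded_groups(inventory))
```

Running a check like this on a schedule, and alerting on any non-empty result, is one way to make sure the first failure is noticed before the second one arrives.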

On occasion, an infrastructure review of ours will find a network that is like an old farmhouse. It started as a one- or two-room building. Then, as the family grew, rooms and wings were added onto the existing structure. To reach one bedroom, you have to walk through another bedroom. The "old farmhouse" networks are similar. They often include single points of failure, where one part of the network connects to the core of the network via a single path. In many cases, this was the expedient way to provide network connectivity that had not been planned. When asked about the lack of redundancy, many of these network administrators say that they intended to go back and correct it, but have not had time or have simply forgotten about it.
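A single path of this kind can be found mechanically once the topology is written down as a list of links. The following is a minimal sketch, not a description of any particular assessment tool: it removes each link in turn and checks whether its two ends can still reach each other. The function names and the sample topology are hypothetical, and parallel links between the same two devices are treated as one link.

```python
from collections import deque


def reachable(adj, start, skip_link):
    """Breadth-first search from start, ignoring one link."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in adj[node]:
            if frozenset((node, peer)) == skip_link:
                continue
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def single_path_links(links):
    """Return links whose loss would cut one end off from the other."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return [
        (a, b)
        for a, b in links
        if b not in reachable(adj, a, frozenset((a, b)))
    ]


# Hypothetical topology: a redundant core triangle plus one "farmhouse" wing.
links = [
    ("core1", "core2"),
    ("core1", "dist1"),
    ("core2", "dist1"),
    ("dist1", "annex"),
]

print(single_path_links(links))
```

The brute-force approach is slow on very large topologies but easy to reason about, which matters more for a periodic review than raw speed.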

I've also seen network problems created because the network staff misunderstood some operational data and installed a configuration that exacerbated a problem. A good example of this is configuring too many buffers on an interface that's dropping packets.

Operations
Network operations figure into almost every network failure. Occasionally, a fundamental design flaw causes a problem, but most often, it is a lapse in running the network that allows a failure to create an outage. Policies, processes, and procedures are key to good operations. If you think that these three things are the same, or at least very similar, take a look at this link for a description of them.

For example, a good design policy is to not extend Layer 2 networking between data centers. Violations of this policy contributed to the failure that Whimpenny experienced and were probably also a factor in the Beth Israel Deaconess Medical Center outage. Policies should cover many design principles as well as when and how to enact processes and procedures. They are the rules for designing and running the network. Processes describe what to do when something needs to be done, while procedures are the specific steps that must be followed to implement a process and who should perform them. Knowing the process for breaking spanning tree loops and the procedure to follow, with specific staff assigned to perform those steps, would have helped with both of the above problems.
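A policy like the Layer 2 rule is easier to enforce when it can be checked automatically. The sketch below assumes a hypothetical inventory that maps each inter-data-center link to the VLANs it carries, with a routed (Layer 3) link carrying none. The function name, link names, VLAN numbers, and the optional exception list are all illustrative assumptions.

```python
def l2_extension_violations(inter_dc_links, exceptions=frozenset()):
    """Return {link: [VLANs]} for inter-data-center links that carry
    VLANs not covered by an approved exception."""
    violations = {}
    for link, vlans in inter_dc_links.items():
        extended = set(vlans) - set(exceptions)
        if extended:
            violations[link] = sorted(extended)
    return violations


# Hypothetical inventory: an empty list means the link is routed only.
inter_dc_links = {
    "dc1-core1 <-> dc2-core1": [],
    "dc1-agg2 <-> dc2-agg2": [110, 120],
}

print(l2_extension_violations(inter_dc_links))
```

Keeping approved exceptions in an explicit list makes each deviation from the policy a recorded decision rather than something that quietly accumulates as the network grows.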

One operations idea that I've rarely seen in networking is failure testing. When was the last time that redundancy failover was tested in your network? This means taking down a major device or link and verifying that the redundant infrastructure works as designed. In a well-designed network, there will be no outage. Routing will automatically switch to the backup path with little or no packet loss.
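One way to put a number on "little or no packet loss" during such a test is to send probes at a fixed interval across the path while the device or link is taken down, then summarize the results. The sketch below only covers the summary step. It assumes the probe results have already been collected as a list of booleans in send order, and the function name and sample data are hypothetical.

```python
def summarize_probes(probes, interval_s):
    """Summarize probe results recorded during a failover test.

    probes: list of booleans in send order, True if a reply came back.
    interval_s: seconds between consecutive probes.
    """
    lost = probes.count(False)
    longest = run = 0
    for ok in probes:
        # Track the longest run of consecutive lost probes.
        run = 0 if ok else run + 1
        longest = max(longest, run)
    return {
        "loss_pct": 100.0 * lost / len(probes),
        "longest_gap_s": longest * interval_s,
    }


# Hypothetical test run: 50 probes, 0.5 s apart, 3 lost during failover.
results = [True] * 20 + [False] * 3 + [True] * 27
summary = summarize_probes(results, interval_s=0.5)
print(summary)
```

The longest gap is usually the more useful figure, since it approximates how long the network took to converge on the backup path, and it can be compared from one test to the next.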

For more information about UC infrastructure, attend the Enterprise Connect session "Preparing Your Infrastructure for UC" with Terry Slattery and John Bartlett on Monday, March 16, 2015 at 2pm. Register with code NJSPEAKER to get $300 off Entire Event or Tues-Thurs pass.

About the Author

Terry Slattery


Terry Slattery is a Principal Architect at NetCraftsmen, an advanced network consulting firm that specializes in high-profile and challenging network consulting jobs.  Terry works on network management, SDN, network automation, business strategy consulting, and network technology legal cases. He is the founder of Netcordia, inventor of NetMRI, has been a successful technology innovator in networking during the past 20 years, and is co-inventor on two patents. He has a long history of network consulting and design work, including some of the first Cisco consulting and training. As a consultant to Cisco, he led the development of the current Cisco IOS command line interface. Prior to Netcordia, Terry founded Chesapeake Computer Consultants, a Cisco premier training and consulting partner.  Terry co-authored the successful McGraw-Hill text "Advanced IP Routing in Cisco Networks," is the second CCIE (1026) awarded, and is a regular speaker at Enterprise Connect. He blogs at nojitter.com and netcraftsmen.com.