After experiencing a 12-hour network outage that made the evening news, one carrier learned how Tektronix Communications’ solutions can identify the root cause of such an outage and help prevent it from happening again.
Eighty percent of network node failures in the core are caused by human change control errors.
Network outages are always a concern, but when one lasts 12 hours and becomes a news event, the cost to a carrier’s reputation and bottom line climbs steeply. This was the challenge facing our customer, a carrier that experienced a large-area outage caused by a “routine” software change to a network router.
The software change was conducted during the late-night “maintenance window” to minimize the impact on customers. During the change, the carrier’s data engineer made a keystroke error on a router located at a key distribution point in the network. The network equipment’s key performance indicators (KPIs), however, did not reflect any trouble.
By 8:00 a.m., customers were unable to access the internet or any data services. The outage made the news that evening. The carrier determined that the network outage was related to the router upgrade. After backing out the changes, it was able to restore the network later that day. In total, customers were without data connectivity for 12.5 hours.
The carrier performed a two-day root cause analysis to learn from the incident. It attributed the problem to an error in the provider edge (PE) router, which was blocking critical traffic. The PE router serves as the main link between the carrier’s network distribution sites across the area.
With the goal of avoiding an outage like this in the future, the carrier reached out to Tektronix Communications for information on its core network solutions. Tektronix Communications demonstrated how its solution would have alerted the operator to the router issue almost immediately.
One minute after the software upgrade, the solution reported a 96% Create Data Session failure rate (see Figure 1) and a massive spike in data latency (see Figure 2) for network subscribers. Had the carrier implemented the Tektronix Communications solution, it could have corrected the router issue before it became a newsworthy event.
Figure 1: Create data session failure rate
Figure 2: Data latency
“I didn’t know I had this type of fault isolation capability right in front of me, and that it is so easy to use.”
Carrier data engineer
Network outages are often caused by human error. Unfortunately, many network errors go undetected by standard network equipment KPIs. The Tektronix Communications performance troubleshooting suite provides a higher level of KPI granularity than any other solution in the industry, right down to one-minute increments. This type of detailed reporting can help operators detect issues before they lead to major network problems.
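To illustrate what one-minute KPI granularity means in practice, the following is a minimal Python sketch, not the Tektronix Communications implementation. It assumes session attempts are available as (timestamp, succeeded) records; the function names, the 50% alert threshold, and the sample timestamps are all hypothetical choices made for illustration. Only the 96% failure figure comes from the case above.

```python
from collections import defaultdict
from datetime import datetime


def failure_rate_per_minute(attempts):
    """Return {minute: failure_rate} for (timestamp, succeeded) records.

    The record format is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for timestamp, succeeded in attempts:
        # Truncate each timestamp to the start of its one-minute bucket.
        minute = timestamp.replace(second=0, microsecond=0)
        totals[minute] += 1
        if not succeeded:
            failures[minute] += 1
    return {m: failures[m] / totals[m] for m in sorted(totals)}


def minutes_over_threshold(rates, threshold=0.5):
    """List the minutes whose failure rate exceeds the (hypothetical) threshold."""
    return [minute for minute, rate in rates.items() if rate > threshold]


# Hypothetical sample data: a healthy minute, then the minute after a change.
before = datetime(2024, 1, 1, 2, 0)
after = datetime(2024, 1, 1, 2, 1)
sample = (
    [(before.replace(second=s % 60), s >= 2) for s in range(100)]   # 2 of 100 fail
    + [(after.replace(second=s % 60), s >= 96) for s in range(100)]  # 96 of 100 fail
)

rates = failure_rate_per_minute(sample)
alerts = minutes_over_threshold(rates)
```

With this sample, the first minute has a 2% failure rate and the second a 96% failure rate, so only the second minute is flagged. A per-minute check of this kind surfaces the problem within a minute of the change rather than hours later.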