4 am Phone Call – The answer is C.

Most IT-related exams are multiple choice. I do remember my teacher once saying, “When you get a call at 4 am, the correct answer isn’t C”.

Many years ago I was working for a company in Victoria, Australia. We had engaged an integrator to deploy a new WAN. The technology was GetVPN, which allows any site to talk to any other site via encrypted tunnels. These tunnels are not really tunnels but security associations between a source and a destination. GetVPN is designed to be used over an MPLS network, which gives us what we call any-to-any connectivity: traffic doesn’t need to pass through other sites, it can go straight to the destination.

This technology also relies on an underlying routing protocol to provide connectivity to all sites before the encryption takes place. I won’t go into any more detail, but I will try to capture what happened that morning and the intense pressure we felt.

I don’t recall the date or day, but I do recall the time. 4 am.

My colleague was on call, and he was the lucky one to receive the first call. All sites down. Over 120 sites, to be exact. This was not a normal outage. This must have been a change or something, surely? What would take down such a network?

Maybe it was the configuration that was deployed, which was standard (actual Cisco configs from the configuration guide... profile id 1234, lol)? Maybe we had an underlying carrier issue? Nope. Sites were dropping in and out, and getting the customer to reboot the router brought the site back online, but only for a few minutes.

The call came to me; I was the ‘one’ who had the best knowledge of GetVPN in the team. My colleague had worked out it was a GetVPN problem, an encryption problem. Yikes! That sounded technical and difficult. I had spent many hours reading and learning about GetVPN when it was deployed at our workplace, but I was still no expert. See, this technology relies on two very special routers, known as key servers. These guys are the backbone of the network, coordinating the encryption keys that are handed out to every node. At a configured interval they refresh the keys, so that everyone is using the same key. If you have some nodes using one key and others using an older key, guess what happens? It’s like two people who speak different languages trying to talk: they can’t understand each other.
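To make the stale-key problem concrete, here is a toy sketch in Python. It is not how GetVPN works on the wire: it stands in an HMAC tag for the real encryption, and the names `protect`, `accept`, `old_key` and `new_key` are made up for illustration. The only point is that a site holding an older key rejects traffic protected with the newer one.

```python
import hashlib
import hmac
import os


def protect(key: bytes, payload: bytes) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag computed under the group key."""
    return hmac.new(key, payload, hashlib.sha256).digest() + payload


def accept(key: bytes, packet: bytes):
    """Return the payload if the tag verifies under our key, else None (drop)."""
    tag, payload = packet[:32], packet[32:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None


# Hypothetical keys: one from before the refresh, one from after it.
old_key = os.urandom(32)
new_key = os.urandom(32)

packet = protect(new_key, b"hello from site A")

print(accept(new_key, packet))  # site B holds the same key: payload comes back
print(accept(old_key, packet))  # site C holds a stale key: None, packet dropped
```

Every site is healthy on its own here; the traffic dies only because the two ends disagree about the key.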

We tried many things that morning while I lay in bed on the phone: we rebooted the key servers and got the customers to reboot any routers they could. Still nothing. That was about 5:30 am, and that was when I decided we had to go into the office before everyone in the IT office got there.

I got in the car and headed straight for work.

I was frantic in the car, still on the phone trying to work out what the hell happened!

We needed more information, we needed data to troubleshoot, but we could not access the routers remotely to gather it.

Arriving at work, it hit us. We have a lab. A sweet, sweet lab, and it was in our office. We could troubleshoot from here! We spent about 30 minutes trying to debug and find the cause of the issue, but by this time it had all started to look the same. Managers started to come in and demanded answers. They were not harsh, they were helpful, but the entire WAN was down. We had to give constant updates... we are with TAC. Sites had to go to manual processes with absolutely no connection to the Data Centres. Phones didn’t work. No email, no nothing. Imagine sitting at your house with your phone and internet down, and how annoyed you would be. Now multiply that by 120 sites, with maybe at least 10 people per small site and 200+ at 5 large sites across the country. Is the correct answer C yet?

The next step, after gathering all the logs, was Cisco TAC. This is technical support from Cisco themselves. The experts.

I made the call and we got a guy from Texas. He was a GetVPN expert.

He was able to connect to my PC via Webex and found our first problem. Encryption was broken, and when that happens you need to make sure that certain protocols in GetVPN are never encrypted. This lets you build the underlying connectivity using a routing protocol, and if there is an encryption problem you can still manage the routers.

Routing updates, ping and SSH should not be encrypted. SSH is already encrypted anyway. We modified this on the key servers and suddenly we had SSH access to all routers. More troubleshooting continued.
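The idea of keeping a few protocols out of encryption can be sketched as a first-match rule list. This is a Python illustration, not the policy we actually ran: the `Packet`, `Rule` and `decide` names are hypothetical, OSPF is only an assumed example of a routing protocol, and matching SSH on destination port 22 alone is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    protocol: str                    # e.g. "tcp", "udp", "icmp", "ospf"
    dst_port: Optional[int] = None   # None for protocols without ports


@dataclass(frozen=True)
class Rule:
    action: str                      # "bypass" = send in the clear, "encrypt" = protect
    protocol: Optional[str] = None   # None matches any protocol
    dst_port: Optional[int] = None   # None matches any port

    def matches(self, pkt: Packet) -> bool:
        if self.protocol is not None and self.protocol != pkt.protocol:
            return False
        if self.dst_port is not None and self.dst_port != pkt.dst_port:
            return False
        return True


# Exclusions first, catch-all last; the first matching rule wins.
POLICY = [
    Rule("bypass", "ospf"),      # routing updates (assumed protocol)
    Rule("bypass", "icmp"),      # ping
    Rule("bypass", "tcp", 22),   # SSH, which is already encrypted
    Rule("encrypt"),             # everything else
]


def decide(pkt: Packet, policy=POLICY) -> str:
    """Return the action of the first rule that matches the packet."""
    for rule in policy:
        if rule.matches(pkt):
            return rule.action
    raise LookupError("no rule matched; the policy needs a catch-all entry")


print(decide(Packet("tcp", 22)))    # bypass
print(decide(Packet("icmp")))       # bypass
print(decide(Packet("tcp", 443)))   # encrypt
```

Order matters: if the catch-all came first, SSH and routing would be encrypted along with everything else, and a key problem would lock you out of the routers.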

He found the problem, thank god.

A colleague had performed a change on all remote routers a few days before, to update their SSH keys. By accident it included the key servers. It took the currently generated keys, used for both SSH and GetVPN encryption, and removed them. During the morning, connectivity was lost between the two key servers and they both became master. Connectivity was then restored, but it was too late: the remote routers were still using the old keys from one key server, and connectivity was lost. (As you can imagine, it was even more technical than this, but this is all I remember.)
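The "both became master" part can be pictured with a small Python toy. It is only a model of the idea, not Cisco's key server protocol: the `KeyServer` class, the `sync` helper and the role names are invented for illustration. While the two servers can see each other, one generates the key and the other copies it; once each believes it is master, each generates its own.

```python
import os


class KeyServer:
    """Toy key server: only a master generates a new group key."""

    def __init__(self, name: str, role: str):
        self.name = name
        self.role = role          # "master" or "standby"
        self.group_key = None

    def rekey(self) -> None:
        if self.role == "master":
            self.group_key = os.urandom(16)


def sync(master: KeyServer, standby: KeyServer) -> None:
    """With the link up, the standby copies the master's key."""
    standby.group_key = master.group_key


ks1 = KeyServer("ks1", "master")
ks2 = KeyServer("ks2", "standby")

# Normal operation: one master, keys are kept in sync.
ks1.rekey()
sync(ks1, ks2)
in_sync_before = ks1.group_key == ks2.group_key

# Link between the key servers is lost: ks2 promotes itself.
ks2.role = "master"
ks1.rekey()
ks2.rekey()
in_sync_during_split = ks1.group_key == ks2.group_key

print(in_sync_before, in_sync_during_split)  # True False
```

Routers that fetched their key from ks1 and routers that fetched it from ks2 now hold different keys, which is the situation where sites stop understanding each other.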

Did you ask about Change Control? Yeah it was followed for the SSH key generation, but to be honest even with all my reading I still knew jack shit about GetVPN. The only way I really learned was when it broke.

So….make sure you lab things. Make sure you get your hands dirty, even if it is a virtual lab. It is the only way you will learn anything in life to be honest. Don’t be afraid to break things in the lab, watch what it does when it breaks and what it does when you restore it.

Don’t ever be afraid to ask for help; you will always learn something. Don’t ever give up either: if it has been broken, then it can be fixed. And don’t memorize the answers A, B, C and D, because in the real world the question hasn’t been written yet!

So, the correct answer was not C, not all of the above or even phone a friend! It was: when you are deep down in the shit, escalate and ask for help. No one can be an expert at everything!


