Thursday, February 18, 2016

Server Down

Subtitle: Don't Panic

This is a short post that will appeal to like-minded IT professionals, or perhaps to anybody with a passion who knows the diabolical consequences of their /baby/ being down.

In the simplest of terms, one IT professional to another:

Server Down 

If you are not into IT, here are some analogies.

- You are a car person and your favourite car suddenly breaks down
- You are a passionate cyclist and your bicycle gets crunched
- You are a runner and you suddenly develop severe pains that prevent all running

Do you get the idea?

Well, the moderately good news is that I'm typing this from my restarted and now only partially sick main workstation / server. She is up and running, well, more like walking ....

So what happened?
I was /just/ performing a multiple hardware upgrade, which I had of course researched exhaustively, and for which I had purchased what I thought would be all the required components.

But 4 hours later, our main server was still being rewired, reconfigured and cleaned, and when I tried to start it, at first it did not ....

IT Solidarity

It is now 8 hours later. Things are stable. Dear IT brothers and sisters .... do you recognise this ....

Why IT's Different
IT is different from many other professions because it is now mostly 24x7. So when you tinker or upgrade, your users, your boss, and even you yourself do not like to stray outside the agreed change window.

You don't just close the door at 17.00 and come back tomorrow.   You work until you fix it!

Not all Eggs in One Basket
Our IT configuration is of course distributed. By that I mean we have backup systems, and other critical 24x7 systems, like the web server, continue to function.

Configuration Diagrams
There are configuration files to determine what was where before I started board swapping, but most of them are on this master server that I'm fiddling with! Luckily there is a backup, and so after locating that and decrypting it, I managed to find out what the initial PCIe card placement was before the upgrade, the swapping and more swapping.

US Keyboard
Whilst systems are, in theory, internationalised these days, I had to fall back on a US keyboard with a cable (i.e. not wireless, and not Swiss), as shown in the photo!

US Keyboard with PS/2 Plug
This ancient system even has PS/2 keyboard ports, and I had a suitable keyboard standing by. Some server BIOSes seem to be a little fussy on startup, so I always keep both kinds of wired US keyboard (USB and PS/2) in an emergency box.

A Mirror
At some point in the panic, rewiring in tight spaces was required. I had to use a make-up mirror (from Agata) and my Petzl Nao headtorch.

A Voltmeter
The voltmeter was useful for checking that some of the motherboard voltages really were correct. It was a matter of cross-referencing the manuals and checking the flashing LEDs, and yes ... I didn't blow anything ... yet.

A Vacuum Cleaner
When you do the power-on and not even a fan spins, it's normally time to get back to basics. This includes dust removal from every orifice, and checking for loose connections everywhere. If after that nothing works, boy are you in trouble!

No matter what the crisis, sometimes you have to step out for coffee, or multiple coffees, and perhaps a snack, and discuss the situation with a friend. Hopefully not somebody who is ever going to utter the phrase:

/Well, I told you so/

Patience & Assessment
If you ever have a total server down, or just a case of nothing working when you power on again, or if a few TB of data seem to have disappeared, you need some patience to really assess the situation. Don't make any rash decisions, and try to talk through the options.

Small Hands
At least once I had to seek the assistance of Agata, my very patient and rather clever partner. She also has very small hands, which can be useful for those critical rewires.

Backout Plan But
Even if you have a backout plan, you can always think of a number of good reasons why using it would be totally ridiculous, normally along the lines of: you've almost got it working and you /just/ need a little more time. Hopefully your project plan for the upgrade included the /yes but/ additional task time, meaning that you can eat into that before backing everything out.

I've been pretty non-specific as to what was upgraded, but let's just say that I'll report it in a future post. Suffice to say that our principal server is back up, Storage Space data is recovered, it is 2am, and so after about 9 hours of huffing and puffing, things are looking good.

Paul Rodgers: All right now