Thursday, May 3, 2018

Downtime

A few days ago I had an interesting case of a server downtime. The server is just a playground for developers, so no big deal. But still, lessons learned.

The reports came almost simultaneously from developers and from the monitoring system, "cannot connect". And indeed, the server was not pingable. Someone else's server, with IP equal to the IP of our server with the last octet increased by 2, was pingable, so I concluded it was not a network problem.

Next reaction: look at the server's screen, using remote KVM provided by the hoster. Kernel panic! OK, need to screenshot it (done) and reboot the server. Except that the Power Control submenu in the viewer is grayed out, so I can't. And a few months ago, when we needed a similar kind of reset, it was there.

OK, so I created a ticket for resetting the server manually. And I had to remind them that the remote reboot functionality is supposed to work. Here is the hoster's reply (PDU = power distribution unit):

Dear Alexander,

Upon checking on the PDU, the PDU is refusing connection.

We'll arrange a PDU replacement the soonest possible.

We apologise for the inconvenience caused.

Everybody reading this post, now, please check that you don't fall into the same trap. Run your iKVM viewer against each of your server that it can connect to, and check that it runs, and that the menu item to reset the server still exists. Create a calendar reminder to periodically recheck it.

And maybe append "panic=10" to your linux kernel command line, so that manual intervention is not needed next time.