Back in business!

July 9th, 2014, 09:59 PM

Big Mike

I have repaired the second server.

However, I am keeping it off-line for now until I can work out a better solution. I just want to say I am pissed that the data center couldn't do in 15 hours what I did in 5 minutes. Really pissed.

Mike

July 9th, 2014, 10:09 PM

artemiso

Did they make a service level agreement of sorts that applies to the situation? You could receive compensation for damages.

No. The network was fine and I own the servers, not them. I'm responsible.

Sent from my LG Optimus G Pro

July 9th, 2014, 10:13 PM

I will post a full accounting tomorrow.

Sent from my LG Optimus G Pro

July 10th, 2014, 12:05 PM

Synopsis

- July 8 at 7:22pm Eastern, both servers rebooted for security patch maintenance
- phoenix came back fine
- atlantis did not, missing bootloader

Atlantis had a drive failure in the R1 array a couple of weeks ago, and apparently when the array was resynced onto the new drive the bootloader partition was somehow missed, leaving the system unbootable.

I immediately tried to reinstall grub, but the problem is that the mdadm array was occupying 100% of both sda and sdb, with no available partition (not even 1MB) to install grub on. And XFS cannot be shrunk, so I couldn't shrink the array.
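To illustrate the kind of check that would have flagged this before the reboot, here is a minimal sketch that inspects the first sector of a disk for boot code. It assumes classic BIOS/MBR booting with a 512-byte first sector, and the "GRUB" string test is only a heuristic. The function names are made up for this example and are not from any tool mentioned in the thread.

```python
# Heuristic look at a disk's first sector (MBR) for boot code.
# Assumptions: BIOS/MBR boot, 512-byte sector. Names are hypothetical.

SECTOR_SIZE = 512
BOOT_CODE_LEN = 440            # boot code area at the start of an MBR
BOOT_SIGNATURE = b"\x55\xaa"   # last two bytes of a valid MBR


def inspect_first_sector(sector: bytes) -> dict:
    """Report whether a 512-byte sector looks like it holds a bootloader."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly 512 bytes")
    boot_code = sector[:BOOT_CODE_LEN]
    return {
        # 0x55 0xAA at offset 510 marks the sector as bootable
        "has_signature": sector[510:512] == BOOT_SIGNATURE,
        # an all-zero boot code area means nothing was ever installed
        "has_boot_code": any(boot_code),
        # heuristic only: GRUB's first-stage code embeds the text "GRUB"
        "mentions_grub": b"GRUB" in boot_code,
    }


def read_first_sector(device: str) -> bytes:
    """Read the first sector of a block device (needs root on real disks)."""
    with open(device, "rb") as f:
        return f.read(SECTOR_SIZE)
```

On a real system this could be run as `inspect_first_sector(read_first_sector("/dev/sdb"))` for each member of the array after a drive replacement. A freshly swapped drive that reports no boot code will not be able to boot the machine on its own.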

I then tried to use rescue tools to get grub to install to another media and boot atlantis. Unfortunately, I needed the data center to help physically with this. And most unfortunately, I had to go to bed immediately because I had an unalterable appointment in Guayaquil at 9am the next morning, a 3 hour drive away, which meant waking up at 4am (part of my visa process).

The data center told me they could fix this and that their Linux sysadmin could do it "shortly". I went to bed thinking it would be fixed within an hour or so. Overnight I realized their sysadmin might damage the RAID array if he didn't know what he was doing, so I asked them to clone the disk before making any changes.

Once I woke up, I noticed to my surprise that the site was still down, so I asked for an update and was told the clone was still in process. I left the house and proceeded driving to Guayaquil for my appointment.

Once I arrived, I asked for another update and was told he was working on it now.

After another hour or two I was told the admin could not fix the problem. At this time I asked @sam028 to call me, I told him what needed to be done, and he started to work on it. There was another delay at the data center while they installed a third drive in the system to load grub on to.

As I left Guayaquil to return home, @sam028 told me there was a problem and the new third drive was not being detected, so I then just asked him to scp part of the file system from atlantis to phoenix, because I thought it would be enough to get the site back up. He did (it took a while; the data is very large, even over GbE).
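For a rough sense of why a copy like this takes a while even on gigabit Ethernet, here is a back-of-the-envelope estimate. The thread does not say how much data was copied, so the size below is purely hypothetical, and the 0.9 efficiency factor is an assumption (scp is often slower in practice because of encryption overhead).

```python
def transfer_time_hours(size_gb: float,
                        link_gbit_per_s: float = 1.0,
                        efficiency: float = 0.9) -> float:
    """Estimate hours needed to copy size_gb gigabytes (10**9 bytes each).

    efficiency is the assumed fraction of the raw link rate that the copy
    actually achieves; 0.9 is a guess, not a measurement.
    """
    if size_gb < 0 or link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    size_bits = size_gb * 8e9
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return size_bits / usable_bits_per_s / 3600


# Hypothetical example: 1000 GB over GbE at 90% efficiency
hours = transfer_time_hours(1000)
print(f"{hours:.1f} hours")
```

With those assumed numbers the copy comes out at roughly two and a half hours, before counting any per-file overhead.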

@sam028 was able to get most of the site back up, and at this point I was two hours from home, where I could finish the rest.

- Site came back up July 9 at 5:01pm Eastern.

Unfortunately, last night as I was repairing the rest of things (webinars, email, etc.) another drive failed on atlantis and suddenly I was experiencing file system corruption (I've never had this happen before, ever).
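Since two drive failures are at the center of this outage, here is a minimal sketch of spotting a degraded Linux software RAID array by parsing `/proc/mdstat`. It is not what was running on atlantis; the function name and the sample text are made up for illustration, and it only looks at the `[n/m] [U_]` status line that the md driver prints for each array.

```python
import re
from typing import List

# Array header line, e.g. "md0 : active raid1 sda[0] sdb[1]"
_ARRAY_RE = re.compile(r"^(md\S*)\s*:")
# Status line, e.g. "... [2/1] [U_]" (expected/active members, per-member flags)
_STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> List[str]:
    """Return names of arrays that have fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = _STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Made-up sample in /proc/mdstat format; on a real host, read the file instead.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sda[0] sdb[1](F)
      976630336 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sdc1[0] sdd1[1]
      1048576 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""
print(degraded_arrays(SAMPLE))
```

A check like this run from cron, or mdadm's own monitor mode, gives early warning that an array is running on a single drive.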

So at this point, I think I will have to just blow atlantis away and do a clean install. In a way, it is actually not a terrible waste of time because I was thinking of dedicating one of the servers to the new futures.io (formerly BMT) 5.0 test which is coming up (search for that thread).

Bottom line, huge amount of unexpected downtime, complicated by the fact I had to be travelling at the exact same time.

Mike

July 10th, 2014, 12:09 PM

I saw in a thread this morning someone said some of his posts were missing.

I believe he is wrong.

There is zero indication that anything is missing or was lost. The MySQL database was never in jeopardy and there is no reason to believe that anything was lost or is missing.

Most likely, I believe he wrote a post during a period of about 4-5 minutes last night when a hung NFS mount on phoenix caused the site to become unresponsive. He would have received a "no response" message in his browser, and that is unfortunate, but I wouldn't call it "missing".
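A hung NFS mount is hard to notice because anything that touches it simply blocks. As a minimal sketch of a watchdog for that, assuming a plain `stat` call is enough to provoke the hang, the probe below runs in a background thread and gives up after a timeout. The function name and the mount point in the usage note are hypothetical.

```python
import os
import threading


def path_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if stat() on path completes within timeout_s seconds.

    A missing path still counts as a response; only a call that never
    returns (as with a hung NFS mount) yields False.
    """
    done = threading.Event()

    def probe() -> None:
        try:
            os.stat(path)
        except OSError:
            pass  # an error is still an answer from the file system
        done.set()

    # Daemon thread, so a probe stuck on a dead mount does not keep
    # the interpreter waiting for it at shutdown.
    threading.Thread(target=probe, daemon=True).start()
    return done.wait(timeout_s)
```

Usage would look like `path_responds("/mnt/atlantis", timeout_s=3)`, where `/mnt/atlantis` stands in for wherever the NFS export is mounted. A False result means the mount did not answer in time and the site is likely about to stall.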

That said, please report any problems in this thread.

Mike

July 10th, 2014, 02:16 PM

Guys, I also want to point out I was making updates on our futures.io (formerly BMT) twitter:

https://www.twitter.com/bigmiketrading

You can follow on Twitter and get future updates so you aren't left wondering what is going on.

Mike

Discussion in Feedback and Announcements
