- July 8 at 7:22pm Eastern, both servers rebooted for security patch maintenance
- phoenix came back fine
- atlantis did not, missing bootloader
Atlantis had a drive failure in the RAID 1 array a couple of weeks ago, and apparently when the array was resynced onto the new drive the bootloader partition was somehow missed, leaving the system unbootable.
I immediately tried to reinstall grub, but the problem is that the mdadm array was occupying 100% of both sda and sdb, with no spare partition (not even 1MB) to install grub on. And XFS cannot be shrunk, so I couldn't shrink the array.
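For anyone who wants to check their own box for the same layout, here is a minimal Python sketch that parses `/proc/mdstat` text and flags arrays whose members are whole disks (such as sda) rather than partitions (such as sda1), and arrays that are running degraded. The function name `parse_mdstat` and the sample text are my own assumptions for illustration, not output from atlantis, and the whole-disk check only recognizes sd/hd/vd style names.

```python
import re

# "md0 : active raid1 sdb[1] sda[0]" -> array name plus the rest of the line
ARRAY_LINE = re.compile(r"^(md\w+) : (.*)$")
# Member tokens look like "sda[0]" or "sdb1[1](F)"
MEMBER = re.compile(r"^(\w+)\[\d+\](?:\([A-Z]\))?$")
# Assumption: whole disks are named sdX/hdX/vdX with no trailing partition number
WHOLE_DISK = re.compile(r"^(?:sd|hd|vd)[a-z]+$")
# Status looks like "[2/1] [U_]"; an underscore marks a missing member
STATUS = re.compile(r"\[\d+/\d+\]\s+\[([U_]+)\]")


def parse_mdstat(text):
    """Return {array: {"members", "whole_disk_members", "degraded"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            name, rest = header.groups()
            members = []
            for token in rest.split():
                m = MEMBER.match(token)
                if m:
                    members.append(m.group(1))
            current = {
                "members": sorted(members),
                "whole_disk_members": sorted(
                    d for d in members if WHOLE_DISK.match(d)
                ),
                "degraded": False,
            }
            arrays[name] = current
            continue
        status = STATUS.search(line)
        if status and current is not None:
            current["degraded"] = "_" in status.group(1)
    return arrays


# Hypothetical sample, shaped like /proc/mdstat for a degraded two-disk mirror
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb[1] sda[0]
      976762496 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        print(name, info)
```

On a live system you would pass in the contents of `/proc/mdstat` instead of `SAMPLE`. A non-empty `whole_disk_members` list suggests the array was built directly on the raw disks, which is the situation described above.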
I then tried to use rescue tools to get grub installed on another medium and boot atlantis from that. Unfortunately, I needed the data center to help physically with this. And most unfortunately, I had to go to bed immediately, because I had an unalterable appointment in Guayaquil at 9am the next morning (part of my visa process). It is a 3 hour drive away, so I had to wake up at 4am.
The data center told me they could fix this, and that their Linux sysadmin could do it "shortly". I went to bed thinking it would be fixed within an hour or so. Overnight I realized their sysadmin might damage the RAID array if he didn't know what he was doing, so I asked them to clone the disk before making any changes.
When I woke up, I was surprised to see the site was still down, so I asked for an update and was told the clone was still in progress. I left the house and started driving to Guayaquil for my appointment.
Once I arrived, I asked for another update and was told he was working on it now.
After another hour or two I was told the admin could not fix the problem. At this point I asked @sam028 to call me, told him what needed to be done, and he started to work on it. There was another delay at the data center while they installed a third drive in the system to load grub onto.
As I left Guayaquil to return home, @sam028 told me there was a problem: the new third drive was not being detected. So I asked him to just scp part of the file system from atlantis to phoenix, because I thought it would be enough to get the site back up. He did (it took a while; it's very large, even over GbE).
@sam028 was able to get most of the site back up while I was still two hours from home, where I could finish the rest.
- Site came back up July 9 at 5:01pm Eastern.
Unfortunately, last night as I was repairing the rest of things (webinars, email, etc.), another drive failed on atlantis and suddenly I was experiencing file system corruption (I've never had this happen before, ever).
So at this point, I think I will have to just blow atlantis away and do a clean install. In a way, it is actually not a terrible waste of time because I was thinking of dedicating one of the servers to the new futures.io (formerly BMT) 5.0 test which is coming up (search for that thread).
Bottom line, huge amount of unexpected downtime, complicated by the fact I had to be travelling at the exact same time.
I saw in a thread this morning someone said some of his posts were missing.
I believe he is wrong.
There is zero indication that anything is missing or was lost. The MySQL database was never in jeopardy.
Most likely, I believe he may have written a post during a period last night when a hung NFS mount on phoenix caused the site to become unresponsive for about 4-5 minutes. It would have meant he received a "no response" message in his browser, and that is unfortunate, but I wouldn't call it "missing".
That said, please report any problems in this thread.
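On the hung NFS mount mentioned above: a process that touches a hung mount can block indefinitely, so a common way to detect one is to probe the path from a child process and give up after a timeout. Below is a minimal Python sketch of that idea. The function name `path_responds` and the 5 second default are my own assumptions, not what was actually running on phoenix.

```python
import subprocess
import sys


def path_responds(path, timeout=5.0):
    """Return True if a stat of `path` succeeds within `timeout` seconds.

    The stat runs in a child process so that a hung mount blocks the
    child rather than the caller.
    """
    probe = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(
        probe, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        # Exit code 0 means os.stat() returned without raising
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may not
        # exit until the mount recovers, so we do not wait for it here.
        proc.kill()
        return False


if __name__ == "__main__":
    # Hypothetical usage: pass the mount point to check as the first argument
    target = sys.argv[1] if len(sys.argv) > 1 else "/"
    print(target, "responds" if path_responds(target) else "is not responding")
```

A check like this only tells you the mount is not answering. It does not distinguish a hung NFS server from a path that simply does not exist, since both return False.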