Rebuild Software RAID 1

OK, not sure what I’ve done. This is a bare-metal server. I have:

/dev/md0 mirroring /dev/sda1 and /dev/sdb1
/dev/md1 mirroring /dev/sda5 and /dev/sdb5
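
I’ve been checking their state with the usual commands:

```bash
# Kernel view of all md arrays: member state and any resync progress
cat /proc/mdstat

# Per-array detail: state, member devices, failed/spare counts
mdadm --detail /dev/md0
mdadm --detail /dev/md1
```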

Then:

  1. I forced a failure on the RAID devices to run some tests and learn how rebuilds work (roughly the fail/re-add cycle sketched after this list).
  2. Added everything back to the array.
  3. Ran some disk benchmarks again. The benchmark couldn’t be cancelled and the terminal stopped responding; the console showed some errors, so I rebooted.
  4. Checked the RAID status after the reboot: sdb5 was marked as failed.
  5. The support team told me I’d better check the status from a live CD.
  6. Using an Ubuntu live CD I stopped both md0 and md1 (not sure why), and now I can’t start them again. What I’d expect to bring them back is sketched further below. At some point I also ran this:
```
root@ubuntu:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ... OK

Disk /dev/sdb: 223.58 GiB, 240057409536 bytes, 468862128 sectors
Disk model: EDGE SE847-V SSD
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xa34de2a8

Old situation:

Device     Boot   Start       End   Sectors   Size Id Type
/dev/sdb1  *       2048   1953791   1951744   953M fd Linux raid autodetect
/dev/sdb2       1955838 468860927 466905090 222.7G  5 Extended
/dev/sdb5       1955840 468860927 466905088 222.7G fd Linux raid autodetect

Partition 2 does not start on physical sector boundary.

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0xa34de2a8.
/dev/sdb1: Created a new partition 1 of type 'Linux raid autodetect' and of size 953 MiB.
Partition #1 contains a linux_raid_member signature.
/dev/sdb2: Created a new partition 2 of type 'Extended' and of size 222.7 GiB.
/dev/sdb3: Created a new partition 5 of type 'Linux raid autodetect' and of size 222.7 GiB.
Partition #5 contains a linux_raid_member signature.
/dev/sdb6: Done.

New situation:
Disklabel type: dos
Disk identifier: 0xa34de2a8

Device     Boot   Start       End   Sectors   Size Id Type
/dev/sdb1  *       2048   1953791   1951744   953M fd Linux raid autodetect
/dev/sdb2       1955838 468860927 466905090 222.7G  5 Extended
/dev/sdb5       1955840 468860927 466905088 222.7G fd Linux raid autodetect

Partition 2 does not start on physical sector boundary.

The partition table has been altered.
Calling ioctl() to re-read partition table.
Re-reading the partition table failed.: Device or resource busy
The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8).
Syncing disks.
```
  7. Asked for help at HostBalls. I want my RAID working again with the data I have on /dev/sda.
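
For steps 1 and 2, the fail/re-add cycle was along these lines (a sketch from memory; md1/sdb5 as in my layout above, and the same for md0/sdb1):

```bash
# Mark one mirror half as failed, then pull it out of the array
mdadm /dev/md1 --fail /dev/sdb5
mdadm /dev/md1 --remove /dev/sdb5

# Add it back; md rebuilds it from the surviving half (/dev/sda5)
mdadm /dev/md1 --add /dev/sdb5

# Watch the rebuild progress
cat /proc/mdstat
```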

I could go the fastest way and reinstall everything, but I need to learn this in case of a future disaster.
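
For step 6, what should bring the stopped arrays back from the live CD is something like this (I haven’t been able to confirm it on this box):

```bash
# Scan member superblocks and assemble whatever arrays they describe
mdadm --assemble --scan

# Or assemble explicitly from the known members
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1
mdadm --assemble /dev/md1 /dev/sda5 /dev/sdb5

# If the sdb halves are suspect, start each array degraded from sda only
mdadm --assemble --run /dev/md0 /dev/sda1
mdadm --assemble --run /dev/md1 /dev/sda5
```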

Sounds like that disk is dying.

Were you able to check smartctl stats when in rescue/live CD mode?
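
Something like this from the live environment is enough (plain smartctl; the grep filter is just a convenience):

```bash
# Full SMART report: overall health, attribute table, error log
smartctl -a /dev/sda
smartctl -a /dev/sdb

# Quick filter for the attributes that usually flag a dying disk
smartctl -A /dev/sdb | grep -Ei 'realloc|pending|uncorrect|power_on'
```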

It’s relatively new.

Will try later, but last time I didn’t see anything interesting.

Ask for a disk change; don’t waste time on it. You can already see issues with the sectors. If this is a cheapo dedi, you know… under 60/mo, I would suggest you stay away from HDDs, since they’re usually worn out / have lots of usage and might not be enterprise grade.

Also, RAID 1 resync with HDDs is so effin painful; your server performance will be stupidly bad until it finishes syncing. Your good disk might die trying to resync to the other one if they both have lots of usage on them.
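
If you do have to sit through a resync, you can at least watch it and cap its impact (standard md sysctls; the cap below is just an example value):

```bash
# Progress and current speed of any running resync
cat /proc/mdstat

# Kernel-wide md resync throttles, in KiB/s
sysctl dev.raid.speed_limit_min
sysctl -w dev.raid.speed_limit_max=50000   # example cap, tune to your hardware
```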

Is it 100% sure it’s a bad disk? It was replaced recently.

You don’t have to be 100% sure. That’s the provider’s responsibility: you report a bad disk, they swap it. Sometimes providers do verify before doing the swap. Better safe than sorry.

It’s a colocated server.

Oh well, that’s another story… I hate colocating because of that. It means you need to have enough spare parts and pay for remote hands too.

If it’s going to cost you, better run all the diagnostic tools first, I guess, unless it’s production and you don’t have the luxury of downtime.

sda: root@ubuntu:~# smartctl -a /dev/sda (full output on Pastebin.com)
sdb: root@ubuntu:~# smartctl -a /dev/sdb (full output on Pastebin.com)

There is an error logged in both; they appeared when I restarted the server while this was happening.

I’m not particularly an expert on SMART logs, but it seems they both passed. However, sda is the newer disk, since it has only 10 hours of usage; that means sdb is the disk that didn’t fail in the old RAID 1 set. (Assuming you purchased these disks new.)

And sdb is the one producing the errors. If I’m right, maybe sdb is the one failing. Did you swap the correct disk?

Is that SSD brand decent? This is the second time I’ve read a mention of EDGE SSDs.

Everything is second-hand.

All was fine until I started playing with the RAID config. Maybe something went wrong and sdb got corrupted?

Now I want to try to restore the RAID. Actually the data is not important, but I’d like to try.

Did you ever complete a RAID 1 rebuild?

Yes. But then it failed.

If you guys know what steps I have to follow to rebuild everything using the data on /dev/sda, I’d really appreciate it.

There is nothing wrong with those SSDs. Without knowing the exact status of your RAID right now, and the commands you ran before to make it fail, probably no one can really help you.

If you forced an error on only one of the drives/partitions, essentially you need to remove it from your array, clean the MBR and RAID metadata so it can be added back like an empty disk, and let it rebuild from the still-living half.

That only works for a degraded RAID, though. If it has already failed completely, then you have a bigger problem.

In general I think it’s a good idea to think and learn about recovery and even test it, so you are prepared just in case. Anyway, I think your strategy on a degraded RAID 1 should always be to move the data out first, and only after that maybe try to restore/replace the disk.
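
Concretely, that sequence with mdadm would look roughly like this, assuming the layout from the first post (md0 on sda1/sdb1, md1 on sda5/sdb5); treat it as a sketch and double-check device names before every destructive step:

```bash
# 1. Get the arrays running degraded from the good halves on sda
mdadm --assemble --run /dev/md0 /dev/sda1
mdadm --assemble --run /dev/md1 /dev/sda5

# 2. Copy sda's partition table to sdb (as already attempted with sfdisk),
#    then make the kernel re-read the new table
sfdisk -d /dev/sda | sfdisk /dev/sdb
partprobe /dev/sdb

# 3. Wipe the old md metadata on sdb so it joins as an empty member
mdadm --zero-superblock /dev/sdb1
mdadm --zero-superblock /dev/sdb5

# 4. Add the blank partitions back; md rebuilds them from sda
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb5

# 5. Watch until both arrays show [UU] again
cat /proc/mdstat
```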


I’d want to know the exact commands, to exclude human error in the rebuild. These are things better studied in a homelab than on a remote box, so you can control the conditions. Once upon a time I pulled a bad drive and rebuilt the array with a good drive from a decommissioned server, but flipped the syntax and marked the present drive as failed in mdadm. I caught it after a couple of minutes, but the damage was irreparable.
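
For what it’s worth, the checks I’d now run before any --fail/--remove, so a flipped argument can’t slip through:

```bash
# Confirm which member is which before touching the array
mdadm --detail /dev/md0          # members and their current state
ls -l /dev/disk/by-id/           # maps sdX names to disk serial numbers
smartctl -i /dev/sdb             # model/serial of the disk you believe is bad

# Only then mark the verified member as failed
mdadm /dev/md0 --fail /dev/sdb1
```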


I got bored and I’m already reinstalling the server :roll_eyes:

If it’s not puking storage errors after the reinstall, all signs point to human error :thinking:

The only way to remedy that is to try, try again :slight_smile:


I learnt a lot about sysadmin in the early days by human error / deliberately breaking things, and then having to fix it.