Checked the RAID status and sdb5 was marked as failed.
The support team told me I’d better check the status using a live CD.
Using an Ubuntu live CD I stopped both md0 and md1. Not sure why. I can’t start them again. As far as I remember the stop went roughly like the sketch below, and at some point I also ran the sfdisk copy whose output follows.
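(From memory, so treat this as approximate; I didn’t save the exact commands or output.)
cat /proc/mdstat          # check which arrays the kernel sees and which members are marked (F)
mdadm --stop /dev/md0     # stop both arrays so nothing holds the disks
mdadm --stop /dev/md1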
root@ubuntu:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ... OK
Disk /dev/sdb: 223.58 GiB, 240057409536 bytes, 468862128 sectors
Disk model: EDGE SE847-V SSD
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xa34de2a8
Old situation:
Device     Boot   Start       End   Sectors   Size  Id  Type
/dev/sdb1 * 2048 1953791 1951744 953M fd Linux raid autodetect
/dev/sdb2 1955838 468860927 466905090 222.7G 5 Extended
/dev/sdb5 1955840 468860927 466905088 222.7G fd Linux raid autodetect
Partition 2 does not start on physical sector boundary.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0xa34de2a8.
/dev/sdb1: Created a new partition 1 of type 'Linux raid autodetect' and of size 953 MiB.
Partition #1 contains a linux_raid_member signature.
/dev/sdb2: Created a new partition 2 of type 'Extended' and of size 222.7 GiB.
/dev/sdb3: Created a new partition 5 of type 'Linux raid autodetect' and of size 222.7 GiB.
Partition #5 contains a linux_raid_member signature.
/dev/sdb6: Done.
New situation:
Disklabel type: dos
Disk identifier: 0xa34de2a8
Device     Boot   Start       End   Sectors   Size  Id  Type
/dev/sdb1 * 2048 1953791 1951744 953M fd Linux raid autodetect
/dev/sdb2 1955838 468860927 466905090 222.7G 5 Extended
/dev/sdb5 1955840 468860927 466905088 222.7G fd Linux raid autodetect
Partition 2 does not start on physical sector boundary.
The partition table has been altered.
Calling ioctl() to re-read partition table.
Re-reading the partition table failed.: Device or resource busy
The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8).
Syncing disks.
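If I read that last message right, the kernel was still using the old table at that point, so picking up the new one without a reboot would be something like this (partprobe comes from the parted package, I believe):
partprobe /dev/sdb    # ask the kernel to re-read the partition table on sdb
lsblk /dev/sdb        # check that sdb1, sdb2 and sdb5 now show up as expected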
Asked for help at HostBalls. I want my RAID working again with the data I have on /dev/sda.
I could go the fastest way and just reinstall everything, but I need to learn how to recover in case of a future disaster.
Ask for a disk change. Don’t waste time on it; you can already see issues with the sectors. If this is a cheapo dedi, you know… under 60/mo, I would suggest you stay away from HDDs, since they are usually worn out / have lots of usage and might not be enterprise grade.
Also, RAID resync on HDDs is so effin painful with RAID 1; your server performance will be stupidly bad until it finishes syncing. Your good disk might die trying to resync to the other one if they both have lots of usage on them.
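If you do end up resyncing, you can at least watch the progress and raise the kernel’s rebuild speed limits; a rough sketch, the value below is just an example:
cat /proc/mdstat                                            # shows resync progress and an ETA per array
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max    # current rebuild limits in KiB/s
sysctl -w dev.raid.speed_limit_max=200000                   # example value only; lets the rebuild run faster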
You don’t have to be 100% sure. That’s the provider’s responsibility: you report a bad disk, they swap it. Sometimes providers do verify before doing the swap. Better safe than sorry.
I’m not particularly an expert on SMART logs, but it seems they both passed. However, sda seems to be the newer disk, since it has only 10 hours of usage, which means sdb is the disk from the old RAID 1 set that didn’t fail (assuming you purchased these disks as new).
And sdb is the one producing the errors. If I’m right, maybe sdb is the one failing? Did you swap the correct disk?
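Worth pulling the full SMART data for both and comparing power-on hours and error counters; something like this, assuming smartmontools is installed:
smartctl -a /dev/sda    # look at Power_On_Hours, Reallocated_Sector_Ct and the error log
smartctl -a /dev/sdb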
Is that SSD brand decent? This is the second time I’ve seen EDGE SSDs mentioned.
There is nothing wrong with those SSDs. Without knowing the exact status of your RAID right now, and the commands you ran before it failed, probably no one can really help you.
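To give people something to work with, post the output of roughly this (partition names guessed from your sfdisk dump above):
cat /proc/mdstat                                           # what the kernel currently thinks of md0/md1
mdadm --detail /dev/md0 /dev/md1                           # per-array state (will complain if the arrays are stopped)
mdadm --examine /dev/sda1 /dev/sda5 /dev/sdb1 /dev/sdb5    # RAID metadata on each member partition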
If you forced an error on only one of the drives/partitions, you essentially need to remove it from your array, clean the MBR / RAID metadata so it can be added back to your RAID like an empty disk, and have it rebuild from the still-living part.
That only works for a degraded RAID, though. If it has already failed completely, then you have a bigger problem.
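Roughly like this, assuming md1 is the data array and sdb5 is the member that got kicked out; the device names are only taken from your output above, so double-check them before running anything:
mdadm --manage /dev/md1 --remove /dev/sdb5   # drop the failed member, if it is still listed
mdadm --zero-superblock /dev/sdb5            # wipe the old RAID metadata so it looks like a fresh member
mdadm --manage /dev/md1 --add /dev/sdb5      # re-add it and let it rebuild from the good half
cat /proc/mdstat                             # watch the rebuild progress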
In general I think it’s a good idea to think about and learn recovery, and even test it, so you are prepared just in case. Anyway, I think your strategy on a degraded RAID 1 should always be to move the data out first, and only after that maybe try to restore/replace the disk.
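Moving the data out can be as simple as assembling the degraded array from the good half alone and mounting it read-only; a sketch, assuming md1 holds a plain filesystem, sda5 is the surviving member and /mnt is just a placeholder mount point:
mdadm --assemble --run /dev/md1 /dev/sda5   # start the array degraded, from the surviving member only
mount -o ro /dev/md1 /mnt                   # mount read-only and copy everything somewhere safe first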
I’d want to know the exact commands, to exclude human error in the rebuilding. These are things better studied in a homelab than remotely, so you can control the conditions. Once upon a time I pulled a bad drive, rebuilt the array with a good drive from a decommissioned server, but flipped the syntax and marked the present drive as failed in mdadm. I caught it after a couple of minutes, but the damage was irreparable.
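For whoever hits this later, the step that bit me was naming the wrong device. A sketch of the safer order, with placeholder names; verify the device against its serial before touching anything:
mdadm --detail /dev/md1                      # confirm which member is actually the bad one
ls -l /dev/disk/by-id/ | grep sdb            # match the kernel name to the physical drive's serial
mdadm --manage /dev/md1 --fail /dev/sdb5     # mark the BAD member failed; get this argument wrong and you fail the good one
mdadm --manage /dev/md1 --remove /dev/sdb5   # then remove it before pulling the drive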