MariaDB Server Unresponsive

I have a server that suddenly freezes and stays frozen until I reboot it. It’s a VM running on Proxmox.

  • Changed from CentOS 7 to 8.
  • Changed from MariaDB 10.5 to 10.4.
  • Reinstalled Proxmox.

And it keeps happening.

This is a Proxmox host running on a Ryzen VPS.

Any ideas where to look?

Waaa, downgrading from MariaDB 10.5 to MariaDB 10.4? They introduced a new auth system. So you’re a hero, I wouldn’t do that myself. 😛

Check the logs? 🙂

Any idea where to find them?

MariaDB [(none)]> show global variables like 'log_error';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_error     |       |
+---------------+-------+

This is the content of /var/log/

Well, maybe it’s the host?
Or you have queries that lock up your database server.

You haven’t specified an error log location in my.cnf (that’s why log_error shows up empty). You need to set one.
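
A minimal sketch of setting the error log through a drop-in file, assuming the server reads /etc/my.cnf.d/ (common on CentOS, but check my.cnf for an !includedir line). The helper name write_error_log_conf, the file name error-log.cnf, and the log path are made up for illustration; only log_error comes from the thread.

```shell
# Hypothetical helper: write a small drop-in file that sets log_error.
# $1 = path of the drop-in file, $2 = where the error log should go.
write_error_log_conf() {
    conf="$1"
    logfile="$2"
    cat > "$conf" <<EOF
[mysqld]
log_error = ${logfile}
EOF
}

# On the server, as root (both paths are assumptions, adjust to your layout):
#   write_error_log_conf /etc/my.cnf.d/error-log.cnf /var/log/mariadb/mariadb.log
#   systemctl restart mariadb
```

The log directory has to exist and be writable by the user MariaDB runs as. After the restart, show global variables like 'log_error'; should return the path instead of an empty value.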

You can also check live queries using mtop or show processlist;
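
If the freeze tends to happen while nobody is watching, one option is to dump the process list to a file at intervals and read it afterwards. A rough sketch: snapshot_processlist and the output path are hypothetical names, and it assumes the mysql client can log in without prompting (for example through ~/.my.cnf).

```shell
# Hypothetical helper: append a timestamped SHOW FULL PROCESSLIST to a file.
# $1 = file to append to.
snapshot_processlist() {
    out="$1"
    {
        date '+%Y-%m-%d %H:%M:%S'
        mysql -e 'SHOW FULL PROCESSLIST'
        echo
    } >> "$out"
}

# Example loop, one snapshot every 30 seconds (run it inside screen or tmux):
#   while true; do snapshot_processlist /var/log/processlist.log; sleep 30; done
```

The last snapshot before the gap in timestamps shows which queries were running shortly before the lockup, if any.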

more information needed…

  • what do you mean by frozen: the service or the whole VM?
  • is it still accessible via ssh or vnc in that state or not?
  • can you match the timestamp to any syslog entries, and is there anything after the freeze?
  • what else is running on it other than mariadb?
  • which hardware specs/options did you use, esp. for disk (thin-lvm, raid, virtio, scsi etc.?), network (virtio, local vs public ip etc.), memory (swap available?)
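
For the syslog question in the list above, a small sketch that pulls the last few entries stamped at or before a given time out of a classic syslog file. The helper name lines_before is made up, and it assumes the traditional "Mon DD HH:MM:SS" line prefix as written by rsyslog to /var/log/messages.

```shell
# Hypothetical helper: print the last N lines from one day that are stamped
# at or before a given time.
# $1 = log file, $2 = month (e.g. Sep), $3 = day, $4 = HH:MM:SS, $5 = line count
lines_before() {
    awk -v mon="$2" -v day="$3" -v t="$4" \
        '$1 == mon && $2 + 0 == day + 0 && $3 <= t' "$1" | tail -n "$5"
}

# Example (date and time are placeholders for the moment of the freeze):
#   lines_before /var/log/messages Sep 30 11:10:00 50
```

Anything logged after that point on the same day normally comes from the next boot, so the interesting part is whatever the system managed to write just before it stopped.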

Thanks will do.

Do you mean mytop? Installing… but the problem happens mostly when I’m not at the computer.

The whole VM.

No SSH connection. Ping stops. VNC shows the login screen and nothing can be done there.

Can’t find /var/log/messages in CentOS 8, I’m going to read where it should be in this version.

New Relic monitoring agent, qemu agent. But it happened when I was using CentOS 7 with MariaDB 10.5 without those agents.

Host (it’s a VM too):

Guest:

Plenty of RAM and swap usage at zero on both. No high load detected.

I will update when I set up the logs correctly.

Install atop with a 30 second granularity (/etc/sysconfig/atop) to see what’s going on. Sounds like you’ve got a spike in load that you’re misattributing to MariaDB.

Next time it goes down, flip through the log with atop -r /var/log/atop/atop_2020MMDD to see what was happening prior to the lockup.
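
A small sketch for opening the right day’s log. It assumes atop’s usual /var/log/atop/atop_YYYYMMDD file naming and GNU date; the helper name atop_log_for and the example date and time are placeholders.

```shell
# Hypothetical helper: build the atop log path for a given date.
# $1 = a date GNU date understands, e.g. 2020-09-30
atop_log_for() {
    printf '/var/log/atop/atop_%s\n' "$(date -d "$1" +%Y%m%d)"
}

# Example: replay that day's log starting shortly before the suspected freeze.
#   atop -r "$(atop_log_for 2020-09-30)" -b 10:45
# Inside atop, press t to step forward one sample and T to step back.
```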

Should I run systemctl enable/start atop after installing?

Yup, after dropping LOGINTERVAL from 600 to 30. systemctl enable --now atop does both.
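
Put together, the change could look like the sketch below. It assumes /etc/sysconfig/atop contains a plain LOGINTERVAL=600 line and that GNU sed is available; set_atop_interval is a made-up helper name.

```shell
# Hypothetical helper: set atop's sampling interval in a sysconfig-style file.
# $1 = config file, $2 = interval in seconds
set_atop_interval() {
    conf="$1"
    seconds="$2"
    # Rewrite the whole LOGINTERVAL line, whatever its current value is.
    sed -i "s/^LOGINTERVAL=.*/LOGINTERVAL=${seconds}/" "$conf"
}

# On the server, as root:
#   set_atop_interval /etc/sysconfig/atop 30
#   systemctl enable --now atop
```

If atop was already running before the edit, restart it so the new interval takes effect.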

so you are using nested virt? (I think I now remember something in the other thread about the IPs…)
do you have other VMs running in parallel in your proxmox?

my bet would be on something like hitting IO limits or whatever. system not able to read or write properly anymore… mariadb does not have to be the direct cause, as @nem already wrote. it could simply add to the problem.

do you know what the underlying storage system is on the real hostnode? how did you set up your storage? zfs? thin-lvm? or just plain ext storage?

Nooice.

Did I use “nooice” correctly?

Yes to all, I have a web server in another guest.

Not sure. Maybe @seriesn can comment something.

I installed Debian and Proxmox on top, a big partition.

No idea about LVM stuff, that’s something I still need to learn.

Looks like LVM. Does anything happen when you boot via rescue mode?

It’s already a production server, I can’t restart. I will need to migrate the database.

I will try to configure the logs properly in a few hours; maybe they will catch something useful if it happens again. Fortunately I have another little server to use temporarily.

No, mtop, to monitor the MySQL queries.

I reinstalled again last night. It’s now running CentOS 7 and MariaDB 10.5.

All logs recommended here are configured and it’s running Nixstats agent too.

Let’s see how it goes.

Thank you, everyone!

Make sure you have a syslog daemon like rsyslog installed and running.

It happened again 🙁

This is what I got from the console:

Output of /var/log/messages is on Pastebin.com (the paste begins with “Sep 30 08:01:01 db3 systemd: Created slice User Slice of root.”). The server went down at 11:09 AM.

First screenshot is a kernel panic. Is your microcode up to date? Firmware? Anything tasty in /var/log/boot.log? I’ve seen bad memory, for example, result in sporadic panics under load. In one case, these lines were enough to deduce that the memory was bad.

[    0.000000]  gran_size: 64K     chunk_size: 16M     num_reg: 10      lose cover RAM: 238M
[    0.000000] *BAD*gran_size: 64K     chunk_size: 32M     num_reg: 10      lose cover RAM: -18M
[    0.000000] *BAD*gran_size: 64K     chunk_size: 64M     num_reg: 10      lose cover RAM: -18M
[    0.000000] *BAD*gran_size: 64K     chunk_size: 128M     num_reg: 10      lose cover RAM: -16M
[    0.000000] *BAD*gran_size: 64K     chunk_size: 256M     num_reg: 10      lose cover RAM: -16M
[    0.000000] *BAD*gran_size: 64K     chunk_size: 512M     num_reg: 10      lose cover RAM: -16M
[    0.000000] *BAD*gran_size: 64K     chunk_size: 1G     num_reg: 10      lose cover RAM: -512M
[    0.000000] *BAD*gran_size: 64K     chunk_size: 2G     num_reg: 10      lose cover RAM: -1536M

Client had memory replaced and his server has been humming ever since.

Not sure about that. At least there are no package updates available on either the host or the guest.

Everything says OK. The output of cat /var/log/boot.log is on Pastebin.com (the paste begins with “[ OK ] Started Show Plymouth Boot Screen.”).

Is it possible there are memory problems? This failing server is a VM, another VM on the same host is running completely fine. Host is a VM too.