36-Hour Vacation from Web Serving
Upgraded Friendica to the latest stable release, which came out this weekend, and figured I'd do a round of upgrading everything else for fun. Except my FreeBSD pkg upgrade command bombed out during its PHP upgrade and hard-locked its VM guest inside Proxmox, leading to a 36-hour outage of my home server.
Initially, it seemed that my ol' FreeBSD guest VM had a kernel panic, causing the UFS file system to corrupt a directory inode.
panic: ufs_dirbad: bad dir
Which sucked, because repairing the problem meant removing that inode, which wiped out /usr/local/lib.
[root@irev]#0:~# sync
Aug 19 00:56:38 irev kernel: (da0:vtscsi0:0:0:0): WRITE(10). CDB: 2a 00 10 24 fe a8 00 00 c0 00
Aug 19 00:56:38 irev kernel: (da0:vtscsi0:0:0:0): CAM status: SCSI Status Error
Aug 19 00:56:38 irev kernel: (da0:vtscsi0:0:0:0): SCSI status: Reservation Conflict
Aug 19 00:56:38 irev kernel: (da0:vtscsi0:0:0:0): Error 5, Unretryable error
Aug 19 00:56:38 irev kernel: g_vfs_done():da0p2[WRITE(offset=112292737024, length=8192)]error = 6
Aug 19 00:56:38 irev kernel: g_vfs_done():da0p2[WRITE(offset=112292745216, length=8192)]error = 6
Aug 19 00:56:38 irev kernel: UFS: forcibly unmounting /dev/da0p2 from /
Aug 19 00:56:38 irev kernel: g_vfs_done():da0p2[WRITE(offset=112292753408, length=4096)]error = 6
Aug 19 00:56:38 irev kernel: g_vfs_done():da0p2[WRITE(offset=112376184832, length=8192)]error = 6
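For anyone squinting at the CDB bytes in the log above, here's a rough sketch of decoding a WRITE(10) command in plain sh. In the WRITE(10) layout, byte 0 is the opcode (0x2a), bytes 2-5 hold the big-endian logical block address, and bytes 7-8 hold the transfer length in blocks. The 512-byte block size is my assumption, since the log doesn't state it, and decode_write10 is just a made-up helper name.

```shell
#!/bin/sh
# Decode a SCSI WRITE(10) CDB given as ten space-separated lowercase hex bytes.
# BLOCK_SIZE is an assumption (512-byte logical blocks); the log does not say.
BLOCK_SIZE=${BLOCK_SIZE:-512}

decode_write10() {
    # Split the argument string into positional parameters $1..$10
    set -- $1
    if [ "$#" -ne 10 ] || [ "$1" != "2a" ]; then
        echo "not a WRITE(10) CDB" >&2
        return 1
    fi
    # Bytes 2-5 (counting from 0): big-endian logical block address
    lba=$(( 0x$3$4$5$6 ))
    # Bytes 7-8: transfer length in blocks
    blocks=$(( 0x$8$9 ))
    echo "lba=$lba blocks=$blocks bytes=$(( blocks * BLOCK_SIZE ))"
}

# The CDB from the first log line
decode_write10 "2a 00 10 24 fe a8 00 00 c0 00"
# prints: lba=270859944 blocks=192 bytes=98304
```

So the failed write was 192 blocks starting at block 270859944 of da0, assuming those 512-byte blocks.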
I'm still confused about how a virtualized/software fake filesystem can have bad disk drive blocks, and even more so when it's sitting in a ZFS RAID. Shouldn't ZFS have caught that one of the physical SSDs was crapping out and writing incorrectly? Or caught it during the weekly ZFS scrub?
Realizing that my pair of 5-year-old SATA SSD drives were well over double their wear life, I rolled out and picked up a pair of 1TB SSD drives. Thanks to succinct directions in a couple of blog posts out there, I was able to replace both old SSD drives with the new ones, grow my ZFS space, and resilver everything easily, in under a half-hour.
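The swap went roughly like the sketch below. This is a dry run that only prints the zpool commands. The pool name (tank) and the device names are hypothetical placeholders, not my actual setup, and it assumes a plain two-disk mirror that isn't also the boot pool. zpool wait needs OpenZFS 2.0 or newer; on older versions you'd watch zpool status by hand instead.

```shell
#!/bin/sh
# Dry-run sketch: prints the zpool commands instead of executing them.
# POOL and the device names below are hypothetical placeholders.
POOL=${POOL:-tank}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

replace_mirror_disk() {
    # $1 = old device, $2 = new device
    run zpool replace "$POOL" "$1" "$2"
    # Block until the resilver finishes before touching the next disk
    run zpool wait -t resilver "$POOL"
}

# Let the pool grow by itself once every disk in the mirror is bigger
run zpool set autoexpand=on "$POOL"

# Replace one disk at a time so the mirror always keeps a good copy
replace_mirror_disk old-ssd-1 new-ssd-1
replace_mirror_disk old-ssd-2 new-ssd-2

# Check health and the new size
run zpool status "$POOL"
run zpool list "$POOL"
```

Setting DRY_RUN=0 would run the commands for real, so double-check the device names before doing that.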
Even with the new physical drives, the FreeBSD guest VM's virtual "disk" was still corrupted. I wasn't able to run a badblocks or smartctl scan of the poor ol' disk image, and anytime I tried to re-install anything that was toast because /usr/local/lib no longer had its supporting files, the kernel would panic again.
I had to spin up a fresh new FreeBSD guest VM server in my Proxmox host, and pull together all the services, programs, and stuff that makes my server do the stuff I want. However, I couldn't copy configuration or home files over to the new box, because the old box couldn't run rsync anymore. With patient hand-holding, I recompiled rsync and brought over 300GB of files and configs. Once the new box was behaving nicely, I told Proxmox to swap the virtio "ethernet" MAC addresses around between the two guests, and the new FreeBSD guest took over.
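A minimal sketch of the copy and the MAC swap, again as a dry run that only prints commands. The VM IDs, MAC addresses, host name, and bridge name are all hypothetical placeholders. One thing to watch: qm set --net0 replaces the whole net0 definition, so the bridge (and any other options on that NIC) has to be repeated.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands rather than running them.
# VM IDs, MAC addresses, host name, and bridge are hypothetical placeholders.
OLD_VMID=100
NEW_VMID=101
OLD_MAC="AA:BB:CC:00:00:01"
NEW_MAC="AA:BB:CC:00:00:02"
BRIDGE="vmbr0"

# Print the two qm commands that exchange the guests' virtio MAC addresses.
swap_macs() {
    echo "qm set $OLD_VMID --net0 virtio=$NEW_MAC,bridge=$BRIDGE"
    echo "qm set $NEW_VMID --net0 virtio=$OLD_MAC,bridge=$BRIDGE"
}

# On the new guest: pull home directories from the old one, preserving
# permissions, hard links, and numeric ownership.
echo "rsync -aH --numeric-ids root@old-guest:/home/ /home/"

# On the Proxmox host, ideally while both guests are stopped:
swap_macs
```

Swapping MACs instead of changing IPs means anything keyed on the MAC, like a DHCP reservation, follows the new guest without being touched.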
I ran into some challenges with Apache and PHP not running my users' websites under their own user names & permissions. This took a while for me to hammer out because of snags in the new installation that I didn't remember from several years ago when I assembled the original server. There's a thing in /etc/make.conf that needed to be configured for compiling mod_suexec, plus figuring out how to compile Apache to pre-fork workers for each user, and compiling mod_fcgid all special. It took forever for me to find the right combination, grrr.
36 hours of downtime. I'm a little disappointed in myself that all this took so long to iron out, but hurray that I was able to pull it off, 'specially after so many years of me being out of the computars hackings field.