<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Post Incident Review on Mike Bell - Blog &amp; Stuff</title><link>https://mikebell.io/categories/post-incident-review/</link><description>Recent content in Post Incident Review on Mike Bell - Blog &amp; Stuff</description><generator>Hugo -- gohugo.io</generator><language>en</language><managingEditor>hello@mikebell.io (Mike Bell)</managingEditor><webMaster>hello@mikebell.io (Mike Bell)</webMaster><copyright>© 2026 Mike Bell</copyright><lastBuildDate>Mon, 12 May 2025 13:34:31 +0000</lastBuildDate><atom:link href="https://mikebell.io/categories/post-incident-review/index.xml" rel="self" type="application/rss+xml"/><item><title>Homelab Post Incident Review 11/05/25</title><link>https://mikebell.io/posts/homelab-post-incident-review-11-05-25/</link><pubDate>Mon, 12 May 2025 13:34:31 +0000</pubDate><author>hello@mikebell.io (Mike Bell)</author><guid>https://mikebell.io/posts/homelab-post-incident-review-11-05-25/</guid><description>
&lt;blockquote>
&lt;p>Since it&amp;rsquo;s important to practice what you preach (apparently), here&amp;rsquo;s my post incident report on a P1 homelab failure.&lt;/p>&lt;/blockquote>
&lt;h2 class="relative group">Timeline
&lt;div id="timeline" class="anchor">&lt;/div>
&lt;span
class="absolute top-0 w-6 transition-opacity opacity-0 ltr:-left-6 rtl:-right-6 not-prose group-hover:opacity-100">
&lt;a class="group-hover:text-primary-300 dark:group-hover:text-neutral-700 !no-underline" href="#timeline" aria-label="Anchor">#&lt;/a>
&lt;/span>
&lt;/h2>
&lt;p>09:30 - Services slow, services down&lt;br>
10:00 - Attempt to upgrade Ubuntu and reboot VM&lt;br>
10:00 - CPU spiking to 100% across all 8 cores&lt;br>
10:15 - Increase core count to 16 and reboot VM&lt;br>
10:30 - Slow recovery but some services still down&lt;br>
16:00 - Server not on network&lt;br>
18:00 - Server powered on but no response&lt;br>
18:30 - Server disassembled and left to cool - fans cleaned a bit&lt;br>
19:00 - Services recovered&lt;/p>
&lt;h2 class="relative group">Findings
&lt;div id="findings" class="anchor">&lt;/div>
&lt;span
class="absolute top-0 w-6 transition-opacity opacity-0 ltr:-left-6 rtl:-right-6 not-prose group-hover:opacity-100">
&lt;a class="group-hover:text-primary-300 dark:group-hover:text-neutral-700 !no-underline" href="#findings" aria-label="Anchor">#&lt;/a>
&lt;/span>
&lt;/h2>
&lt;ul>
&lt;li>There was no indication that temperature was an issue, even after our primary on-call engineer (me) said &amp;ldquo;it&amp;rsquo;s pretty hot in here (lounge)&amp;rdquo;.&lt;/li>
&lt;li>No monitoring of system stats present&lt;/li>
&lt;li>While upping the CPU core count helped initially, it ultimately made the situation worse by overheating the server&lt;/li>
&lt;li>No notification system in place for failures; notification that the system was down came via a third party (ADSB Exchange)&lt;/li>
&lt;li>Fans are really dirty&lt;/li>
&lt;/ul>
&lt;h2 class="relative group">Notes
&lt;div id="notes" class="anchor">&lt;/div>
&lt;span
class="absolute top-0 w-6 transition-opacity opacity-0 ltr:-left-6 rtl:-right-6 not-prose group-hover:opacity-100">
&lt;a class="group-hover:text-primary-300 dark:group-hover:text-neutral-700 !no-underline" href="#notes" aria-label="Anchor">#&lt;/a>
&lt;/span>
&lt;/h2>
&lt;p>External logging (partially done) and monitoring need to be put in place.&lt;/p>
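&lt;p>As a rough idea of what the temperature side of that monitoring could look like, here is a minimal sketch. It assumes a Linux host that exposes sensors under /sys/class/thermal (values are in millidegrees Celsius). The 80 degree threshold and the function names are made up for illustration and are not part of my current setup.&lt;/p>

```python
from pathlib import Path

# Hypothetical threshold; pick one that suits the actual hardware.
WARN_CELSIUS = 80.0


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading such as '45000' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_zone_temps(base: Path = Path("/sys/class/thermal")) -> dict[str, float]:
    """Return {zone_name: celsius} for every readable thermal zone."""
    temps = {}
    for temp_file in sorted(base.glob("thermal_zone*/temp")):
        try:
            temps[temp_file.parent.name] = parse_millidegrees(temp_file.read_text())
        except (OSError, ValueError):
            continue  # some zones refuse reads or report garbage; skip them
    return temps


def hot_zones(temps: dict[str, float], limit: float = WARN_CELSIUS) -> list[str]:
    """Names of zones at or above the limit."""
    return [name for name, celsius in temps.items() if celsius >= limit]


if __name__ == "__main__":
    readings = read_zone_temps()
    for name in hot_zones(readings):
        print(f"WARNING: {name} is at {readings[name]:.1f} C")
```

&lt;p>Run from cron or a systemd timer, something like this would at least leave a warning in the logs before the box cooks itself. Wiring the warning into an actual notification is a separate job.&lt;/p>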
&lt;p>Cans of compressed air have been ordered so that the fan can be cleared out properly to help airflow.&lt;/p>
&lt;p>A bigger rework of the &amp;ldquo;server cabinet&amp;rdquo; (it&amp;rsquo;s a few shelves in the lounge) needs to be done. If the server cabinet is moved into the garage then temperature and dust wouldn&amp;rsquo;t be too much of an issue. Actually having a proper server cabinet would be nice as well!&lt;/p>
&lt;p>The reverse proxy setup is an annoying problem: if the main VM goes down then I lose access to friendly URLs for Proxmox. I&amp;rsquo;ve documented the IP in my runbooks, but it&amp;rsquo;d be nicer to pull the proxy setup into its own LXC container (partially done) that boots first.&lt;/p>
&lt;p>There is a rogue Forgejo container running on boot and I&amp;rsquo;ve no idea where it&amp;rsquo;s set up. I need to remove it properly since it&amp;rsquo;s not needed.&lt;/p>
&lt;p>Rclone mounts get corrupted very easily. I had to run the disable/enable/re-setup process twice for the Docker mounts. It also means that other Docker services can&amp;rsquo;t be started properly.&lt;/p>
&lt;h2 class="relative group">Conclusion
&lt;div id="conclusion" class="anchor">&lt;/div>
&lt;span
class="absolute top-0 w-6 transition-opacity opacity-0 ltr:-left-6 rtl:-right-6 not-prose group-hover:opacity-100">
&lt;a class="group-hover:text-primary-300 dark:group-hover:text-neutral-700 !no-underline" href="#conclusion" aria-label="Anchor">#&lt;/a>
&lt;/span>
&lt;/h2>
&lt;p>It was a typical homelab failure: lots learnt and lots to do to improve things. I&amp;rsquo;m a bit annoyed that I didn&amp;rsquo;t have temperature down as a potential point of failure in my system. I&amp;rsquo;ve no doubt decreased the lifespan of the hardware now as well.&lt;/p>
&lt;p>It was made clear to me by stakeholders (my wife) that it was not acceptable that Home Assistant was down. My pay has been docked this month.&lt;/p>
&lt;p>Thanks for reading via RSS!&lt;/p>
&lt;p>Send me a message on &lt;a href="https://remotelab.uk/mikebell">Mastodon&lt;/a> or &lt;a href="mailto:hello@mikebell.io">email me&lt;/a>&lt;/p></description></item></channel></rss>