Ben Leskey | Blog

Debugging slow boot and a clogged router

2023-03-08 #debugging #software

I powered on my laptop yesterday morning and saw the Ubuntu boot had stalled on two systemd services: trying to mount a disk by UUID, with a 90 second timeout, and waiting for the network to come up, with no timeout. Before the 90 seconds for mounting the disk was up, the boot continued and I got to the graphical login menu. Network manager tried to connect to my friends' home Wi-Fi, but the connection failed.

This was odd:

Hypothesis 1: Something was wrong with my laptop

I rebooted the laptop. Always gotta turn it off and back on again. Unfortunantely, the same things happened.

Hypothesis 1.1: Something was wrong with an fstab entry

The first step was to look at the odd error out: the timeout to mount a disk. I looked up the UUID and found that it was an entry in fstab that tried to mount an external volume as ntfs, but it did have nofail specified so the failure to mount shouldn't be an error during boot. This entry had been working for months, so I didn't see why it would be a problem now. Nevertheless I added x-systemd.device-timeout=1 to the mount options and rebooted. Mounting the disk timed out quickly during boot now, and didn't cause any further issues. I guess the reason I never noticed that message before was that it does not affect graphical boot at all, and just quietly timed out after 90 seconds in the background.

This left the long wait for the network to come up during boot, and the failure to connect to the Wi-Fi.

Hypothesis 1.2: Something was wrong with my laptop's network ability

These two remaining problems were obviously tied together: Network manager waited during boot until the network was up, but because it kept trying unsuccessfully to connect to the Wi-Fi it took it until some timeout to bring the network up once it realized that it couldn't connect to the Wi-Fi. Since I'm not running a network-critical server with my laptop, I ran sudo systemctl disable NetworkManager-wait-online.service and turned off the boot delay. I don't need network capability to start my laptop, network manager can connect later, even after I'm logged in.

Now I was down to one problem: my laptop couldn't connect to the Wi-Fi.

Hypothesis 1.2.1: Something was wrong with my laptop's Wi-Fi radio

One of my fears was that I'd damaged my antenna or radio card somehow. To test this, I turned on my phone's hotspot and tried to connect with my laptop. Success. While this didn't entirely rule out my laptop as the problem, it did indicate the issue lay elsewhere.

When I turned my phone's hotspot off, it failed to connect back to the Wi-Fi. Now I was getting somewhere.

Hypothesis 2: Something was wrong with the router

The router was a Verizon FiOS-G1100.

This hypothesis seemed most reasonable now that I had two devices failing to connect. Other people in the house were still using the Wi-Fi, however, with no issues. One key to this hypothesis is that there were far more people on the WAP than normal: nearly a dozen people on the same Wi-Fi, each with their phones, laptops, tablets, etc. I would expect to see unusual failure under these relatively unusual conditions.

Eventually I was able to reboot the router, expecting that to fix the issue. The problem persisted.

I decided to try using a static IP address on my phone instead of using DHCP. It connected to the network with full internet access.

Hypothesis 2.1: Something was wrong with the router DHCP server

Using a static IP address on my laptop solved that connection issue as well. Clearly, there was something wrong with the DHCP server.

When someone noticed that the Roku TV wouldn't connect to the Wi-Fi, I tried to assign it a static IP address as well, but Roku only supports DHCP IP assigning. It was time to fix the router.

Hypothesis 2.1.1: Too many DHCP assignments

My working hypothesis was that somehow the amount of people on the network had run out of IP assignments. I couldn't find evidence in the router console of too many outstanding leases, but I wanted to try resetting the DHCP server anyway, guessing that the outstanding leases were hidden and only the current devices were being show.

There was no simple option to reset the DHCP server. Instead, I changed the IP assignment and gateway range from 192.168.1.x to 192.168.2.x, about which the console helpfully provided a warning saying it would require all devices to be assigned a new IP. Perfect.

As soon as I hit apply, the problems vanished. I re-enabled using DHCP on my phone and laptop, and they connected perfectly. The Roku TV connected, as did every other device in the house.

I still don't know exactly what went wrong, but I've got more experience for next time I run into this kind of problem. Plus, being able to fix it was kinda fun!