2018-07-25 00:32:57 UTC
I have encountered a problem on one kind of hardware; the same setup works correctly on all our other systems. While grub is reading the initrd file, it hits a general protection fault (GPF). The BIOS prints a register dump and returns to grub, which promptly hits another GPF. This repeats forever until I power-cycle the system.
We're using the official Debian grub packages. I see the problem with Debian 2.02~beta2-22+deb8u1 and Debian 2.02~beta3-5, as well as with grub built from the top of the upstream source tree about two weeks ago. It never fails with grub 1.95.
The distinguishing feature of the systems that fail is that they do not use EFI and they boot from mdraid partitions. All the EFI systems work fine, with or without mdraid. All the non-EFI systems with only a single boot drive (no raid) work fine.
When it hits the GPF, EIP is pointing into the middle of a block of zeroes, which seems unlikely to be a real code area.
Here are some further things I've found out:
1. If I enter the grub command prompt and execute the commands manually, it works. The same commands, when read from the grub.cfg file, hit the error.
2. If I edit grub.cfg and insert a 3-second sleep after the kernel is read but before the initrd is read, it works.
3. If I hack grub_cmd_initrd() to invalidate the grub disk cache at the top of the function, that also makes it work (a sketch of this hack appears after this list).
4. I booted a failing system with a rescue method and deleted all the raid partitions. Since we use raid metadata format 1.0, the system should still be bootable from the raw disk rather than through the raid. On reboot I was dropped into grub-rescue, because grub could no longer find the raid volume that existed when I ran grub-install. However, I pointed it at the /boot on the raw disk device, (hd0,2), and voila, it booted (the commands I typed are also shown after this list). This makes me think the problem is in grub's low-level raid code for the i386-pc case: grub was reading exactly the same blocks it reads in the failure case, but reading them without the mdraid driver succeeds, while reading them through the mdraid driver fails.
5. The initrd is about 5.7 Mbytes. I copied over a smaller one from another system, only 3.6 Mbytes, and that made it work. I suspect the smaller initrd finishes loading before the 2-second cache timeout (described below) elapses.
6. Changing the mdraid metadata format doesn't help. I've tried 0.9, 1.0, and 1.2. All behave the same way.
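For reference, the hack from item 3 looks roughly like this. It is paraphrased rather than a literal patch, and the exact file (grub-core/loader/i386/linux.c in my build) may differ by target; grub_disk_cache_invalidate_all() is the existing invalidation routine from grub-core/kern/disk.c:

    /* grub-core/loader/i386/linux.c -- paraphrased, not a literal patch;
       may need #include <grub/disk.h> if it isn't already pulled in.  */
    static grub_err_t
    grub_cmd_initrd (grub_command_t cmd __attribute__ ((unused)),
                     int argc, char *argv[])
    {
      /* The hack: throw away every cached disk block before the initrd
         is read.  With this one call added, the GPF goes away.  */
      grub_disk_cache_invalidate_all ();

      /* ... rest of the function unchanged ... */
    }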
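And for item 4, the recovery from the grub-rescue prompt was essentially the standard sequence (reconstructed from memory, so treat the exact device and path as approximate; /boot is its own partition on these machines, so grub's files live under /grub on it):

    grub rescue> set root=(hd0,2)
    grub rescue> set prefix=(hd0,2)/grub
    grub rescue> insmod normal
    grub rescue> normal

After that the normal menu came up and the system booted.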
I began suspecting an interaction with the cache when I noticed that grub's disk cache timeout is 2 seconds, and I had separately found that a delay of more than 2 seconds fixes the problem. So there is some kind of interaction between the raid driver and the grub cache at work here. I tried changing the cache algorithm in various ways so that blocks would hash to different locations, but that didn't help. I also tried putting /boot in a different partition, and that didn't help either.
So at the moment, I can proceed with the workaround of adding a sleep in the grub.cfg file. However, it would be much better if we could get clarity that this is indeed a grub bug. I hesitate to file a bug at the moment because it would be very difficult to provide a way to reproduce it. I'm working to see if I can reproduce it on generic hardware.
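For anyone who wants to see it, the workaround is literally just one line inserted by hand between the linux and initrd lines of the menuentry in grub.cfg (kernel version and root device below are placeholders, and of course update-grub will overwrite the edit):

    menuentry 'Debian GNU/Linux' {
            ...
            linux   /vmlinuz-3.16.0-4-amd64 root=/dev/md0 ro quiet
            # wait out grub's 2-second disk cache timeout before reading the initrd
            sleep 3
            initrd  /initrd.img-3.16.0-4-amd64
    }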
If anyone can provide hints for how to debug this, they would be most welcome. At the moment I'm modifying the source, adding printouts, etc., then recompiling grub and reinstalling it on the target. Also, if anyone knows a quick hack to disable the grub cache, I'd like to try it; the best idea I've come up with myself so far is sketched below.
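The closest thing to a cache-disable hack I have, still untested: if I'm reading grub-core/kern/disk.c correctly, grub_disk_cache_fetch() is the only place a read is satisfied from the cache, so forcing it to always miss should effectively disable caching, at the cost of re-reading every block. The function name and signature here are paraphrased from my reading of the source, so check them against the tree:

    /* grub-core/kern/disk.c -- untested sketch, not a verified patch.  */
    static char *
    grub_disk_cache_fetch (unsigned long dev_id, unsigned long disk_id,
                           grub_disk_addr_t sector)
    {
      (void) dev_id;
      (void) disk_id;
      (void) sector;
      /* Pretend nothing is ever cached, so every read goes all the way
         down to the underlying disk (or mdraid) driver.  */
      return 0;
    }

If the failure disappears with that in place too, it would be more evidence that the cache/raid interaction is the culprit.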