One of the things I’ve been working on fairly quietly is the test suite for OpenPOWER firmware: op-test-framework. I’ve approach things I’m hacking on from the goal of “when I merge patches into skiboot, can I be confident that I haven’t merged something that’s broken existing functionality?”
By testing host firmware, we incidentally (as well as on purpose) test a whole bunch of BMC functionality. One bit of functionality we rely on a lot is the host “serial” console. Typically, this is exposed to the user over IPMI Serial Over LAN (SOL), or on OpenBMC it’s also exposed as something you can ssh to (which proves to be both faster and more reliable than IPMI, not to mention there’s some remote semblance of security).
When running through some tests, I noticed something pretty odd, it appeared as though we were sometimes missing some console output on larger IOs. This usually isn’t a problem as when we’re using expect(1) (or the python equivalent pexpect) we end up having all sorts of delays here there and everywhere to work around all the terrible things you hope you never learn. So… how could I test that? Well.. what about checking the output of something like dd if=/dev/zero bs=1024 count=16|hexdump -C
to see if we get the full output?
Time to add a test to op-test-framework! Adding such a test is pretty easy. If we look at the source of the test I added, we can see what happens (source here).
class Console(): bs = 1024 count = 8 def setUp(self): conf = OpTestConfiguration.conf self.bmc = conf.bmc() self.system = conf.system() def runTest(self): self.system.goto_state(OpSystemState.PETITBOOT_SHELL) console = self.bmc.get_host_console() self.system.host_console_unique_prompt() bs = self.bs count = self.count self.assertTrue( (bs*count)%16 == 0, "Bug in test writer. Must be multiple of 16 bytes: bs %u count %u / 16 = %u" % (bs, count, (bs*count)%16)) try: zeros = console.run_command("dd if=/dev/zero bs=%u count=%u|hexdump -C -v" % (bs, count), timeout=120) except CommandFailed as cf: self.assertEqual(cf.exitcode, 0) self.assertTrue( len(zeros) == 3+(count*bs)/16, "Unexpected length of zeros %u" % (len(zeros)))
First thing you’ll notice is that this looks like a Python unittest. It’s because it is. The unittest infrastructure was a path of least resistance, so we started with it. This class isn’t the one that’s actually run, we do a little bit of inheritance magic in order to run the same test with different parameters (see https://github.com/open-power/op-test-framework/blob/6c74fb0fb0993ae8ae1a7aa62ec58e57c0080686/testcases/Console.py#L50)
class Console8k(Console, unittest.TestCase): bs = 1024 count = 8 class Console16k(Console, unittest.TestCase): bs = 1024 count = 16 class Console32k(Console, unittest.TestCase): bs = 1024 count = 32
The setUp()
function is pure boiler plate, we grab some objects from the configuration of the test run, namely information about the BMC and the system itself, so we can do things to both. The real magic happens in runTest().
op-test-framework tracks the state of the machine being tested across each test. This means that if we’re executing 101 tests in the petitboot shell, we don’t need to do 101 separate boots to petitboot. The self.system.goto_state(OpSystemState.PETITBOOT_SHELL)
statement says “Please ensure the system is booted to the petitboot shell”. Other states include OFF (obvious) and OS, which is when the machine is booted to the OS.
The next two lines ensure we can run commands on the console (where console is IPMI Serial over LAN or other similar connection, such as the SSH console provided by OpenBMC):
console = self.bmc.get_host_console() self.system.host_console_unique_prompt()
The host_console_unique_prompt() call is a bit ugly, and I’m hoping we fix the APIs so that this isn’t needed. Basically, it sets things up so that pexpect will work properly.
The bit that does the work is the try/except block along with the assertTrue. We don’t currently check that the content is all correct, we just check we got the right *amount* of content.
It turns out, this check is enough to reveal a bug that turns out to be deep in the core Linux TTY layer, and has caused Jeremy some amount of fun (for certain values of fun).
Want to know more about how the console works? Jeremy blogged on it.
So it turns out that this fixes most of the issue:
https://lkml.org/lkml/2017/7/25/149
– but the underlying tty buffer problem still applies, hence the pending “fun”!