Thanks to my most recent PR being merged, op-build v2.5 will have full support for the Raptor Blackbird! This includes support for the “IPL Monitor” that’s required to get fan control going.
Note that if you’re running Fedora 32 then you need some patches to buildroot to have it build, but if you’re building on something a little older, then upstream should build and work straight out of the box (err… git tree).
I also note that the work to get Secure Boot for an OS Kernel going is starting to make its way out for code reviews, so that’s something to look forward to (although without a TPM we’re going to need extra code).
I have done a few builds of firmware for the Raptor Blackbird since I got mine, each of them based on upstream op-build plus a few patches. The previous one was Yet another near-upstream Raptor Blackbird firmware build that I built a couple of months ago. This new build is based off the release candidate of op-build v2.5. Here’s what’s changed:
There are two differences from upstream op-build: my pull request to op-build, and the fixing of the (old) buildroot so that it’ll build on Fedora 32. From discussions on the openpower-firmware mailing list, it seems the hope is to have all the Blackbird support merged in before the final op-build v2.5 is tagged. The previous op-build release (v2.4) was tagged in July 2019, so we’re about 10 months into what was meant to be a 2 month release cycle, which makes speculating on when that final release will land somewhat difficult.
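For completeness, building an image yourself follows the standard op-build workflow; roughly this (a sketch – the blackbird_defconfig name assumes the newly merged Blackbird support, and you’d check out whatever the v2.5 release candidate tag ends up being, or a branch with the Fedora 32 buildroot fixes if you need them):
git clone --recursive https://github.com/open-power/op-build.git
cd op-build
# check out the v2.5 release candidate (or a patched branch)
. op-build-env
op-build blackbird_defconfig
op-build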
So, following on from my post on Sensors on the Blackbird (and thus Power9), I mentioned that when you look at the temperature sensors for each CPU core in my 8-core POWER9 chip, they’re not linear numbers. Let’s look at what that means….
stewart@blackbird9$ sudo ipmitool sensor | grep core
p0_core0_temp | na
p0_core1_temp | na
p0_core2_temp | na
p0_core3_temp | 38.000
p0_core4_temp | na
p0_core5_temp | 38.000
p0_core6_temp | na
p0_core7_temp | 38.000
p0_core8_temp | na
p0_core9_temp | na
p0_core10_temp | na
p0_core11_temp | 37.000
p0_core12_temp | na
p0_core13_temp | na
p0_core14_temp | na
p0_core15_temp | 37.000
p0_core16_temp | na
p0_core17_temp | 37.000
p0_core18_temp | na
p0_core19_temp | 39.000
p0_core20_temp | na
p0_core21_temp | 39.000
p0_core22_temp | na
p0_core23_temp | na
You can see I have eight CPU cores in my Blackbird system. The reason the 8 CPU cores are core 3, 5, 7, 11, 15, 17, 19, and 21 rather than 0-7 or something is that these represent the core numbers on the physical die, and the die is a 24 core die. When you’re making a chip as big and as complex as a modern high performance CPU, not all of the chips coming out of your fab are going to be perfect, so this is how you get different models in the line with only one production line.
Weirdly, the output from the hwmon sensors includes a “core 24” and a “core 28”. That’s just… wrong. What it is, however, is right if you think of 8*4=32. This is a product of Linux conflating threads and cores in some ways: this numbering is the first thread of each logical core.
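If you want to see where those hwmon labels come from, you can poke at sysfs directly (the exact hwmon paths vary between systems, so treat this as illustrative):
grep . /sys/class/hwmon/hwmon*/temp*_label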
But let’s ignore that and go from the IPMI sensors, which also match what the OCC shows with “occtoolp9 -SL” (see below).
$ ./occtoolp9 -SL
Sensor Details: (found 86 sensors, details only for Status of 0x00)
GUID Name Sample Min Max U Stat Accum UpdFreq ScaleFactr Loc Type
....
0x00ED TEMPC03……… 47 29 47 C 0x00 0x00037CF2 0x00007D00 0x00000100 0x0040 0x0008
0x00EF TEMPC05……… 37 26 39 C 0x00 0x00014E53 0x00007D00 0x00000100 0x0040 0x0008
0x00F1 TEMPC07……… 46 28 46 C 0x00 0x0001A777 0x00007D00 0x00000100 0x0040 0x0008
0x00F5 TEMPC11……… 44 27 45 C 0x00 0x00018402 0x00007D00 0x00000100 0x0040 0x0008
0x00F9 TEMPC15……… 36 25 43 C 0x00 0x000183BC 0x00007D00 0x00000100 0x0040 0x0008
0x00FB TEMPC17……… 38 28 41 C 0x00 0x00015474 0x00007D00 0x00000100 0x0040 0x0008
0x00FD TEMPC19……… 43 27 44 C 0x00 0x00016589 0x00007D00 0x00000100 0x0040 0x0008
0x00FF TEMPC21……… 36 30 40 C 0x00 0x00015CA9 0x00007D00 0x00000100 0x0040 0x0008
So what does that mean for physical layout? Well, like all modern high performance chips, the POWER9 is modular, with a bunch of logic being replicated all over the die. The most notable duplicated parts are the core (replicated 24 times!) and cache structures. Less so are memory controllers and PCI hardware.
You can see that each pair of cores (e.g. EC00 and EC01) is paired with a cache block (EC00 and EC01 go with EP00). That’s two POWER9 cores with one 512KB L2 cache and one 10MB L3 cache.
You can see the cache layout (including L1 Instruction and Data caches) by looking in sysfs:
$ for i in /sys/devices/system/cpu/cpu0/cache/index*/; \
do echo -n $(cat $i/level) $(cat $i/size) $(cat $i/type); \
echo; done
1 32K Data
1 32K Instruction
2 512K Unified
3 10240K Unified
So, what does the layout of my POWER9 chip look like? Well, thanks to the power of graphics software, we can cross some cores out and look at the topology:
If I run some memory bandwidth benchmarks, you can see the L3 cache capacity you’d expect from the above diagram: 80MB (10MB/core). Let’s see:
If all the cores were packed together, I’d expect that cliff to be a lot sooner.
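The benchmark itself is nothing fancy; conceptually it’s something like this simplified sketch (not the exact tool used for the graphs): stream over a buffer of increasing size, time it, and watch for the point where the working set no longer fits in 80MB of L3.
/* bw_sweep.c - simplified working set size sweep (illustrative only).
 * Compile with something like: gcc -O2 -o bw_sweep bw_sweep.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    for (size_t mb = 1; mb <= 256; mb *= 2) {
        size_t bytes = mb << 20;
        unsigned char *buf = malloc(bytes);
        memset(buf, 1, bytes);

        struct timespec start, end;
        volatile unsigned long sum = 0;
        int passes = 10;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < bytes; i += 64)   /* touch every 64 bytes */
                sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;
        /* each touched stride pulls (at least) a cache line, so this is a
         * rough lower bound on the bandwidth achieved for this working set */
        printf("%4zu MB working set: %6.2f GB/s\n",
               mb, (double)passes * bytes / secs / 1e9);
        free(buf);
    }
    return 0;
}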
So how does this compare to other machines I have around? Well, let’s look at my Ryzen 7. Specifically, an “AMD Ryzen 7 1700 Eight-Core Processor”. The cache layout is:
$ for i in /sys/devices/system/cpu/cpu0/cache/index*/; \
do echo -n $(cat $i/level) $(cat $i/size) $(cat $i/type); \
echo; \
done
1 32K Data
1 64K Instruction
2 512K Unified
3 8192K Unified
And then a performance benchmark similar to the one I ran above on the POWER9 (the drop-off comes a bit sooner, as 8MB is less than 10MB):
I’m not sure what performance conclusions we can realistically draw from these curves, apart from “keeping workload to L3 cache is cool”, and “different chips have different cache hardware”, and “I should probably go and read and remember more about the microarchitectural characteristics of the cache hardware in Ryzen 7 hardware and 10th gen Intel Core hardware”.
In this post we’re going to look at three different ways to read various sensors in the Raptor Blackbird system. The Blackbird is a single socket uATX board for the POWER9 processor. One advantage of the system is completely open source firmware, so you can (like I have) build your own firmware. So, this is my Blackbird running my most recent firmware build (the BMC is running the 2.00 release from Raptor).
Sensors over IPMI
One way to get the sensors is over IPMI. This can be done either in-band (as in, from the OS running on the blackbird), or over the network.
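For reference, the two flavours look something like this (the BMC address and credentials below are placeholders for whatever your BMC uses):
# in-band, from the OS running on the Blackbird:
sudo ipmitool sensor
# out-of-band, over the network to the BMC:
ipmitool -I lanplus -H blackbird-bmc -U admin -P yourpassword sensor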
stewart@blackbird9$ sudo ipmitool sensor |head
occ | na | discrete | na | na | na | na | na | na | na
occ0 | 0x0 | discrete | 0x0200| na | na | na | na | na | na
occ1 | 0x0 | discrete | 0x0100| na | na | na | na | na | na
p0_core0_temp | na | | na | na | na | na | na | na | na
p0_core1_temp | na | | na | na | na | na | na | na | na
p0_core2_temp | na | | na | na | na | na | na | na | na
p0_core3_temp | 38.000 | degrees C | ok | na | -40.000 | na | 78.000 | 90.000 | na
p0_core4_temp | na | | na | na | na | na | na | na | na
p0_core5_temp | 38.000 | degrees C | ok | na | -40.000 | na | 78.000 | 90.000 | na
p0_core6_temp | na | | na | na | na | na | na | na | na
It’s kind of annoying to read there, so standard unix tools to the rescue!
stewart@blackbird9$ sudo ipmitool sensor | cut -d '|' -f 1,2
occ | na
occ0 | 0x0
occ1 | 0x0
p0_core0_temp | na
p0_core1_temp | na
p0_core2_temp | na
p0_core3_temp | 38.000
p0_core4_temp | na
p0_core5_temp | 38.000
p0_core6_temp | na
p0_core7_temp | 38.000
p0_core8_temp | na
p0_core9_temp | na
p0_core10_temp | na
p0_core11_temp | 37.000
p0_core12_temp | na
p0_core13_temp | na
p0_core14_temp | na
p0_core15_temp | 37.000
p0_core16_temp | na
p0_core17_temp | 37.000
p0_core18_temp | na
p0_core19_temp | 39.000
p0_core20_temp | na
p0_core21_temp | 39.000
p0_core22_temp | na
p0_core23_temp | na
p0_vdd_temp | 40.000
dimm0_temp | 35.000
dimm1_temp | na
dimm2_temp | na
dimm3_temp | na
dimm4_temp | 38.000
dimm5_temp | na
dimm6_temp | na
dimm7_temp | na
dimm8_temp | na
dimm9_temp | na
dimm10_temp | na
dimm11_temp | na
dimm12_temp | na
dimm13_temp | na
dimm14_temp | na
dimm15_temp | na
fan0 | 1200.000
fan1 | 1100.000
fan2 | 1000.000
p0_power | 33.000
p0_vdd_power | 5.000
p0_vdn_power | 9.000
cpu_1_ambient | 30.600
pcie | 27.000
ambient | 26.000
You can see that I have 3 fans, two DIMMs (although why it lists 16 possible DIMMs for a two DIMM slot board is a good question!), and eight CPU cores. More on why the layout of the CPU cores is the way it is in a future post.
The code path for reading these sensors is interesting: it all goes via the BMC, so we have the OCC inside the P9 read things, the BMC then reads them from the OCC, and passes them back to the P9. On the P9 itself, each sensor read is a call all the way into firmware and back! In fact, we can look at it in perf:
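Roughly how that profile was gathered (the exact invocation isn’t critical; this is just the usual record/report pattern):
# record a call graph while polling the sensors...
sudo perf record -g -- ipmitool sensor
# ...then look at where the time went - this is where the unresolved
# 0x300xxxxx addresses show up
sudo perf report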
What are the 0x300xxxxx addresses? They’re the OPAL firmware (i.e. skiboot). We can look up the symbols easily, as the firmware exposes them to the kernel, which then plonks it in sysfs:
[stewart@blackbird9 ~]$ sudo head /sys/firmware/opal/symbol_map
[sudo] password for stewart:
0000000000000000 R __builtin_kernel_end
0000000000000000 R __builtin_kernel_start
0000000000000000 T __head
0000000000000000 T _start
0000000000000010 T fdt_entry
00000000000000f0 t boot_sem
00000000000000f4 t boot_flag
00000000000000f8 T attn_trigger
00000000000000fc T hir_trigger
0000000000000100 t sreset_vector
So we can easily look up exactly where this is:
[stewart@blackbird9 ~]$ sudo grep '18e.. ' /sys/firmware/opal/symbol_map
0000000000018e20 t .__try_lock.isra.0
0000000000018e68 t .add_lock_request
So we’re managing to spend a whole 12% of execution time spinning on a spinlock in firmware! The call stack of what’s going on in firmware isn’t so easy, but we can find the bt_add_ipmi_msg call there which is probably how everything starts:
[stewart@blackbird9 ~]$ sudo grep '516.. ' /sys/firmware/opal/symbol_map
0000000000051614 t .bt_add_ipmi_msg_head
0000000000051688 t .bt_add_ipmi_msg
00000000000516fc t .bt_poll
OCCTOOL
This is the most not-what-you’re-meant-to-use method of getting access to sensors! It’s using a debug tool for the OCC firmware! There’s a variety of tools in the OCC source repository, and one of them (occtoolp9) can be used for a variety of things, one of which is getting sensor data out of the OCC.
The odd thing you’ll see is “via opal-prd” – and this is because it’s doing raw calls to the opal-prd binary to talk to the OCC firmware, running things like “opal-prd --expert-mode htmgt-passthru“. Yeah, this isn’t an in-production thing :)
Amazingly (and interestingly), this doesn’t go through host firmware in the way that an IPMI call will. There’s a full OCC/Host firmware interface spec to read. But it’s an insanely inefficient way to monitor sensors: a long bash script shelling out to a whole bunch of other processes… Think ~14.4 billion cycles versus ~367 million cycles for the ipmitool option above.
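If you want to reproduce that kind of comparison yourself, something along these lines works (a sketch; your numbers will differ):
sudo perf stat -e cycles ./occtoolp9 -SL
sudo perf stat -e cycles ipmitool sensor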
But there are some interesting sensors at the end of the list:
Sensor Details: (found 86 sensors, details only for Status of 0x00)
GUID Name Sample Min Max U Stat Accum UpdFreq ScaleFactr Loc Type
....
0x014A MRDM0……….. 688 3 15015 GBs 0x00 0x0144AE6C 0x00001901 0x000080FB 0x0008 0x0200
0x014E MRDM4……….. 480 3 14739 GBs 0x00 0x01190930 0x00001901 0x000080FB 0x0008 0x0200
0x0156 MWRM0……….. 560 4 16605 GBs 0x00 0x014C61FD 0x00001901 0x000080FB 0x0008 0x0200
0x015A MWRM4……….. 360 4 16597 GBs 0x00 0x014AE231 0x00001901 0x000080FB 0x0008 0x0200
Is that memory bandwidth? Well, if I run the STREAM benchmark in a loop and look again:
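Concretely, “in a loop” just means something like this (with a separately built STREAM binary; the names are placeholders), while watching the memory bandwidth sensors from another terminal:
# hammer memory in the background...
while true; do ./stream; done &
# ...and watch the OCC's memory read/write bandwidth sensors
./occtoolp9 -SL | grep -E 'MRDM|MWRM'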
In what is becoming a monthly occurrence, I’ve put up yet another firmware build for the Raptor Blackbird with close-to-upstream firmware (see here and here for previous ones).
Well, I’ve done another build! It’s current op-build (as of yesterday), but my branch with patches for the Raptor Blackbird. The skiboot patch is there, and the SBE speedup patch is now upstream. The machine-xml is straight from Raptor, but hosted in my repo.
If we compare this to the last build I put up, we have:
Component           old                            new
skiboot             v6.5-209-g179d53df-p4360f95    v6.5-228-g82aed17a-p4360f95
linux               5.4.13-openpower1-pa361bec     5.4.22-openpower1-pdbbf8c8
occ                 3ab2921                        no change
hostboot            779761d-pe7e80e1               acdff8a-pe7e80e1
buildroot           2019.05.3-14-g17f117295f       2019.05.3-15-g3a4fc2a888
capp-ucode          p9-dd2-v4                      no change
machine-xml         site_local-stewart-a0efd66     no change
hostboot-binaries   hw011120a.opmst                hw013120a.opmst
sbe                 166b70c-p06fc80c               c318ab0-p1ddf83c
hcode               hw011520a.opmst                hw030220a.opmst
petitboot           v1.11                          v1.12
version             blackbird-v2.4-415-gb63b36ef   blackbird-v2.4-514-g62d1a941
So, what do those changes mean? Not too much changed over the past month. Kernel bump, new petitboot (although I can’t find release notes, it doesn’t look like there are a lot of changes), and slight bumps to other firmware components.
To flash it, copy blackbird.pnor to your Blackbird’s BMC in /tmp/ (important! the /tmp filesystem has enough room, the home directory for root does not), and then run:
pflash -E -p /tmp/blackbird.pnor
Which will ask you to confirm and then flash:
About to erase chip !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
Erasing... (may take a while)
[==================================================] 99% ETA:1s
done !
About to program "/tmp/blackbird.pnor" at 0x00000000..0x04000000 !
Programming & Verifying...
[==================================================] 100% ETA:0s
A few weeks ago (okay, close to six), I put up a firmware build for the Raptor Blackbird with close-to-upstream firmware (see here).
Well, I’ve done another build! It’s current op-build (as of this morning), but my branch with patches for the Raptor Blackbird. The skiboot patch is there, as is the SBE speedup patch. Current kernel (works fine with my hardware), current petitboot, and the machine-xml which is straight from Raptor but in my repo.
The Self Boot Engine (SBE) is a small embedded PPE42 core inside the POWER9 CPU which has the unenviable job of getting a single POWER9 core ready enough to start executing instructions out of L3 cache, and poking some instructions into said cache for the core to start executing.
It’s called the “Self Boot Engine” as in generations prior to POWER8, it was the job of the FSP (Service Processor) to do all of the booting for the CPU. On POWER8, there was still an SBE, but it was a custom instruction set (this was the Power On Reset Engine – PORE), while the PPE42 is basically a 32bit powerpc core cut straight down the middle (just the way to make it awkward for toolchains).
One of the things I noted in my post on Booting temporary firmware on the Raptor Blackbird is that we got serial console output from the SBE. It turns out one of the things explicitly not enabled by Raptor in their build was this output, as “it made the SBE boot much slower”. I’d actually long suspected this, but hadn’t really had the time to delve into it.
WARNING: hacking on your SBE firmware can be relatively dangerous, as it’s literally the first thing that needs to work in order to boot the system, and there isn’t (AFAIK) a publicly documented easy way to re-flash your SBE firmware if you mess it up.
Seeing as we saw a regression in boot time with the UART output enabled, we need to look at the uartPutChar() function in sbeConsole.C (error paths removed for clarity):
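I won’t reproduce the SBE source here, but its shape is roughly this (a reconstruction from the description rather than the actual sbeConsole.C code; the register helpers and names are illustrative): write a single byte, then spin until the UART says the transmitter is completely empty before returning.
/* Illustrative reconstruction only - not the real SBE source.
 * One byte goes out, then we poll the Line Status Register until the
 * transmitter is empty before returning... for every single character. */
static void uartPutChar(char c)
{
    writeReg(REG_THR, c);                    /* transmit holding register */
    while (!(readReg(REG_LSR) & LSR_THRE))   /* wait for "transmitter empty" */
        ;
}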
One thing you may notice if you’ve spent some time around serial ports is that it’s not using the transmit FIFO! According to Wikipedia the original 16550 had a broken FIFO, but we’re certainly not going to be hooked up to an original rev of that silicon.
To compare, let’s look at the skiboot code, which is all in hw/lpc-uart.c:
The uart_check_tx_room() function is pretty simple, it checks if there’s room in the FIFO and knows that there’s 16 entries. Next, we have a busy loop that waits until there’s room again in the FIFO:
static void uart_wait_tx_room(void)
{
while (!tx_room) {
uart_check_tx_room();
if (!tx_room) {
smt_lowest();
do {
barrier();
uart_check_tx_room();
} while (!tx_room);
smt_medium();
}
}
}
Finally, the bit of code that writes the (internal) log buffer out to a serial port:
/*
* Internal console driver (output only)
*/
static size_t uart_con_write(const char *buf, size_t len)
{
size_t written = 0;
/* If LPC bus is bad, we just swallow data */
if (!lpc_ok() && !mmio_uart_base)
return written;
lock(&uart_lock);
while(written < len) {
if (tx_room == 0) {
uart_wait_tx_room();
if (tx_room == 0)
goto bail;
} else {
uart_write(REG_THR, buf[written++]);
tx_room--;
}
}
bail:
unlock(&uart_lock);
return written;
}
The skiboot code ends up being a bit more complicated for a number of reasons, but the basic algorithm could be applied to the SBE code: rather than busy waiting for each character to be written out before sending the next one into the FIFO, we could just splat things down there and continue with life. So, I put together a patch to try out.
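The patch itself lives in the pull request, but the idea is essentially the skiboot approach transplanted; something like this sketch (again with illustrative register helpers, not the actual patch):
/* Sketch of the batched approach: once the FIFO has drained we know
 * there's room for 16 bytes, so write up to 16 characters back-to-back
 * before polling the Line Status Register again. */
static void uartPutBuf(const char *buf, size_t len)
{
    size_t written = 0;

    while (written < len) {
        while (!(readReg(REG_LSR) & LSR_THRE))  /* wait for FIFO to drain */
            ;
        for (int room = 16; room > 0 && written < len; room--)
            writeReg(REG_THR, buf[written++]);  /* splat a FIFO's worth down */
    }
}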
Before (i.e. upstream SBE code): it took about 15 seconds from “Welcome to SBE” to “Booting Hostboot”.
Now (with my patch): Around 10 seconds.
It’s a full five seconds (33%) faster to get through the SBE stage of booting. Wow.
Hopefully somebody looks at the pull request sometime soon, as it’s probably useful to a lot of people doing firmware and Operating System development.
So, Happy New Year for Blackbird owners (I’ll publish a build with this and other misc improvements “soon”).
It goes without saying that using this build is At Your Own Risk and I make zero warranty. AFAIK it can’t physically destroy your system.
My GitHub op-build branch stewart-blackbird-v1 has all the changes built into this build (the VERSION displayed in firmware will be slightly weird as I did the tagging afterwards… this is not meant to be “howto release firmware to the public”). Follow op-build pull 3341 for the state of upstreaming everything.
Linux v5.4.3-openpower1, because the 5.3.7 kernel in the current op-build repo has an NVMe driver bug that meant it didn’t recognize my Intel NVMe drive, and thus prevented me from actually booting an OS.
Apart from that, I was all happy as Larry. Except then I went into the room with the Blackbird in it and went “huh, that’s loud”, and since it was bedtime, I decided it could all wait until the morning.
It is now the morning. Checking fan speeds over IPMI, one fan stood out (fan2, sitting at 4300RPM). This was a bit of a surprise, as what’s silkscreened on the board is that the rear case fan is hooked up to “fan2”, and if we had a “start from 0/1” mix up, it’d be the front case fan. I had just assumed it’d be maybe OCC firmware dying or something, but this wasn’t the case (I checked – thanks occtoolp9!).
After a bit of digging around, I worked out this mapping:
IPMI fan0    Rear Case Fan     Motherboard Fan 2
IPMI fan1    Front Case Fan    Motherboard Fan 3
IPMI fan2    CPU Fan           Motherboard Fan 1
Which is about as surprising and confusing as you’d think.
After a bunch of digging around the Raptor ports of OpenBMC and Hostboot, it seems that the IPL Observer (which is custom to Raptor) controls whether the BMC decides to do fan control or not.
You can get its view of the world from the BMC via the (incredibly user friendly) poking at DBus:
If you just have the Hostboot patch in (like I first did), you end up with:
s "IPL_RUNNING"
s "21,3"
Which is where Hostboot exits the IPL process (as you see on the screen) and hands over to skiboot. But if you start digging through their op-build tree, you find that there’s a signal_linux_start_complete script which calls pnv-lpc to write two values to LPC ports 0x81 and 0x82. The pnv-lpc utility is the external/lpc/ binary from skiboot, and these two ports are the “extended lpc port 80h” state.
So, to get back fan control? First, build the lpc utility:
git clone git@github.com:open-power/skiboot.git
cd skiboot/external/lpc
make
and then poke the magic values of “IPL complete and linux running”:
$ sudo ./lpc io 0x81.b=254
[io] W 0x00000081.b=0xfe
$ sudo ./lpc io 0x82.b=254
[io] W 0x00000082.b=0xfe
You get a friendly beep, and then your fans return to sanity.
Of course, for that to work you need to have debugfs mounted, as this pokes OPAL debugfs to do direct LPC operations.
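If debugfs isn’t already mounted (it usually is on a distro kernel), that’s just:
sudo mount -t debugfs none /sys/kernel/debug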
Next up: think of a smarter way to trigger that than “stewart runs it on the command line”. Also next up: work out the better way to determine that fan control should be on and patch the BMC.
In a future post, I’ll detail how to build my ported-to-upstream Blackbird firmware. Here though, we’ll explore booting some firmware temporarily to experiment.
Step 1: Copy your new PNOR image over to the BMC. Step 2: … Step 3: Profit!
Okay, not really. Once you’ve copied over your image, ensure the computer is off, and then you can tell the daemon that provides firmware to the host to use a file backend rather than the PNOR chip on the motherboard (i.e. yes, you can boot your system even when the firmware chip isn’t there – although I’ve not literally tried this).
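From memory, the daemon in question is mboxbridge and the switch is done with mboxctl; something along the lines of the following, though double-check the option names against mboxctl --help on your BMC build before trusting me:
# on the BMC, with the host powered off:
mboxctl --backend file:/tmp/blackbird.pnor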
One of the improvements is we now get output from the SBE! This means that when we do things like mess up secure boot and non secure boot firmware (I’ll explain why/how this is a thing later), we’ll actually get something useful out of a serial port:
And then we’re back into normal Hostboot boot (which we’ve all seen before) and end up at a newer petitboot!
One notable absence from that screenshot is my installed Fedora is missing. This is because there appears to be a bug in the 5.3.7 kernel that’s currently upstream, and if we drop to the shell and poke at lspci and dmesg, we can work out what could be the culprit:
Exiting petitboot. Type 'exit' to return.
You may run 'pb-sos' to gather diagnostic data
No password set, running as root. You may set a password in the System Configuration screen.
# lspci
0000:00:00.0 PCI bridge: IBM Device 04c1
0001:00:00.0 PCI bridge: IBM Device 04c1
0001:01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a8 (rev 03)
0002:00:00.0 PCI bridge: IBM Device 04c1
0002:01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0003:00:00.0 PCI bridge: IBM Device 04c1
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM Device 04c1
0004:01:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0005:00:00.0 PCI bridge: IBM Device 04c1
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
# dmesg|grep -i nvme
[ 2.991038] nvme nvme0: pci function 0001:01:00.0
[ 2.991088] nvme 0001:01:00.0: enabling device (0140 -> 0142)
[ 3.121799] nvme nvme0: Identify Controller failed (19)
[ 3.121802] nvme nvme0: Removing after probe failure status: -5
# uname -a
Linux skiroot 5.3.7-openpower1 #2 SMP Sat Dec 14 09:06:20 PST 2019 ppc64le GNU/Linux
If for some reason the device didn’t show up in lspci, then I’d look at the skiboot firmware log, which is /sys/firmware/opal/msglog.
Looking at upstream stable kernel patches, it seems like 5.3.8 has an interesting looking patch when you realize that ppc64le uses a 64k page size:
commit efac0f186ea654e8389f5017c7f643ef48cb4b93
Author: Kevin Hao <haokexin@gmail.com>
Date: Fri Oct 18 10:53:14 2019 +0800
nvme-pci: Set the prp2 correctly when using more than 4k page
commit a4f40484e7f1dff56bb9f286cc59ffa36e0259eb upstream.
In the current code, the nvme is using a fixed 4k PRP entry size,
but if the kernel use a page size which is more than 4k, we should
consider the situation that the bv_offset may be larger than the
dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then
cause the command can't be executed correctly.
Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
So, time to go try 5.3.8. My yaks are getting quite smooth.
Oh, and when you’re done with your temporary firmware, either fiddle with mboxctl or restart the systemd service for it, or reboot your BMC or… well, I gotta leave you something to work out on your own :)
Now that I can actually boot the machine, I could test and send my patch upstream for Blackbird support in skiboot. One thing I noticed with the current firmware from Raptor is that the PCIe slot names were wrong. While a pretty minor point, it’s a bit funny that there’s only two slots and the names were wrong.
The PCIe slot names are used to call out the physical location of PCIe cards in the system, so if you, say, hit a bunch of errors, OS/firmware can say “It’s this card in the slot labeled BLAH on the board”.
With my patch, the slot table from skiboot is spat out looking like this:
If you want to give it a go, grab the patch, build skiboot, and flash it on. Alternatively, you can download a built skiboot here. To flash it, do this:
# Copy to your BMC for the Blackbird
scp skiboot-v6.5-146-g376bed3f.lid.xz.stb root@blackbird:/tmp/
# then, ssh to the BMC
$ ssh root@blackbird
# ensure the machine is off
obmcutil poweroff --wait
# Now, make a backup copy (remember to copy it off /tmp on the bmc)
pflash -P PAYLOAD -r /tmp/skiboot-backup
# and flash the new skiboot:
pflash -e -P PAYLOAD -p /tmp/skiboot.lid.xz.stb
# now, power on the box
obmcutil poweron
Well, after the half false start of not having RAM and so really not being able to do much (yeah yeah, I hear you – I’m weak for not just running Linux in L3), my RAM arrived today. Putting the sticks in was easy (of course), although it does not make for an exciting photo.
After that, I SSH’d to the BMC and then did “obmcutil poweron” (as is traditional) and started looking at the console by connecting via SSH to port 2200 on the BMC. I was then greeted by the (by this time in my life rather familiar) Hostboot:
The first IPL updated the Self Boot Engine firmware on the chip, so it automatically applied the new firmware and rebooted to finish applying it. This is perfectly normal; it just shows itself as a longer boot time. Booting continues:
The rest of the skiboot log was also spat out, and then the familiar Petitboot screen:
It lives! I even had a bit of a look at the sensors to see power consumption and temperatures. All looks good:
ipmitool sdr|grep -v ns
occ0 | 0x00 | ok
occ1 | 0x00 | ok
p0_core3_temp | 51 degrees C | ok
p0_core5_temp | 49 degrees C | ok
p0_core7_temp | 50 degrees C | ok
p0_core11_temp | 49 degrees C | ok
p0_core15_temp | 50 degrees C | ok
p0_core17_temp | 50 degrees C | ok
p0_core19_temp | 50 degrees C | ok
p0_core21_temp | 50 degrees C | ok
dimm0_temp | 36 degrees C | ok
dimm4_temp | 39 degrees C | ok
fan0 | 1300 RPM | ok
fan1 | 1200 RPM | ok
fan2 | 1000 RPM | ok
p0_power | 60 Watts | ok
p0_vdd_power | 31 Watts | ok
p0_vdn_power | 10 Watts | ok
cpu_1_ambient | 30.90 degrees C | ok
pcie | 27 degrees C | ok
ambient | 25.40 degrees C | ok
Way back when Raptor Computer Systems was doing pre-orders for the microATX Blackbird POWER9 system, I put in a pre-order. Since then, I’ve had a few life changes (such as moving to the US and starting to work for Amazon rather than IBM), but I’ve finally gone and done (most of) the setup for my own POWER9 system on (or under) my desk.
Everything came in a big brown box, all rather well packed. I had the board, CPU, heatsink assembly and the special tool to attach the heatsink to the board. Although unique to POWER9, the heatsink/fan assembly was one of the easier ones I’ve ever attached to a board.
The board itself looks pretty much as you’d expect – there’s a big spot for the CPU, a couple of PCI slots, a couple of DIMM slots and some SATA connectors.
The bits that are a bit unusual for a micro-ATX board are the big space reserved for FlexVer, the ASPEED BMC chip and the socketed flash. FlexVer is something I’m not ever going to use, and I instead wish there were an on-board M.2 SSD slot, even if it were just PCIe. Having to sacrifice a PCIe slot just for an SSD is kind of a bummer.
One annoying thing is my DIMMs are taking their sweet time in getting here, so I couldn’t actually populate the board with any memory.
Even without memory though, you can start powering it on and see that everything else works okay (i.e. it’s not completely boned). So, even without DIMMs, I could plug it in, and observe the Hostboot firmware complaining about insufficient hardware to IPL the box.
It Lives!
Yep, out of the console (via ssh) you can clearly see where things fail:
--== Welcome to Hostboot hostboot-3beba24/hbicore.bin ==--
3.03104|secure|SecureROM valid - enabling functionality
6.67619|Booting from SBE side 0 on master proc=00050000
6.85100|ISTEP 6. 5 - host_init_fsi
7.23753|ISTEP 6. 6 - host_set_ipl_parms
7.71759|ISTEP 6. 7 - host_discover_targets
11.34738|HWAS|PRESENT> Proc[05]=8000000000000000
11.34739|HWAS|PRESENT> Core[07]=1511540000000000
11.69077|ISTEP 6. 8 - host_update_master_tpm
11.73787|SECURE|Security Access Bit> 0x0000000000000000
11.73787|SECURE|Secure Mode Disable (via Jumper)> 0x8000000000000000
11.76276|ISTEP 6. 9 - host_gard
11.96654|HWAS|FUNCTIONAL> Proc[05]=8000000000000000
11.96655|HWAS|FUNCTIONAL> Core[07]=1511540000000000
12.07554|================================================
12.07554|Error reported by hwas (0x0C00) PLID 0x90000007
12.10289| checkMinimumHardware found no functional dimm cards.
12.10290| ModuleId 0x03 MOD_CHECK_MIN_HW
12.10291| ReasonCode 0x0c06 RC_SYSAVAIL_NO_MEMORY_FUNC
12.10292| UserData1 HUID of node : 0x0002000000000000
12.10293| UserData2 number of present, non-functional dimms : 0x0000000000000000
12.10294|------------------------------------------------
12.10417| Callout type : Procedure Callout
12.10417| Procedure : EPUB_PRC_FIND_DECONFIGURED_PART
12.10418| Priority : SRCI_PRIORITY_HIGH
12.10419|------------------------------------------------
12.10420| Hostboot Build ID: hostboot-3beba24/hbicore.bin
12.10421|================================================
12.51718|================================================
12.51719|Error reported by hwas (0x0C00) PLID 0x90000007
12.51720| Insufficient hardware to continue.
12.51721| ModuleId 0x03 MOD_CHECK_MIN_HW
12.51722| ReasonCode 0x0c04 RC_SYSAVAIL_INSUFFICIENT_HW
12.54457| UserData1 : 0x0000000000000000
12.54458| UserData2 : 0x0000000000000000
12.54458|------------------------------------------------
12.54459| Callout type : Procedure Callout
12.54460| Procedure : EPUB_PRC_FIND_DECONFIGURED_PART
12.54461| Priority : SRCI_PRIORITY_HIGH
12.54462|------------------------------------------------
12.54462| Hostboot Build ID: hostboot-3beba24/hbicore.bin
12.54463|================================================
12.73660|System shutting down with error status 0x90000007
12.75545|================================================
12.75546|Error reported by istep (0x1700) PLID 0x90000007
12.77991| IStep failed, see other log(s) with the same PLID for reason.
12.77992| ModuleId 0x01 MOD_REPORTING_ERROR
12.77993| ReasonCode 0x1703 RC_FAILURE
12.77994| UserData1 eid of first error : 0x9000000800000c04
12.77995| UserData2 Reason code of first error : 0x0000000100000609
12.77996|------------------------------------------------
12.77996| host_gard
12.77997|------------------------------------------------
12.77998| Callout type : Procedure Callout
12.77998| Procedure : EPUB_PRC_HB_CODE
12.77999| Priority : SRCI_PRIORITY_LOW
12.78000|------------------------------------------------
12.78001| Hostboot Build ID: hostboot-3beba24/hbicore.bin
12.78002|================================================
Looking forward to getting some DIMMs to show/share more.
So, with my new Blackbird system I decided to take a bit of a look as to what the firmware situation was like.
There’s two main parts of firmware: BMC and Host. The BMC firmware runs purely on the ASPEED AST2500 and is based on OpenBMC while the host firmware is what runs on the POWER9 and is based off of OpenPOWER Firmware as assembled by op-build.
Initial impressions on the BMC are that there doesn’t seem to be any web based UI for it, which is kind of disappointing, as the Web UI being developed upstream has some nice qualities, and I’d say I even enjoyed using it when it was built into BMC firmware for systems we had when I was at IBM.
Looking at the git trees, the raptor-v1.00 tag is OpenBMC 2.7.0-dev-533-g386e5602e while current master is 2.8.0-dev-960-g10f7830bd. The spot where it split off was 2.7.0-dev-430-g7443ee80b, from April 2019 – so it’s not too old, but I’m also not convinced there haven’t been security patches since then.
I’m not sure if any of the OpenBMC code is upstream; I haven’t looked.
Unfortunately, none of the host firmware is upstream.
On the host firmware side, v2.3-rc2-67-ga6a5f142 is the Raptor tag, and that compares with current master of v2.4-305-g54d8daf4; the place where Raptor forked was v2.3-rc2-9-g7b556015, again in April of 2019. Considering there was an upstream release in May of 2019 (v2.3), and again in July (v2.4), it could easily have made it into an upstream release.
Unfortunately, there doesn’t seem to have been an upstream op-build release since v2.4 back in July (when I made it shortly before leaving IBM).
The skiboot component of host firmware has had an upstream release since I left (v6.5 in mid-August 2019), so the (rather trivial) platform support could have easily made it. I have a cleaned up and ready to upstream patch for it, I just need some DIMMs to actually test with before I send the patch.
As the current firmware situation stands, producing another build with updated upstream code is tricky due to the out-of-tree nature of the Blackbird patches, and a straight “git merge” is probably doable by some people, but not everybody.
On my TODO list is to get all the code into a state I can upstream it, assess vulnerability to CVE-2019-6260, and work out how I want to make it do Secure Boot (something that isn’t in upstream firmware yet, and currently would require a TPM, which I do not have).
On OpenPOWER POWER9 systems, we typically talk to the flash chips that hold firmware for the host (i.e. the POWER9) processor through a daemon running on the BMC (aka service processor) rather than directly.
We have host firmware map “windows” on the LPC bus to parts of the flash chip. This flash chip can in fact be a virtual one, constructed dynamically from files on the BMC.
Since we’re mapping windows into this flash address space, we have some knowledge as to what IO the host is doing to/from the pnor. We can use this to output data in the blktrace format and feed into existing tools used to analyze IO patterns.
So, with a bit of learning of the data format and learning how to drive the various tools, I was ready to patch the BMC daemon (mboxbridge) to get some data out.
An initial bit of data is a graph of the windows into PNOR opened up during a normal boot (see below).
PNOR windows created over the course of a normal boot.
This shows us that over the course of the boot, we open a bunch of windows, and switch them around a fair bit early on. This makes sense as early in boot we do not yet have DRAM working and page in firmware on-demand into L3 cache.
Later in boot, you can see the loading of larger chunks of firmware into memory. It’s also possible to see that this seems to take longer than it should – and indeed, we have a bug there.
Next, by modifying the code again, I introduced recording of when we used a window that the BMC had already cached. While the host will only see one window at a time, the BMC can keep around the ones it prepared earlier in order to avoid IO to the actual flash chips (which are SPI flash, so aren’t incredibly fast).
Here we can see that we’re likely not doing the most efficient things during boot, and there’s probably room for some optimization.
Normal boot, but including re-used windows rather than just created ones.
Finally, in order to get finer grained information, I reduced the window size from one megabyte down to 4096 bytes. This will impose a heavy speed penalty as it’ll mean we will have to create a lot more windows to do the same amount of IO, but it means that since we’re using the page size of hostboot, we’ll see each individual page in/out operation that it does during boot.
So, from the next graph, we can see that there’s several “hot” areas of the image, and on the whole it’s not too many pages. This gives us a hint that a bit of effort to reduce binary image size a little bit could greatly reduce the amount of IO we have to do.
4096 byte (i.e. page) size window, capturing the bits of flash we need to read in several times due to being low on memory when we’re L3 cache constrained.
The iowatcher tool can also construct a video of the boot and what “blocks” are being read.
Video of what blocks are read from flash during booting
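For reference, going from a blktrace-format dump to those graphs and the movie is the standard iowatcher workflow; roughly this (file names are placeholders):
# static graph of IO over the course of the boot
iowatcher -t pnor-trace.dump -o boot.svg
# movie of which blocks get read as the boot progresses
iowatcher -t pnor-trace.dump --movie -o boot.mp4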
So, what do we get from this adventure? Well, we get a good list of things to look into in order to improve boot performance, and we get to back these up with data rather than guesswork. Since this also works on unmodified host firmware, we’re measuring what we really boot rather than changing it in order to measure it.
You may have heard of ccache (Compiler Cache) which saves you heaps of real world time when rebuilding a source tree that is rather similar to one you’ve recently built before. It’s really useful in buildroot based projects where you’re building similar trees, or have done a minor bump of some components.
In trying to find a commit which introduced a bug in op-build (OpenPOWER firmware), I noticed that hostboot wasn’t being built using ccache and we were always doing a full build. So, I started digging into it.
It turns out that a number of the perl scripts for parsing the Machine Readable Workbook XML in hostboot did things like foreach $key (%hash) – which means that the code iterates over the items in hash order rather than an order that would produce predictable output, such as sorting by attribute name. So… much messing with that later, I had hostboot generating the same output for the same input on every build.
Next step was to work out why I was still getting a lot of CCACHE misses. It turns out the default ccache size is 5GB. A full hostboot build uses around 7.1GB of that.
So, if building op-build with ccache, be sure to set BR2_CCACHE=y in your config as well as something like BR2_CCACHE_INITIAL_SETUP="--max-size 20G".
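In config form, that’s just these two lines in your op-build/buildroot configuration:
BR2_CCACHE=y
BR2_CCACHE_INITIAL_SETUP="--max-size 20G"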
Hopefully my patches hit hostboot and op-build soon.
I’ve been working on trying to better document the whole flow of code that goes into a build of firmware for an OpenPOWER machine. This is partially to help those not familiar with it get a better grasp of the sheer scale of what goes into that 32/64MB of flash.
I also wanted to convey the components that we heavily re-used from other Open Source projects, what parts are still “IBM internal” (as they relate to the open source workflow) and which bits are primarily contributed to by IBMers (at least at this point in time).
As such, let’s start with the legend of the diagram:
Now, the diagram:
The end thing that a user with a machine will download and apply (or that comes shipped with a box) is the purple “Installable Firmware Release” nodes (bottom center). In this diagram, there are 4 of them. One for POWER9 systems such as the just-announced AC922 system (this is the “OP910 Release” node, which is the witherspoon_defconfig in the op-build tree); one for the p9dsu platform (p9dsu_defconfig in op-build) and one is for IBM FSP based systems such as the S812L and S822L systems (or S812/S822 in OPAL mode).
There are more platforms out there, but this diagram is meant to be simplified. The key difference with the p9dsu platform is that this is produced by somebody other than IBM.
All of these releases are based off the upstream op-build project, op-build is the light blue box in the center of the diagram. We do regular X.Y releases and sometimes do X.Y.Z releases. It’s primarily a pull request based workflow currently, so everything goes via a pull request. The op-build project brings together all the POWER specific firmware components (pretty much everything in every other light blue/blue box) along with a Linux kernel and buildroot.
The kernel and buildroot are the two big yellow boxes on the top right. Buildroot brings together a lot of open source components that are in our firmware image (including some power specific ones that we get through upstream buildroot).
For Linux, this is a pretty simplified view of the process, but we primarily ship the stable tree (with maybe up to half a dozen patches).
The skiboot and petitboot components both use a mailing list based workflow (similar to kernel) as well as X.Y and X.Y.Z releases (again, similar to the linux kernel).
On the far left of the diagram, we have Hostboot, SBE and OCC. These are three firmware components that come from the traditional IBM POWER Firmware group, and are shared with the IBM non-OpenPOWER POWER systems (“traditional” POWER). These components have part of their code come from an (internal) repository called “ekb” which also goes into a (very) low level debug tool and the FSP based systems. There’s also an (internal) gerrit instance that’s the primary place where code review/development discussions are for these components.
In future posts, I’ll probably delve into more specifics of the current development process, and how we may try and change things for the better.
Recently, I added the package lrzsz to op-build in this commit. This package provides the rz and sz commands – for receive zmodem and send zmodem respectively. For those who don’t know, op-build builds a firmware image for OpenPOWER machines, and adding this package adds the commands to the petitboot shell (the busybox environment you get when you “exit to shell” from the boot menu).
For those who aren’t familiar with ZMODEM, you should first get off my lawn, and secondly, know that it’s a method for sending/receiving files over something akin to a serial port, e.g. a modem. The basic protocol is “I want to send you this file named FOO”, “okay, I would like to receive it”, “here’s some data and a checksum, did you get it and does it match the checksum?”, “yes!”, “okay, great, here’s the next bit” until the file is transferred. Importantly, it has a provision for “no, I didn’t get that right” and for bits to be resent.
The one thing that pretty much always somewhat works on a computer is a serial port (or something that looks like a serial port to software). When you connect to the IPMI console (“ipmitool sol activate”), the host sees this as a serial port that it pumps bytes over. With OpenBMC, you can actually connect to this serial port via SSH.
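Concretely, the two ways of getting at that console look something like this (BMC address and credentials are placeholders):
# IPMI Serial over LAN:
ipmitool -I lanplus -H blackbird-bmc -U admin -P yourpassword sol activate
# or, with OpenBMC, straight to the console server over SSH:
ssh -p 2200 root@blackbird-bmc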
When diagnosing weird problems during firmware bringup such as “why doesn’t PCI work” or “why does my network adapter not work” (or, perhaps, somebody helpfully didn’t plug the network cable in), it can be useful to copy off a bunch of debug information from the machine.
You may say “can’t you just print the log file to the screen and save it?” and you’d be right, you can do that for text – it’s really annoying for binary data though. Plus, there have been bugs in the console implementations on pretty much every BMC I’ve ever used that makes them not as reliable as you’d like.
So, how could we transfer a file over the serial connection we have to the machine? The same way we did on a BBS! Enter ZMODEM. The error recovery properties are perfect in this situation.
So, how do you use it? I’ve found two ways that work well: GNU screen and zssh. For GNU Screen, you’ll want to configure it to catch zmodem traffic by doing “control-a:zmodem catch<ENTER>” (you need the colon there). After that, if you execute “sz” on the host, the rest should be obvious! If you wanted to send a file to the host, run “rz” rather than “sz”.
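As a concrete example, with screen’s zmodem catch enabled on your end, pulling a file off the machine from the petitboot shell is just (the file name is a placeholder):
# on the host end of the console, send whatever you want off the box
sz /tmp/diagnostics.tar.gz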
One of the things I’ve been working on fairly quietly is the test suite for OpenPOWER firmware: op-test-framework. I’ve approached things I’m hacking on from the goal of “when I merge patches into skiboot, can I be confident that I haven’t merged something that’s broken existing functionality?”
By testing host firmware, we incidentally (as well as on purpose) test a whole bunch of BMC functionality. One bit of functionality we rely on a lot is the host “serial” console. Typically, this is exposed to the user over IPMI Serial Over LAN (SOL), or on OpenBMC it’s also exposed as something you can ssh to (which proves to be both faster and more reliable than IPMI, not to mention there’s some remote semblance of security).
When running through some tests, I noticed something pretty odd: it appeared as though we were sometimes missing some console output on larger IOs. This usually isn’t a problem, as when we’re using expect(1) (or the python equivalent, pexpect) we end up having all sorts of delays here, there and everywhere to work around all the terrible things you hope you never learn. So… how could I test that? Well… what about checking the output of something like dd if=/dev/zero bs=1024 count=16 | hexdump -C to see if we get the full output?
Time to add a test to op-test-framework! Adding such a test is pretty easy. If we look at the source of the test I added, we can see what happens (source here).
class Console8k(Console, unittest.TestCase):
bs = 1024
count = 8
class Console16k(Console, unittest.TestCase):
bs = 1024
count = 16
class Console32k(Console, unittest.TestCase):
bs = 1024
count = 32
The setUp() function is pure boilerplate: we grab some objects from the configuration of the test run, namely information about the BMC and the system itself, so we can do things to both. The real magic happens in runTest().
op-test-framework tracks the state of the machine being tested across each test. This means that if we’re executing 101 tests in the petitboot shell, we don’t need to do 101 separate boots to petitboot. The self.system.goto_state(OpSystemState.PETITBOOT_SHELL) statement says “Please ensure the system is booted to the petitboot shell”. Other states include OFF (obvious) and OS, which is when the machine is booted to the OS.
The next two lines ensure we can run commands on the console (where console is IPMI Serial over LAN or other similar connection, such as the SSH console provided by OpenBMC):
The host_console_unique_prompt() call is a bit ugly, and I’m hoping we fix the APIs so that this isn’t needed. Basically, it sets things up so that pexpect will work properly.
The bit that does the work is the try/except block along with the assertTrue. We don’t currently check that the content is all correct, we just check we got the right *amount* of content.
It turns out, this check is enough to reveal a bug that turns out to be deep in the core Linux TTY layer, and has caused Jeremy some amount of fun (for certain values of fun).