So, following on from my post on sensors on the Blackbird (and thus POWER9): I mentioned that when you look at the temperature sensors for each CPU core in my 8-core POWER9 chip, they’re not numbered linearly. Let’s look at what that means…
```
stewart@blackbird9$ sudo ipmitool sensor | grep core
p0_core0_temp  | na
p0_core1_temp  | na
p0_core2_temp  | na
p0_core3_temp  | 38.000
p0_core4_temp  | na
p0_core5_temp  | 38.000
p0_core6_temp  | na
p0_core7_temp  | 38.000
p0_core8_temp  | na
p0_core9_temp  | na
p0_core10_temp | na
p0_core11_temp | 37.000
p0_core12_temp | na
p0_core13_temp | na
p0_core14_temp | na
p0_core15_temp | 37.000
p0_core16_temp | na
p0_core17_temp | 37.000
p0_core18_temp | na
p0_core19_temp | 39.000
p0_core20_temp | na
p0_core21_temp | 39.000
p0_core22_temp | na
p0_core23_temp | na
```
You can see I have eight CPU cores in my Blackbird system. The reason the eight CPU cores are cores 3, 5, 7, 11, 15, 17, 19, and 21 rather than 0-7 is that these are the core numbers on the physical die, and the die is a 24-core die. When you’re making a chip as big and as complex as a modern high-performance CPU, not all of the chips coming out of your fab are going to be perfect, so fusing off defective cores is how you get different models in the line from only one production line.
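As a quick sanity check (standard Linux sysfs, nothing POWER-specific), you can count how many cores the kernel actually sees by de-duplicating the core ids, since all hardware threads of a core report the same `core_id`:

```shell
# Count physical cores visible to Linux: every thread of a core shares
# one core_id, so the number of unique values is the number of cores
# (8 on this box, regardless of how the die numbers them).
cat /sys/devices/system/cpu/cpu*/topology/core_id | sort -un | wc -l
```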
Weirdly, the hwmon sensors number things differently, which is why there’s a “core 24” and a “core 28”. That’s just… wrong, until you notice that 8×4=32. It’s a product of Linux treating thread=core in some ways: with SMT4, this numbering is the CPU id of the first thread of each logical core.
```
[stewart@blackbird9 ~]$ sensors | grep -i core
Chip 0 Core 0:  +39.0°C  (lowest = +25.0°C, highest = +71.0°C)
Chip 0 Core 4:  +39.0°C  (lowest = +26.0°C, highest = +66.0°C)
Chip 0 Core 8:  +39.0°C  (lowest = +27.0°C, highest = +67.0°C)
Chip 0 Core 12: +39.0°C  (lowest = +26.0°C, highest = +67.0°C)
Chip 0 Core 16: +39.0°C  (lowest = +25.0°C, highest = +67.0°C)
Chip 0 Core 20: +39.0°C  (lowest = +26.0°C, highest = +69.0°C)
Chip 0 Core 24: +39.0°C  (lowest = +27.0°C, highest = +67.0°C)
Chip 0 Core 28: +39.0°C  (lowest = +27.0°C, highest = +64.0°C)
```
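You can confirm that thread-to-core mapping from sysfs: with SMT4 enabled, the four threads of each core share one siblings list, so the first thread of each core lands on CPU ids 0, 4, 8, and so on (a generic Linux check; the exact lists depend on your machine’s SMT mode):

```shell
# Print each Linux CPU with the thread siblings it shares a core with.
# On an SMT4 POWER9 this reads "cpu0: 0-3", "cpu4: 4-7", ..., which is
# exactly why hwmon labels the cores 0, 4, 8, ... 28.
for d in /sys/devices/system/cpu/cpu[0-9]*; do
  printf '%s: %s\n' "${d##*/}" "$(cat "$d/topology/thread_siblings_list")"
done
```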
But let’s ignore that and go from the IPMI sensors, which also match what the OCC shows with “occtoolp9 -SL” (see below).
```
$ ./occtoolp9 -SL
Sensor Details: (found 86 sensors, details only for Status of 0x00)
GUID   Name         Sample Min Max U Stat Accum      UpdFreq    ScaleFactr Loc    Type
....
0x00ED TEMPC03………   47     29  47  C 0x00 0x00037CF2 0x00007D00 0x00000100 0x0040 0x0008
0x00EF TEMPC05………   37     26  39  C 0x00 0x00014E53 0x00007D00 0x00000100 0x0040 0x0008
0x00F1 TEMPC07………   46     28  46  C 0x00 0x0001A777 0x00007D00 0x00000100 0x0040 0x0008
0x00F5 TEMPC11………   44     27  45  C 0x00 0x00018402 0x00007D00 0x00000100 0x0040 0x0008
0x00F9 TEMPC15………   36     25  43  C 0x00 0x000183BC 0x00007D00 0x00000100 0x0040 0x0008
0x00FB TEMPC17………   38     28  41  C 0x00 0x00015474 0x00007D00 0x00000100 0x0040 0x0008
0x00FD TEMPC19………   43     27  44  C 0x00 0x00016589 0x00007D00 0x00000100 0x0040 0x0008
0x00FF TEMPC21………   36     30  40  C 0x00 0x00015CA9 0x00007D00 0x00000100 0x0040 0x0008
```
So what does that mean for physical layout? Well, like all modern high performance chips, the POWER9 is modular, with a bunch of logic being replicated all over the die. The most notable duplicated parts are the core (replicated 24 times!) and cache structures. Less so are memory controllers and PCI hardware.
Note that the cores come in pairs sharing a cache block: e.g. cores EC00 and EC01 both attach to EP00. That’s two POWER9 cores with one 512KB L2 cache and one 10MB L3 cache.
You can see the cache layout (including the L1 instruction and data caches) by looking in sysfs:
```
$ for i in /sys/devices/system/cpu/cpu0/cache/index*/; \
    do echo -n $(cat $i/level) $(cat $i/size) $(cat $i/type); \
    echo; done
1 32K Data
1 32K Instruction
2 512K Unified
3 10240K Unified
```
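sysfs also records which CPUs share each cache, which is a handy way to check the core/cache pairing without a die diagram (standard Linux paths; the exact CPU lists depend on the machine and which cores are fused off):

```shell
# For cpu0, show which CPUs share each of its cache levels.
for i in /sys/devices/system/cpu/cpu0/cache/index*/; do
  echo "L$(cat $i/level) $(cat $i/type): shared with CPUs $(cat $i/shared_cpu_list)"
done
```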
So, what does the layout of my POWER9 chip look like? Well, thanks to the power of graphics software, we can cross some cores out and look at the topology:
If I run some memory bandwidth benchmarks, I can see the L3 cache capacity you’d expect from the diagram above: 80MB (10MB/core). Let’s see:
```
[stewart@blackbird9 lmbench3]$ for i in 5M 10M 20M 30M 40M 50M 60M 70M 80M 500M; \
    do echo -n "$i "; ./bin/bw_mem -N 100 $i rd; \
    done
5M 5.24 63971.98
10M 10.49 31940.14
20M 20.97 17620.16
30M 31.46 18540.64
40M 41.94 18831.06
50M 52.43 17372.03
60M 62.91 16072.18
70M 73.40 14873.42
80M 83.89 14150.82
500M 524.29 14421.35
```
If all the cores were packed together, I’d expect that cliff to be a lot sooner.
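To make the cliff easier to eyeball, here’s a quick awk over the bw_mem figures above (hand-copied into a printf, so it runs anywhere), showing each working-set size’s bandwidth as a percentage of the 5M in-cache run:

```shell
# Bandwidth at each working-set size relative to the 5M (in-cache) run,
# using the POWER9 bw_mem numbers from the output above.
printf '%s\n' \
  '5M 63971.98' '10M 31940.14' '20M 17620.16' '40M 18831.06' \
  '80M 14150.82' '500M 14421.35' |
awk 'NR==1 {base=$2} {printf "%-5s %9.2f MB/s  %3.0f%%\n", $1, $2, 100*$2/base}'
```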
So how does this compare to other machines I have around? Well, let’s look at my Ryzen 7, specifically an “AMD Ryzen 7 1700 Eight-Core Processor”. The cache layout is:
```
$ for i in /sys/devices/system/cpu/cpu0/cache/index*/; \
    do echo -n $(cat $i/level) $(cat $i/size) $(cat $i/type); \
    echo; done
1 32K Data
1 64K Instruction
2 512K Unified
3 8192K Unified
```
And then a performance benchmark similar to the one I ran above on the POWER9 (the numbers fall off a bit earlier, as 8MB of L3 is less than 10MB):
```
$ for i in 4M 8M 16M 24M 32M 40M 48M 56M 64M 72M 80M 500M; \
    do echo -n "$i "; ./bin/x86_64-linux-gnu/bw_mem -N 10 $i rd; \
    done
4M 4.19 61111.04
8M 8.39 28596.55
16M 16.78 21415.12
24M 25.17 20153.57
32M 33.55 20448.20
40M 41.94 20940.11
48M 50.33 20281.39
56M 58.72 21600.24
64M 67.11 21284.13
72M 75.50 20596.18
80M 83.89 20802.40
500M 524.29 21489.27
```
And my laptop? It’s a four-core part, specifically an “Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz”, with a cache layout like:
```
$ for i in /sys/devices/system/cpu/cpu0/cache/index*/; \
    do echo -n $(cat $i/level) $(cat $i/size) $(cat $i/type); \
    echo; done
1 32K Data
1 32K Instruction
2 256K Unified
3 6144K Unified
```
```
$ for i in 3M 6M 12M 18M 24M 30M 36M 42M 500M; \
    do echo -n "$i "; ./bin/x86_64-linux-gnu/bw_mem -N 10 $i rd; \
    done
3M 3.15 48500.24
6M 6.29 27144.16
12M 12.58 18731.80
18M 18.87 17757.74
24M 25.17 17154.12
30M 31.46 17135.87
36M 37.75 16899.75
42M 44.04 16865.44
500M 524.29 16817.10
```
I’m not sure what performance conclusions we can realistically draw from these curves, apart from “keeping workload to L3 cache is cool”, and “different chips have different cache hardware”, and “I should probably go and read and remember more about the microarchitectural characteristics of the cache hardware in Ryzen 7 hardware and 10th gen Intel Core hardware”.
Something’s wrong, either in your description or in Raptor’s, which claims that the 8-core P9 does have unpaired L2 and L3 caches. See https://www.raptorcs.com/content/CP9M32/intro.html. So my guess is that either your picture of the P9 is wrong or their description is. Am I right?
You are: I counted from a diagram I’d messed up earlier. I’ll fix it :)