16 March 2010

System monitoring beyond GNOME's System Monitor

I've been having problems with my quad-core computer lately. I noticed that 2 out of 4 logical cores were misbehaving, i.e. they were not scaling their frequency up to meet the load I was putting on them and there were also fishy temperature readings.

You may ask yourself how to monitor things like load, frequency and temperature on a Linux box. The answer isn't all that complicated. If you want a graphical program, there are many: sensors-applet for GNOME's panels, conky and gkrellm for your desktop and probably many others I don't know about. All of them need some level of setting up, so please look at the relevant docs on the tools' websites.

What about console tools? The main tools here are lm-sensors, hddtemp and acpi. For experienced users, the console is usually simpler, faster and, more importantly, more precise. Setting up lm-sensors is simple. Running the following and pressing enter a bunch of times will tell you which drivers (modules, actually) you need to load so that the sensors can be read reliably.
$ sudo sensors-detect
...
To load everything that is needed, add this to /etc/modules:

#----cut here----
# Chip drivers
it87
coretemp
#----cut here----

Do you want to add these lines automatically? (yes/NO)
My computer has an Intel Core 2 Quad with embedded on-die sensors (module coretemp) and a motherboard based on the Intel P45 chipset; the it87 module drives the board's ITE sensor chip.

Adding the modules to the mentioned file will cause them to be loaded at every system start, but we want to read stuff ASAP!
$ sudo modprobe coretemp
$ watch sensors
The watch command is useful here as it re-runs the given command at a fixed interval. The default is 2 seconds, enough for my purposes.
Every 2.0s: sensors                                     Mon Mar 15 17:43:13 2010

ERROR: Can't get value of subfeature temp1_input: Can't read
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +0.0C  (high = +82.0C, crit = +100.0C)  ALARM

coretemp-isa-0001
Adapter: ISA adapter
Core 1:      +37.0C  (high = +82.0C, crit = +100.0C)

coretemp-isa-0002
Adapter: ISA adapter
Core 2:      +58.0C  (high = +82.0C, crit = +100.0C)

coretemp-isa-0003
Adapter: ISA adapter
Core 3:      +36.0C  (high = +82.0C, crit = +100.0C)
Definitely something amiss with the CPU or the LGA775 socket with its bend-prone pins. Please note that an Intel stock cooler is installed for service-personnel-excuse-finding-avoidance purposes.
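
If you'd rather not eyeball the whole output every time, a small filter can pick out the dodgy readings. This is only a sketch: the function name flag_cores is made up, and it assumes the lines look like the "Core N: +T" lines above (a degree sign in front of the C doesn't hurt). It reports cores that sensors itself marked ALARM or that read exactly zero.

```shell
# Read `sensors` output on stdin and print the cores whose reading looks wrong:
# either marked ALARM by sensors itself or reporting exactly 0 degrees.
# flag_cores is a hypothetical helper name, not part of lm-sensors.
flag_cores() {
    awk '
        /^Core [0-9]+:/ {
            temp = $3
            gsub(/[^0-9.]/, "", temp)    # strip "+", degree sign and "C"
            if (/ALARM/ || temp + 0 == 0)
                print "suspicious: Core " $2 " " temp
        }
    '
}

# Usage:
#   sensors | flag_cores
```

On the output above this would single out Core 0 only.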

One more thing to monitor is the CPU frequency. Modern CPUs need to be green, so features originating in laptops came to the desktop, specifically frequency scaling based on CPU load. A good way of watching this is:
$ watch grep MHz /proc/cpuinfo
There is a ton of information about the CPU cores in that file and getting just what we need out of it is just one scalpel^W grep away!
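
The plain grep doesn't say which line belongs to which core. A bit of awk can pair each frequency with its logical CPU number. Take it as a rough sketch: core_freqs is a name I made up, and it assumes the usual x86 layout of the file, where every "processor" line is followed further down by a matching "cpu MHz" line.

```shell
# Read /proc/cpuinfo-style text on stdin and print one "cpuN <freq> MHz"
# line per logical CPU. core_freqs is a hypothetical helper name.
core_freqs() {
    awk -F': *' '
        /^processor/ { cpu = $2 }                            # remember the current logical CPU
        /^cpu MHz/   { printf "cpu%s %s MHz\n", cpu, $2 }    # report its current frequency
    '
}

# Usage, refreshing every 2 seconds much like watch does:
#   while :; do clear; core_freqs < /proc/cpuinfo; sleep 2; done
```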

Now we need to generate load on the processor cores. The benchmarking tools on Linux aren't as easy to use as those on Windows, but many of them are quite interesting. I like the Phoronix Test Suite, which many websites use for testing on Linux and which encompasses so many tests that you really need to pick and choose. A good way of seeing what the results should be is comparing your own to other systems on Phoronix Global. One of the better tests in this suite is sunflow. It uses Java and is parallelized, which means that it uses all the CPU cores it can find to solve a problem.
$ phoronix-test-suite benchmark sunflow
During this test the things to watch are the frequency and perhaps the temperatures. I noticed that my frequencies didn't scale:
Every 2.0s: grep -i MHz /proc/cpuinfo                   Mon Mar 15 18:03:52 2010

cpu MHz         : 1600.000
cpu MHz         : 2400.000
cpu MHz         : 1600.000
cpu MHz         : 2400.000
Definitely something wrong: two of the four cores stay at 1600 MHz under full load. To further analyse these peculiarities, an unparallelized test with short runs is needed. One such test is java-scimark2, a collection of mathematical algorithms. Here are the results on my faulty system:
$ phoronix-test-suite run java-scimark2

========================================
Test Configuration: Java SciMark
========================================


Computational Test:

1: Composite
2: Fast Fourier Transform
3: Jacobi Successive Over-Relaxation
4: Monte Carlo
5: Sparse Matrix Multiply
6: Dense LU Matrix Factorization
7: Test All Options

Enter Your Choice: 2

Would you like to save these test results (Y/n)? n

========================================
Estimated Run-Time: 5 Minutes
========================================



Java SciMark:
      java-scimark2 [Computational Test: Fast Fourier Transform]
      Estimated Test Run-Time: 5 Minutes
      Expected Trial Run Count: 4
            Started Run 1 @ 18:07:15
            Started Run 2 @ 18:07:49
            Started Run 3 @ 18:08:26
            Started Run 4 @ 18:09:00
            Started Run 5 @ 18:09:35
            Started Run 6 @ 18:10:07
            Started Run 7 @ 18:10:46
            Started Run 8 @ 18:11:18

      Test Results:
            481.9675767405814
            465.3790214886224
            167.040859476713
            484.107875662264
            481.9675767405814
            482.10078089758196
            142.08542169663755
            491.33539814892885

      Average: 399.49 Mflops
I would expect all runs to come in at about 480 Mflops. However, two results (runs 3 and 7) pull the average down significantly. Other tests from the java-scimark2 collection confirm these results, so the CPU is definitely not operating as intended.
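
With more runs, picking out the bad ones by eye gets tedious, so here is a rough sketch of doing it with awk. The function name slow_runs is made up, and "slower than half of the fastest run" is an arbitrary cut-off I picked for illustration, not something the Phoronix Test Suite does. It expects one result per line on stdin.

```shell
# Read one Mflops result per line on stdin and print every run that is
# slower than half of the fastest run. slow_runs is a hypothetical helper
# name and the 50% threshold is an arbitrary choice.
slow_runs() {
    awk '
        NF == 0 { next }                                    # skip blank lines
        { v[++n] = $1; if ($1 + 0 > max) max = $1 + 0 }     # store the run, track the fastest
        END {
            for (i = 1; i <= n; i++)
                if (v[i] < 0.5 * max)
                    printf "run %d is slow: %.2f Mflops\n", i, v[i]
        }
    '
}

# Usage, with the numbers from "Test Results" saved one per line in a
# file (results.txt is just an example name):
#   slow_runs < results.txt
```

Fed the eight results above, it flags runs 3 and 7.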

I've also tried swapping the power supply and memory modules, to no avail. Fortunately, both the CPU and the motherboard are still covered by warranty.

And once these tools have become routine, there's the blog NIXCraft. It will just knock your socks off with the quantity of quality content. It really is ...simply the best! :)