Wednesday, 1 October 2014

It is a bad plan that admits of no modification

I find it somewhat interesting that thousands of years later that our society still uses Publilius Syrus sententiae though I imagine the tendency to leave well enough alone means such phrases stay in usage.

Marvell ARM system - Photo from Steve McIntyre
One weekend Steve McIntyre asked me if I could find a source of some of some 40mm fans for some systems with some pretty strict requirements. They needed to be long life and shift a lot of air to combat a persistent overheating issue.

I sat with him and went through the Farnell utterly hateful parametric web interface and eventually came up with a couple of options which were very expensive. Only then did I stop and ask what the actual problem was.

Marvell ARM system Original internal cooling arrangement - Photo from Steve McIntyre
Steve showed me one of the Debian ARM buildd boxes which are Marvell development machines. These systems are powerful quad core machines housed in compact steel enclosures.

There is a single 40mm fan trying to provide cooling for the entire enclosure. When the units are placed horizontally and used intermittently this proves adequate. Unfortunately when the system are arranged vertically in a rack and run at full load continuously they often overheat and have to be restarted. In addition the small high speed fans need replacing frequently as their bearings wore out quickly.

Debian ARM buildd systems - Photo from Steve McIntyreThis was obviously causing some issues for the ARM Debian ports which Steve wanted to rectify. After talking the problem through for a while we came to the conclusion we could use much larger 60mm fans to blow air directly through the top of the case onto the cpu heatsink.

Larger fans can be run much more slowly to move a similar volume of air to the smaller 40mm fans which gives a much longer service life.

Hole punch and Drilling template
Steve proceeded to order enough parts to allow us to modify all the Debian systems, this worked out cheaper than a single "special" 40mm high volume fan.

I acquired a rather large steel hole punch, I chose this tool because it produces a much superior finish to a hole cutter and this project demanded a high level of finish (not to mention I loved having a valid excuse to own and use a huge allen key!)

If we had simply been modifying a single case I would have measured and marked up by hand. With the prospect of altering at least eight I laser cut a template from plywood which Andy Simpkins took great glee in excessively annotating.

We also used the opportunity to add bolt holes to securely attach the 2.5 inch SATA drives instead of using sticky pads.

Steve and I modified a single system to begin with both to check our alignment and the efficacy of the change. We were pleasantly surprised to discover that hoiby could now repeatedly do kernel compiles with all four cores flat out which was not possible before. The measured CPU temperature, which had previously been around 90°C, did not rise above 40°C

Steve and Andy on the assembly line
Steve, Andy and I then arranged a day where we took all the remaining units out of the rack at ARM, modified and returned them. We used the facilities at the Cambridge Makespace where I am a member to do the modifications.

I broke two 3mm drill bits and dulled a 4mm bit drilling all the holes, Roger Smith was good enough to loan us the use of his "Christmas tree bit" to ream the fan hole out to 16mm so we could thread the hole punch and cut the 60mm fan aperture out.

six modified systems ready to be re-racked.
We managed to get quite an assembly line going and, in my opinion, the results look pretty professional.

It has been several months since we did this work and these systems continue to run without issue. To complete the story we can see some graphs courtesy of the DSA munin instance.

CPU load on arnold.debian.org
You can clearly see the huge drop in temperature at the end of Week 25 despite the continuously high CPU load. Also there is only a single gap in the data after the changes (these indicate crashes where data was not recorded) where before there were frequent and extensive times where the systems were simply unusable.

CPU Temperature of arnold.debian.org
One reason I continue to enjoy Debian so much is the wide variety of ways in which I can contribute not only by maintaining my packages. Sometimes this kind of work does not receive the credit it deserves and hopefully highlights a small part of the frantic paddling that goes on under the serene surface of the Debian project to keep things "just working".