Wednesday, September 06, 2006

Work Around with Bad RAM

Wed Feb 2 07:12:38 2005 Badram, Badmem, and Memtest86.bin Posted by Drag
Search Keys: BadRAM BadMEM Linux kernel memtest memtest86
Referencing: http://rick.vanrein.org/linux/badram/

While I was messing around on my computer, I noticed that every once in a while a program would just up and die for seemingly no reason. Then I noticed that when compiling big jobs, GCC was segfaulting an awful lot.

So I know that when almost-random stuff goes wrong like that, and you know that you're using what should be a fairly stable OS, the likely culprit is flaky hardware. And out of flaky hardware, the thing I hate the most is bad RAM modules, so naturally that's what was most likely wrong.

Since it was a Debian machine, I downloaded and installed memtest86 by typing into the console: apt-get install memtest86

Memtest86 is a very nice memory testing program for x86 machines. If you have a problem with memory hardware then this guy will find it. It'll check the L1 and L2 cache, it will check your memory modules and anything else that ends up as 'RAM' in your system.

How it works is that it boots up your computer, finds all the available memory, and then uses different patterns of bits and copies them from memory address to memory address in different fashions. These are 'tests' and it performs several of them on your computer. It takes a while to complete the entire battery of tests, and once it's finished it simply starts over again at test one. It's best to let it run for a few hours because memory problems can be very intermittent.

If it finds any errors then it will tell you what memory range the test failed at.
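
To give a rough idea of what a pattern test looks like, here is a minimal Python sketch. It is not memtest86's actual code: the patterns, the FlakyMemory class, and the stuck bit at address 0x68 are all made up for illustration, and a bytearray stands in for real RAM.

```python
def pattern_test(memory, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read it back, and collect mismatches."""
    failures = set()
    for pattern in patterns:
        for addr in range(len(memory)):
            memory[addr] = pattern
        for addr in range(len(memory)):
            if memory[addr] != pattern:
                failures.add(addr)
    return sorted(failures)


class FlakyMemory(bytearray):
    """Hypothetical stand-in for a bad module: bit 2 of one cell is stuck at 0."""
    BAD_ADDR = 0x68

    def __setitem__(self, addr, value):
        if addr == self.BAD_ADDR:
            value &= 0xFB  # clear bit 2 to simulate the stuck bit
        super().__setitem__(addr, value)


bad = pattern_test(FlakyMemory(256))
print([hex(a) for a in bad])  # the failing addresses
```

Note that the all-zeros pattern passes even on the bad cell, which is one reason a real tester runs many different patterns before calling the memory good.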

Memtest86 is GREAT if you're building a new computer and need to test the RAM. This is especially important with AMD64 machines and their touchy on-cpu-die built-in memory controllers. Sometimes reseating the RAM can fix problems, some memory sticks work in some motherboards and not others, sometimes simply moving the sticks to different slots will fix problems, and other times you need to underclock the machine to get it stable. Often you just get bad RAM and it needs to be replaced.

Well when apt-get installed memtest86 it copied it to my /boot directory and called it "/boot/memtest86.bin", then it modified my /boot/grub/menu.lst grub configuration file and added this entry:

title Debian GNU/Linux, kernel memtest86
root (hd0,0)
kernel /memtest86.bin
savedefault
boot

That way I could simply reboot, select the memtest86 entry in my grub boot-time menu and then the program would run.

However, this won't work for all machines. There are several ways to run memtest86. For Windows machines you can use a floppy image and make a bootable floppy with dd or rawrite. There are also ISO images you can use to make bootable CD-ROMs.

All this and very good documentation can be obtained from the memtest86 homepage.

So I rebooted the desktop, selected the memtest86 entry and let it run for a couple of hours.

As it turns out, the main node had a clearly bad section of RAM! Now this sucked, because I had a gig of RAM in that machine, and to fix it the normal way I would have to toss a 512 MB memory module (any warranty on it is long gone), and that's pretty expensive for me to do.

(That'll teach me to be sure to use an anti-static grounding bracelet in the future when I assemble machines.)

Normally this would be my only choice, but with Linux there are a couple of tricks you can do to get a perfectly stable machine out of a RAM module that clearly has one bad section and nothing else wrong with it. I wouldn't do it with a production server, but for my little home desktop it doesn't make much of a difference. Plus it was just a small section that was bad, with no other issues as far as I could tell.

A couple of the tricks revolve around kernel patches called BadRAM and BadMEM. Out of the two, BadMEM provides a lot of features and such, but BadRAM seemed simple and 'good enough'. (BadMEM was originally based off of BadRAM).

Basically, how they work is that they take the bad section of RAM and make it part of protected kernel memory space. This makes sure that no programs will accidentally access it, and that particular section of the RAM module might as well not exist. It's a surprisingly effective and safe fix, and it only adds a couple dozen lines of code to the kernel.
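
The idea can be sketched with a toy page allocator in Python. This is only an illustration of the concept, not the kernel's actual data structures: ToyPageAllocator, the 8-page machine, and the bad address 0x3568 are all hypothetical, and I'm assuming 4 KiB pages.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class ToyPageAllocator:
    """Hypothetical allocator that never hands out pages containing bad addresses."""

    def __init__(self, total_pages, bad_addresses=()):
        # Any page containing a bad address is reserved for good at startup.
        self.reserved = {addr // PAGE_SIZE for addr in bad_addresses}
        self.free = [p for p in range(total_pages) if p not in self.reserved]

    def alloc(self):
        """Return the next free page number; reserved pages never appear."""
        if not self.free:
            raise MemoryError("out of pages")
        return self.free.pop(0)


allocator = ToyPageAllocator(total_pages=8, bad_addresses=[0x3568])
handed_out = [allocator.alloc() for _ in range(7)]
print(handed_out)  # page 3 is skipped
```

You lose the whole page even if only a few bytes in it are bad, but on a machine with a gig of RAM a few kilobytes is nothing.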

The downside is that if it is a section of RAM that is naturally occupied by the kernel at boot time, then you're probably SOL, because it will corrupt the kernel and probably make your system unbootable. Sometimes you can work around it by moving the memory sticks around, or by building a very small kernel with lots of modules instead of built-ins.

So I rebooted back into Linux, downloaded the patch for my specific kernel, built it (took a couple of tries) and rebooted into memtest86. My particular version (not sure if it's part of all memtest86 versions) has the ability to change its error output from simply stating the affected memory range to printing it in a form that I can easily use as parameters for a BadRAM-patched kernel.

After about 5 minutes of running, memtest86 spit out: badram=0x13495568,0xfffffffc
then
badram=0x13495568,0xfffffff,0x13495568,0xfffffffc
then
badram=0x13495568,0xfffffff,0x13495568,0xfffffffc,0x13495568,0xfffffffc
and so on and so forth. I let it run for another 45 minutes or so, but it didn't report any other bad sections so I rebooted.
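
To make sense of those numbers: as I understand the BadRAM documentation, the values come in address,mask pairs, and an address counts as bad when it agrees with the first value on every bit that is set in the mask. Here is a small Python sketch of that reading. The function names are my own, and the 4 KiB page size is an assumption.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, typical for 32-bit x86


def parse_badram(arg):
    """Split 'badram=a1,m1,a2,m2,...' into a list of unique (address, mask) pairs."""
    values = [int(v, 16) for v in arg.split("=", 1)[1].split(",")]
    if len(values) % 2 != 0:
        raise ValueError("expected address,mask pairs")
    pairs = []
    for pair in zip(values[0::2], values[1::2]):
        if pair not in pairs:  # the same pair can show up repeatedly; keep one copy
            pairs.append(pair)
    return pairs


def is_bad(addr, pattern_addr, mask):
    """An address matches when it agrees with the pattern on every 1 bit of the mask."""
    return (addr & mask) == (pattern_addr & mask)


pairs = parse_badram("badram=0x13495568,0xfffffffc,0x13495568,0xfffffffc")
pattern_addr, mask = pairs[0]
page_start = (pattern_addr // PAGE_SIZE) * PAGE_SIZE
print(len(pairs), hex(page_start))
print(is_bad(0x1349556B, pattern_addr, mask), is_bad(0x1349556C, pattern_addr, mask))
```

With a mask of 0xfffffffc only the low two bits are free, so the pattern covers just the four bytes from 0x13495568 to 0x1349556b, all of which sit in the page starting at 0x13495000.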

In grub I hit "e" to edit my menu entry, selected the kernel line, hit "e" again, then modified my kernel entry from this:
kernel /vmlinuz-2.4.22-1.2199.nptl-ssi-686-smp devfs=mount hdb=ide-scsi hdc=ide-scsi root=/dev/hda2 ro
to look like this:
kernel /vmlinuz-2.4.22-1.2199.nptl-ssi-686-smp devfs=mount badram=0x13495568,0xfffffffc hdb=ide-scsi hdc=ide-scsi root=/dev/hda2 ro

I hit 'return' and then 'b' to boot. Once it booted up I made the change permanent by editing my boot config at /boot/grub/menu.lst and now I have a perfectly stable machine once again.

I figure that this would be especially useful for older machines that you may use for a firewall, a simple e-mail server, or something like that, and that may have become unstable due to memory errors. Or maybe if you have an Intel Pentium III (or was it 4?) that uses the RAMBUS-style RAM that is incompatible with the much more common (and cheap) SDRAM or DDR SDRAM types.

February 2, 2005

Please note that there are two sites featuring a memory test utility known as memtest: the one you mentioned, http://www.memtest86.com, and its cooperative competitor, memtest86+, at http://www.memtest.org. The latter is based on the former, with improvements and bug fixes that are released more frequently. The current version of memtest86+ is 1.50, released in Jan '05. Both memtest86 and memtest86+ are OS independent. You should mention that the BIOS memory test offered by most PCs is virtually worthless at detecting bad memory, unless the memory is completely absent. Also note that other Unixes have their own methods of mapping available memory at boot time. SCO OpenServer allows you to use the mem= option at the boot: prompt to include or exclude ranges of memory for use by the system. And finally, many quality brands of memory come with lifetime warranties. The definition of "lifetime" varies by vendor, with Kingston in my opinion having the most liberal policy of them all: if it fails, they'll replace it, even if the product is obsolete. No proof of purchase required. Amazingly enough, I've never had to make use of their warranty after all these years. On the other hand, every single piece of no-name SDRAM we purchased 4 years ago failed within 18 months.

Bob

February 2, 2005

How stable have things been, since you have mapped the badram? The Kingston memory policy of replacing, no questions asked, is very nice. I will certainly consider Kingston when I purchase my next bunch of RAM, which is coming up shortly, as I piece together another server for home.

BruceGarlock

February 3, 2005

Things are pretty stable. I haven't noticed any more issues with things crashing randomly, and I've been playing around with scripts and stuff that do a good job of thrashing the CPU and file system. Also I ran a memtest utility that was installed as part of Debian's sysutils package and it ran for a couple of hours without finding any problems. I don't think it does as good a job as memtest86, but it runs on the OS, so I can test the memory with the badram mapping working. If I had the chance of changing out the memory or sending it back to the manufacturer, I would much prefer that. But I don't even remember where I bought the stuff from. I never tried anything like mapping around errors before, but it seems to work pretty well so far. On D. J. Bernstein's homepage at http://cr.yp.to/hardware/ecc.html he talks quite a bit about the advantages of using ECC RAM to protect against errors, and I think that I agree with him for the most part. Especially with these AMD64 machines, as I seem to have more issues with them for some reason (could just be me). With memory sizes going up past 1-2 gigs, that's a lot that can go wrong. Also, as a side note, he has good advice on building a high quality but inexpensive workstation at http://cr.yp.to/hardware/advice.html.

Drag

February 4, 2005

Hmmm. Interesting links. Although this is off-topic, I am hopefully going to build a SCSI HD enclosure this weekend, with parts from http://www.scsisource.com/scsi_enclosure_cables/ - I plan on writing an article. I am using an old tower to house the SCSI drives, and it is basically stripped of everything but the power supply and motherboard (without CPU). I have a bunch of drives that I retired after upgrading to larger drives. Most of them are 9.1GB U160 drives, so I would like to put them together in a software RAID-0 array, and use it as a staging area for video editing and images. After that project, I plan on building another tower, based on SCSI drives and an AMD CPU. It looks like I will try to purchase Kingston memory after reading about their warranty and no-hassle policy above. I am currently searching for the perfect motherboard. I would like to find something that has plenty of PCI slots, and I would prefer at least one slot to be 64-bit, since my Adaptec card supports 64-bit PCI slots (backwards compatible with regular 32-bit slots). I know SCSI is still expensive, but the performance is stellar, and the fact that you can add so many drives to one controller is a plus. I have not decided on a Linux distro yet. I may try Xandros, which is Debian based, or Fedora Core 3. I may plan on running some commercial software on it, like BackupEdge, MainActor (video editing, although Kino is really coming along nicely), and possibly a few other commercial apps. I have yet to get my feet wet with Debian. I have heard great things about it. I wonder what distro Linux runs? Maybe it's all homegrown for him. I can see him running something like Gentoo. Good thing your system is now stable. My current tower is in need of a new CPU fan. I would wake up every morning to the CPU alarm, from it being overheated. Good thing it is winter. I have the tower in my basement, where it is about 15-20 degrees cooler than the rest of the house. Currently, I have the cover open, and the tower at a 45 degree angle, because for some reason the fan will only operate at that angle :-) Looks pretty bad, but it works. I don't know why I just don't replace the fan - they are not expensive :-) Maybe this weekend, if I need to run out to get any spare parts for my other project :-)

BruceGarlock

February 4, 2005

You mean what distro Linus runs? I think that last time I heard he was using Fedora. But then somebody said that he was using a PowerPC dual-G5 setup, and Fedora's PPC port isn't that hot. I would guess Fedora if push comes to shove. Doesn't really matter. Personally I LOVE Debian. It's the best, in my opinion. But I am open to other ones. Also FreeBSD is nice.. actually the documentation is VERY nice compared to most Linux setups. I use Debian Testing (also called Debian Sarge) on both my main machines. I have an iBook that runs the PPC version of Debian and my main desktop is now a cluster. :P It's an OpenSSI-based cluster based on Debian Testing with a heavily patched Fedora Core 1 kernel. It's made up of three machines, but I only really use 2 right now. The main node, called "spock", has the 1 gig of RAM that I talked about in the report above. It has a 2400+ AMD processor, 1 gig of DDR RAM, an 80 gig hard drive and 2 NICs. The motherboard is a Biostar (bleh, I don't like that brand so much, next time I'm getting Asus) Via KT600 chipset based setup. The secondary node is called "alabama" and has the same CPU and motherboard with 256 megs of RAM. The hard drive setup is kinda unique. This is all very experimental for me, so I am learning as I go. It has 3 120 gig 7200rpm hard drives. 1 is an older WD drive with 8 meg cache that goes through the onboard ATA100 (or ATA133?) controller. The other two attach through a SATA (ATA150, I believe) SiS PCI to IDE adapter. They are set up in a software RAID 5 array, and on top of that I run an LVM volume group. Before setting up the LVM stuff I ran the array as one big ext3 formatted file system to see what performance advantage it would have over a single drive. Since the computer was running on a shared ROOT partition on the other computer, I was free to format and try different setups with different block sizes and ext3 RAID optimizations.

Basically the performance advantage for read/write performance is non-existent over a single drive, unfortunately. And it actually may be slightly worse. Although one hard drive has a very different interface (PATA vs SATA) from the other two, they are still your basic 8 meg cache, 7200rpm, 120 gig drives and individually perform very close to one another. (The WD is actually in between the two Maxtor drives. They were both bought at the same time, from the same place, and are the same model. So it goes to show you the variations in production.) I think there is a potential performance increase, but it's negated by the overhead from parity...

As for RAID 0, I didn't try that. But I've read at storagereview.com and anandtech.com that there is very, very little if any performance advantage for just 2 drives. Now this is all generic IDE drives running on cheesy software-driven proprietary drivers with onboard controllers that Windows 'overclocker' types use, but I don't know how a software-based RAID 0 setup with just two drives would do... Now I'm definitely no expert, and this is just my personal experience (and very limited at that), but I think that going with a RAID 0 array may be a mistake. I'd try it out and benchmark it and such. But I don't think that it's going to be very nice.

Now I'd understand that if you're working with large files you'd want all that disk space as scratch space, but you could get the same thing with Linux's LVM stuff. It's fairly risky, because you're not going to get any redundancy, and if one drive blows out then your information on both drives is probably worthless after that, but it's the same thing with RAID 0. The advantage, as I see it, of using LVM over RAID 0 is that it would be very easy to add and subtract extra drive space. You can resize partitions and such, and make them span across multiple drives and RAID arrays. So if you're working on something that needs more disk space than you have, you can just run down to the store and buy an extra 250 gig drive and slap it in. Then you modify the volume group to include the new drive, or a partition off of the drive, and then expand your logical volume (your 'partition') to fill up the newly added space. This isn't safe, of course, from one of your drives failing, but it's perfectly fine for situations where maximum disk space is preferable over redundancy and such.

You've mentioned Kino and MainActor.. There is an application that I've been meaning to play around with, but I haven't been able to because I don't have a DV camera and whatnot (need to borrow one from my brother). It's called Cinelerra. It's designed to be a "professional" level non-linear video editor. It seems very nice, but it's very unfriendly to new users. They tell you point blank on the website that if you're looking for an application for editing your home movies, you're looking in the wrong place. It's at: http://heroinewarrior.com/cinelerra.php3 From what I can tell it's very effective though. They have some demo movies you can download. They are in a special QuickTime format that they've developed specifically for this application, so you need to download their special video player and codec, but they have a motorcycle one that is very nice looking... Well, at least to me. (I have no experience in these matters.) It's a full 60fps and they have some very slick editing things going on. All very nice looking with no visual artifacts or blemishes from what I can tell. Smooth.

They also have some other applications to go along with it that may be useful for things other than just video editing. They have a thing called 'firehose' that combines multiple network interfaces into a special purpose data pipe. You can't really network over it; it's just for moving data from one computer to another very fast. They also have the ability to support special purpose render nodes to help take the load off of the main computer when doing effects and such. They say that it's capable of running real time effects on HD-sized video, which I think seems pretty fantastic. This company sells workstations and clusters and such that run Cinelerra and can support many different applications. It's at: http://www.lmahd.com/cinelerra.html VERY expensive. But I think it's worth looking at the hardware they are using. Mostly IBM stuff, but a person that can build their own computer can probably put together something similar. Of course this is all for state of the art stuff. I don't expect normal TV resolutions to need so much grunt work.

Now remember I am pretty ignorant about video editing and such, and I don't even own a camera, but this is something I would really like to play around with. For Debian they have pre-compiled binaries that work, and they have some for other distributions. If you want to use MainActor, you could probably make it work in any distro with some massaging, but they have specific versions for Mandrake and SuSE, I think. But most any distro will work fine; it's just up to your personal preference as far as I can tell (which isn't very far). Also there are many other Linux applications for this purpose. Apple has "Apple Shake" that has a Linux version; however, the Linux version is 5000 dollars vs the OS X version, which is 3000 dollars. Then there is "Smoke", which is some absolutely horrifically expensive setup that includes a special IBM/Linux machine. And there are 3-4 others that are much more reasonably priced, which I can't remember right now. MainActor and Kino look more like my speed right now though. ;) (MainActor has a demo version for Linux you can try out, btw. But personally I prefer Free applications whenever possible.)

Drag
