[Lvlug] Re: Detecting new bad blocks on a disk while a system is running?

Keith Erekson kbe2 at Lehigh.EDU
Wed Dec 20 12:03:33 EST 2006


The tool you're looking for is smartd, the monitoring daemon from 
smartmontools (the same package also provides the smartctl command-line 
tool). In Debian, it's packaged as smartmontools.

You can configure this to check the SMART status of drives on a regular 
basis, which generally catches these things. You can also have it run 
longer self-tests periodically (weekly?) that should find these sorts of 
problems before they're really, really bad.
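
If you want to peek at the numbers yourself between smartd runs, here is 
a minimal Python sketch that pulls the bad-sector counters out of 
`smartctl -A` output. Treat it as an illustration, not the way smartd 
itself works: the device name /dev/sda, the helper names, and the choice 
of attributes to watch are my assumptions, and the SAMPLE text is 
made-up illustrative output, not from a real drive. Attribute names also 
vary somewhat between drive vendors.

```python
import subprocess

# Attributes commonly associated with bad or remapped sectors.
# Which ones a given drive reports is vendor-dependent (assumption).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")


def parse_attributes(text):
    """Return {attribute_name: raw_value} for watched attributes
    found in the text printed by `smartctl -A`."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten
        # columns; the raw value is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value not a plain integer; skip it
    return found


def check_device(device="/dev/sda"):
    """Run smartctl on a device (hypothetical default name; usually
    needs root) and return the watched raw values."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_attributes(result.stdout)


# Illustrative sample in the layout smartctl -A prints (made-up values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4521
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
```

Calling parse_attributes(SAMPLE) gives a small dictionary of counters; a 
nonzero and growing Current_Pending_Sector or Reallocated_Sector_Ct is 
generally the kind of thing worth acting on. On a real system you would 
call check_device() with your own device name instead.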

http://sourceforge.net/projects/smartmontools

-Keith

PS Sorry for the weird spacing below, I couldn't figure out how to make 
Thunderbird reply to the digest intelligently.

    I've had a question sort of floating around in the back of my head for a 
    while. I'll try to ask it; if nothing else, that may clarify my thinking.  (I 
    have reasons to ask the question, I'll probably mention those.)

    The question is something like this: if a Linux system is running, and some 
    part of the disk develops an error, what happens?  I mean is there anything 
    that a Linux system does while running that might detect a (new) bad area on 
    disk and then either report that to the user, or add it to the bad blocks 
    list and then readjust the layout of the files on the disk to avoid that new 
    bad area?

    The reason I ask is that a few times in the last 12 months, I've had my system 
    start to act funny in some way.  (ATM, I can't recall those occurrences 
    well enough to describe them--I'm sure I wrote about at least one to this list 
    describing some sort of (odd) symptoms and seeking help in troubleshooting 
    them.)

    In the two cases I know occurred (there might have been a third), after some 
    time either the system actually crashed or I intentionally or accidentally 
    rebooted, and then found that the system would not boot up.  On further 
    investigation, I typically found that one of my partitions was now bad (can't 
    remember exactly how I determined that), and I solved the problem by creating 
    a new partition out of unused space on the disk, then reinstalling as 
    necessary.

    Looking back, if Linux does nothing to look for bad blocks in a running system 
    (or if it does something but I've missed the error reports--maybe I need to 
    start looking at those  ;-) ), that might explain these occurrences.

    I was then going to ask how to prevent such occurrences.  If Linux does look 
    for these occurrences and reports them in one of the logs, I need to start 
    watching that log at least occasionally.  

    But if not, what kind of "maintenance" can be done on a continually running 
    system to detect such occurrences?

    And, if I find out that something like that has occurred, what should I do at 
    that point?   (Create a new partition and move the files from the old to the 
    new partition?)

    Does an fsck deal with this kind of occurrence?  Should I run fsck at regular 
    intervals (maybe most easily by rebooting)?

    Randy Kramer


-- 
Keith Erekson '03, '05
Senior Computing Consultant
Library & Technology Services
Lehigh University
keith at lehigh.edu
