Sunday, March 02, 2014

Fixing what fsck cannot fix

Before I start, let's get this out of the way...

WARNING: The tools and techniques described below, if misapplied, will give your file system the consistency of warm custard.  If you are not careful you can irreparably damage your data and lose files.  Be warned.

Next, back up your data, right now, so you don't have to try to salvage a busted file system.  It is a lot less stressful.
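One way to take that backup before any surgery is to copy the whole partition into an image file with dd.  This is only a minimal sketch: backup_image is a made-up helper name, the image path is invented, and the 64k block size is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: backup_image is a hypothetical helper, not a NetBSD tool.
# It copies a source (a device node or an ordinary file) into an image file.
backup_image() {
    src=$1
    dst=$2
    # bs=64k is an arbitrary block size; dd's own chatter is discarded.
    dd if="$src" of="$dst" bs=64k 2>/dev/null
}

# Example (device name as used later in this post, image path made up):
# backup_image /dev/rsd0a /home/me/stick-root.img
```

The image needs to live on a different disk from the one being repaired, and the file system being copied should not be mounted while dd runs.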

Now that we have those public service announcements out of the way...  It came to pass that I wanted to use my NetBSD on a stick, but it was having problems booting.  I booted from my HDD and ran fsck on the memory stick file system, which found a few errors, probably the result of an ungraceful shutdown.  One of the files it complained about was the kernel (/netbsd), which would explain the problems booting, but it also said that the /mnt directory was corrupted.  Fsck complained about missing . and .. entries, then claimed everything was fixed.  I have a habit of running a second fsck if the first one found errors and fixed them, just to be certain everything was caught.  Unfortunately for me, fsck still did not like /mnt and went through the same motions as the first time.  Not a good look.

I thought the easiest way might be just to remove the bad directory, so I mounted the memory stick and tried to rm the mnt directory on the stick.  Nope, rm said the directory was not empty.  I tried to cd into the directory and *kapow*: kernel panic due to an inconsistent file system (no surprise there).  So, what to do?  fsck wouldn't fix the error.  I could have rebuilt the stick, but I didn't want to spend the time doing that.

Enter the fsdb(8) command, which allows you to perform some low level manipulation of a file system.  It will let you do things to a file system that you are normally prevented from doing - in other words, you have the power to make a real mess if you do the wrong thing.  What I wanted to do was just force a removal of the broken mnt directory, which is quite easy to do.  My memory stick was identified as sd0 and the root file system on it was on the "a" partition, so I used fsdb to open the file system:

fsdb -f /dev/rsd0a

which printed a bunch of information about the file system and put me at a fsdb prompt.  To remove the bad directory I wanted to do two things: first clear the inode associated with the mnt directory, then remove the mnt entry from the parent directory.  To clear the inode we need to know the inode number.  There are quite a few ways to get it: "ls" inside fsdb will show the directory entries, which contain the inode numbers, and ordinary ls from the command line could be used too if the file system is mountable.  For me, I had the report from fsck that inode 10336 was corrupt and that it was the mnt directory.  I validated this information using the ls in fsdb.  So, knowing the inode, clearing it is a matter of using the "clri" command:

clri 10336
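As an aside on finding that inode number: if the file system can be mounted, the "ordinary ls" route might look like the sketch below.  inode_of is a hypothetical helper name and the mount point is made up.  Given that a plain cd into the broken directory was enough to panic my kernel, the fsck report or the ls inside fsdb is probably the safer source for a directory that is already known to be damaged.

```shell
#!/bin/sh
# Sketch only: inode_of is a hypothetical helper name.
# Prints the inode number of the path given as the first argument.
inode_of() {
    # -d lists the directory itself rather than its contents,
    # -i prefixes the entry with its inode number.
    ls -di "$1" | awk '{ print $1 }'
}

# Example (mount point made up):
# inode_of /mnt/stick/mnt
```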

fsdb confirmed that the inode was cleared.  The next step is to remove the directory entry from the parent directory (in my case this was /):

cd /
rm mnt

This invalidates the directory slot associated with the name given.  Once this was done I just quit fsdb:

quit

Fsdb prints a warning message that the file system has been marked dirty and that a fsck is required to clean up any damage.  Following that advice I ran:

fsck -y /dev/rsd0a

and let fsck clean up after the surgery.  The file system cleaned up with no major problems.  I mounted up the memory stick and copied a new kernel onto it since the old one appeared to be mangled.
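My habit of running fsck a second time can be wrapped up in a small script.  The sketch below is an assumption-laden illustration, not something I ran during this repair: repair_and_verify is a made-up name, and it assumes that fsck exits with status 0 when the read-only pass reports no problems, which is worth confirming against fsck(8) on your system.

```shell
#!/bin/sh
# Sketch only: repair_and_verify is a hypothetical helper.
# FSCK can be overridden to point at a different checker command.
FSCK=${FSCK:-fsck}

repair_and_verify() {
    dev=$1
    # Pass 1: answer "yes" to every repair question.
    "$FSCK" -y "$dev"
    # Pass 2: read-only check; -n answers "no" to every question,
    # so nothing is written to the file system.
    # ASSUMPTION: a zero exit status means no problems were reported.
    if "$FSCK" -n "$dev"; then
        echo "second pass clean"
        return 0
    fi
    echo "second pass still reports problems"
    return 1
}

# Example (device name as used in this post):
# repair_and_verify /dev/rsd0a
```

If the second pass keeps reporting the same problem, as it did for me with /mnt, that is the cue that fsck alone is not going to sort it out.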

After that NetBSD on a stick worked fine... much easier than recreating the whole thing from scratch.
Again, you really should not be running fsdb on a file system you care about.  You should have good backups and not need to resort to this level of skullduggery to recover, but fsdb is handy to know about when things go really bad.
