
Bug #5808

Corrupted FreeNAS filesystem

Added by Richard Kojedzinszky over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Nice to have
Assignee:
Xin Li
Category:
-
Target version:
9.2.1.8-RELEASE
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:
ChangeLog Required:
No

Description

On many of our FreeNAS systems, I found similar file system errors:

# fsck -t ufs -n /dev/ufs/FreeNASs2a
** /dev/ufs/FreeNASs2a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
?/__init__.pyo IS AN EXTRANEOUS HARD LINK TO DIRECTORY /usr/local/lib/python2.7/site-packages/django/contrib/admindocs/locale/vi
REMOVE? no

BAD TYPE VALUE I=225218 OWNER=root MODE=40755
SIZE=512 MTIME=Aug 6 19:18 2014
DIR=?/__init__.pyo

UNEXPECTED SOFT UPDATE INCONSISTENCY

FIX? no

?/testmod.py IS AN EXTRANEOUS HARD LINK TO DIRECTORY /usr/local/lib/python2.7/site-packages/django/contrib/admindocs/locale/vi/LC_MESSAGES
REMOVE? no

BAD TYPE VALUE I=225219 OWNER=root MODE=40755
SIZE=512 MTIME=Aug 6 19:31 2014
DIR=?/testmod.py

UNEXPECTED SOFT UPDATE INCONSISTENCY

FIX? no

BAD INODE NUMBER FOR '..' in DIR I=225216 (/usr/local/lib/python2.7/site-packages/django/contrib/admindocs/locale/de/LC_MESSAGES)
CURRENTLY POINTS TO I=173474 (?), SHOULD POINT TO I=223610 (/usr/local/lib/python2.7/site-packages/django/contrib/admindocs/locale/de)
FIX? no

** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
LINK COUNT DIR I=223610 OWNER=root MODE=40755
SIZE=512 MTIME=Aug 6 19:18 2014 COUNT 3 SHOULD BE 2
ADJUST? no

LINK COUNT DIR I=225218 OWNER=root MODE=40755
SIZE=512 MTIME=Aug 6 19:18 2014 COUNT 3 SHOULD BE 4
LINK COUNT INCREASING
UNEXPECTED SOFT UPDATE INCONSISTENCY

ADJUST? no

LINK COUNT DIR I=225219 OWNER=root MODE=40755
SIZE=512 MTIME=Aug 6 19:31 2014 COUNT 2 SHOULD BE 3
LINK COUNT INCREASING
UNEXPECTED SOFT UPDATE INCONSISTENCY

ADJUST? no

LINK COUNT FILE I=225220 OWNER=root MODE=100644
SIZE=2013 MTIME=Dec 12 20:37 2013 COUNT 1 SHOULD BE 2
LINK COUNT INCREASING
UNEXPECTED SOFT UPDATE INCONSISTENCY

ADJUST? no

** Phase 5 - Check Cyl groups
39457 files, 1360541 used, 2169791 free (3175 frags, 270827 blocks, 0.1% fragmentation)

The only thing these systems have in common is that FreeNAS is installed on a SanDisk Cruzer Fit USB stick on all of them. Could the hardware be that buggy?
Could someone else check their filesystem?

Associated revisions

Revision 9cc691ab (diff)
Added by Xin Li about 6 years ago

Don't use conv=sparse when we are using small blocks. The boot file system is mostly full, so the optimization gains little, and skipping zeroed areas may leave stale data behind; when that stale data covers metadata, it can confuse the file system. A better solution would probably be newfs+restore, which writes only the areas that matter, but since we are moving to ZFS, just take the conservative approach in this branch for now. Ticket: #5808
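
For illustration, the change amounts to dropping conv=sparse from the image-copy step. A minimal sketch, with hypothetical image and device names (the actual diff is in revision 9cc691ab):

    # Before: sparse copy; all-zero 64k blocks in the image are skipped,
    # leaving whatever bytes were already on the target partition.
    dd if=firmware.img of=/dev/da0s1 bs=64k conv=sparse

    # After: plain copy; every block of the image is written out.
    dd if=firmware.img of=/dev/da0s1 bs=64k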

History

#1 Updated by Richard Kojedzinszky over 6 years ago

Unfortunately, my FreeNAS in a virtualized environment has the same issue, so I can rule out the USB sticks for now.

Any ideas?

#2 Updated by Jordan Hubbard over 6 years ago

  • Category set to 21
  • Assignee set to Xin Li
  • Target version set to 49

That's really weird (that it would happen in a VM, not just on a USB stick). Are there any properties of the UFS filesystem format / options we're using that could cause fsck to report such things perhaps erroneously?

#3 Updated by Richard Kojedzinszky over 6 years ago

It is the standard FreeNAS filesystem; as I saw in nanobsd.sh, it is UFS1 with soft-updates. Today I upgraded the firmware on a FreeNAS box; after booting, its root filesystem is clean. I will check it for errors periodically.

Today's case was that I saw core dumps from the periodic cron job alert.py. Then I issued a find /usr, and it terminated with some kind of fts_read() error, possibly coming from the kernel. The very strange thing is that this symptom disappeared after 10-20 minutes, without a restart or an fsck of the filesystem. It is unbelievable to me as well.

I will report if the freshly upgraded NAS gets any errors.

Regards,

#4 Updated by Xin Li over 6 years ago

  • Status changed from Unscreened to Screened

This really looks like something happened to the hardware.

Note that you should never run fsck on a read-write mounted file system. Typical usage is fsck -B -p, which takes a snapshot and is capable of writing the fixes back afterward.
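
For reference, against the device from the report above, that invocation would be something like:

    # Background check: -B snapshots the live filesystem, -p preens it
    # (applies routine fixes non-interactively).
    fsck -B -p /dev/ufs/FreeNASs2a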

#5 Updated by Richard Kojedzinszky over 6 years ago

Actually, FreeNAS's root is read-only, so I may run a read-only fsck on it, may I not? The only thing I've done to it is remount it read-write, replace some .py scripts in the root, and remount it read-only again. Maybe the filesystem got damaged somewhere in that process?

#6 Updated by Richard Kojedzinszky about 6 years ago

The problem still exists.

I could reproduce the problem in a VM, which excludes the possibility of a hardware bug.

You can follow the procedure here:

http://pastebin.com/pSz9qmiT

I checked the embedded firmware.img first and confirmed that it does not contain filesystem errors.
Then I ran the install procedure, which copies the image, mounts it, makes some modifications, and unmounts it. Afterwards, checking the new partition with fsck again shows errors; the steps are sketched below.

I suspect that this is not the intended behaviour.
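
In outline, the reproduction looks something like this (device and image names here are hypothetical; the pastebin above has the exact commands):

    fsck -t ufs -n firmware.img          # embedded image: clean
    # ... run the install/upgrade, which dd's the image onto the target
    # partition, mounts it, edits a few files, and unmounts it ...
    fsck -t ufs -n /dev/ufs/FreeNASs2a   # freshly written partition: errors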

#7 Updated by Richard Kojedzinszky about 6 years ago

Could it be that the 'conv=sparse' option to dd causes some misbehaviour?

I ran the bin/updatep1 script by hand and saw that an fsck run just after the dd reports errors. Then I ran the same dd without conv=sparse, and fsck shows everything is fine; the comparison is sketched below.

See http://pastebin.com/AdYJcSm0 for how it can be reproduced.
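
The comparison boils down to the following (hypothetical device name; the target partition is assumed to still hold non-zero data from a previous install):

    dd if=firmware.img of=/dev/da0s1 bs=64k conv=sparse
    fsck -t ufs -n /dev/da0s1     # reports errors

    dd if=firmware.img of=/dev/da0s1 bs=64k
    fsck -t ufs -n /dev/da0s1     # clean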

#8 Updated by Richard Kojedzinszky about 6 years ago

The original nanobsd update scripts don't add the conv=sparse flag to dd.

I suggest reverting b7adc3aa6e0e5d2ea030e4a8cffbfd01b0e45653.

#9 Updated by William Grzybowski about 6 years ago

It is not causing any harm.

Plus, this is all a moot point, since 9.3 will use ZFS for the root fs.

#10 Updated by Richard Kojedzinszky about 6 years ago

I could produce problems. And in theory, if a file consisted of 128k of zeros, it would not be written at all, and in the new filesystem it would contain whatever data the medium held before. I think a filesystem image should be an exact copy every time. At least that is what my experience shows.

#11 Updated by William Grzybowski about 6 years ago

Why is that a problem? The filesystem is supposed to be read-only.

#12 Updated by Richard Kojedzinszky about 6 years ago

Maybe I was not clear; let's start again.

The upgrade procedure writes the whole filesystem image to the media, skipping block-sized runs of zeros. What if those zeros are part of a file? That block on the media simply never gets overwritten, so it keeps whatever data was there before. And of course this happens; that's why I have a corrupted filesystem right after a fresh upgrade.

Basically, using conv=sparse to dump a filesystem image means that every block-sized run of zeroes in the image is left unwritten on top of whatever filesystem was there before. Do you agree that this could lead to corruption? The effect can be demonstrated with plain files, as sketched below.
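
A minimal demonstration with regular files instead of devices (file names are arbitrary):

    dd if=/dev/urandom of=target bs=64k count=16     # simulate stale on-disk data
    dd if=/dev/zero of=image bs=64k count=16         # image made entirely of zeros
    dd if=image of=target bs=64k conv=sparse,notrunc # sparse copy skips the zero blocks
    cmp image target                                 # differs: target still holds old data

Without notrunc the old contents would be truncated away first; on a real disk partition there is no truncation, so the skipped blocks keep their stale contents, exactly as in the upgrade case.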

#13 Updated by Xin Li about 6 years ago

  • Status changed from Screened to Resolved
  • Target version changed from 49 to 9.2.1.8-RELEASE
