Bug #11091

System lost disks in pool after reboot

Added by Aleksey Svirikin about 5 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Important
Assignee:
Alexander Motin
Category:
OS
Target version:
Severity:
New
Reason for Closing:
Reason for Blocked:
Needs QA:
Yes
Needs Doc:
Yes
Needs Merging:
Yes
Needs Automation:
No
Support Suite Ticket:
n/a
Hardware Configuration:

HP Microserver Gen8

CPU: G1610T

HDDs:
Western Digital WD40EFRX
Seagate ST4000DM000
Western Digital WD1002FAEX
Western Digital WD10EARS

ChangeLog Required:
No

Description

I have two mirrors, 1+1TB and 4+4TB.
The system lost the second mirror after a reboot, twice. The second time was after a clean reinstall.
SMART looks fine.
It seems like the disks change their IDs on reboot, or the metadata was written incorrectly.
In alerts (useless, but copy-pasted): CRITICAL: "The volume Archive (ZFS) state is UNKNOWN:"

[root@freenas] ~# zpool status
  pool: Storage
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        Storage                                         ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/3c692f84-4763-11e5-aad8-3ca82a9f37fc  ONLINE       0     0     0
            gptid/3cd23123-4763-11e5-aad8-3ca82a9f37fc  ONLINE       0     0     0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors
[root@freenas] ~# zpool import
   pool: Archive
     id: 13013844282467277978
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-5E
 config:

        Archive                  UNAVAIL  insufficient replicas
          mirror-0               UNAVAIL  insufficient replicas
            9494490964606863191  UNAVAIL  corrupted data
            8825516877593544369  UNAVAIL  corrupted data

Associated revisions

Revision 0f77638e (diff)
Added by Alexander Motin about 5 years ago

Increase disk erase areas from 1+4MB to 32+32MB. According to report, our metadata erase areas may be insufficient. There is no known value that would be sufficient, but at least DDF specification tells about at least 32MB of metadata. Ticket: #11091

Revision 00cd6f0d (diff)
Added by Alexander Motin almost 5 years ago

Increase disk erase areas from 1+4MB to 32+32MB. According to report, our metadata erase areas may be insufficient. There is no known value that would be sufficient, but at least DDF specification tells about at least 32MB of metadata. Ticket: #11091 (cherry picked from commit 0f77638eef78a5b4b0eba1ad039fad2ef82471d0)

Revision 6205c1dd (diff)
Added by Alexander Motin almost 5 years ago

Increase disk erase areas from 1+4MB to 32+32MB. According to report, our metadata erase areas may be insufficient. There is no known value that would be sufficient, but at least DDF specification tells about at least 32MB of metadata. Ticket: #11091 (cherry picked from commit 0f77638eef78a5b4b0eba1ad039fad2ef82471d0)

History

#1 Updated by Aleksey Svirikin about 5 years ago

I faced this problem 3 or 4 times in a row (each time the disks were lost on every reboot) before I found a solution.
The solution is to wipe the whole disks with zeros.
It looks like the problem is with metadata at the end of the disk. Probably FreeNAS doesn't wipe the end of the disks on pool creation, where the old metadata was?
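The full-disk zero wipe described above can be sketched with dd. The snippet below uses a scratch file standing in for the disk; on the real system the target would be a device node such as /dev/ada2 (a hypothetical name here), and the wipe destroys everything on it:

```shell
# Scratch file standing in for a disk; a real wipe would target a device
# node such as /dev/ada2 and destroy all data on it.
DISK=/tmp/wipe-demo.img
dd if=/dev/urandom of="$DISK" bs=1048576 count=8 2>/dev/null

# Zero the whole "disk" end to end so no stale RAID metadata can survive.
dd if=/dev/zero of="$DISK" bs=1048576 count=8 conv=notrunc 2>/dev/null

# Verify: stripping every NUL byte should leave nothing behind.
if ! tr -d '\0' < "$DISK" | grep -q .; then echo "disk fully zeroed"; fi
```

Writing zeros across a whole 4TB drive takes hours, which is why a targeted head/tail erase is preferable once the metadata location is known.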

#2 Updated by Xin Li about 5 years ago

  • Assignee changed from Xin Li to Alexander Motin

I have looked at the dump and found that the partition table on ada0 is the same on ada2, and ada1 is the same on ada3 (GEOM dumps). So, somehow the OS asked for data from ada2/ada3 but got data from ada0/ada1 instead, however, the two devices have been probed as ada2 and ada3 correctly and reading of their capacity worked well.

#3 Updated by Alexander Motin about 5 years ago

  • Status changed from Unscreened to Screened
  • Priority changed from No priority to Important

I agree with Aleksey's diagnosis. This system has the HP B120i software "RAID", which is not supported by FreeBSD. My guess is that for some reason the disks kept residual RAID metadata, which caused the RAID BIOS to do unexpected things during an OS reboot. If a full disk erase helped with the problem -- this is my best guess.

As I can see, FreeNAS erases 1MB at the beginning of each disk and 4MB at the end. The RAID metadata probably resides somewhere outside these areas. It should not be a problem to increase the areas, but it would be good to know by how much. We cannot erase the whole disk each time -- it would take hours.
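The head-plus-tail erase described above can be sketched as follows, again against a scratch file rather than a real device node. The file name and sizes are illustrative stand-ins, not the actual FreeNAS implementation:

```shell
# Scratch "disk" (a real erase would target a device node); sizes are
# scaled down for the demo -- FreeNAS moved from 1+4MB to 32+32MB areas.
DISK=/tmp/erase-demo.img
DISK_MB=64    # pretend disk capacity in MiB
ERASE_MB=4    # MiB to zero at each end of the disk

dd if=/dev/urandom of="$DISK" bs=1048576 count="$DISK_MB" 2>/dev/null

# Zero the head: partition table and front metadata live here.
dd if=/dev/zero of="$DISK" bs=1048576 count="$ERASE_MB" conv=notrunc 2>/dev/null

# Zero the tail, where vendor RAID metadata often sits.
dd if=/dev/zero of="$DISK" bs=1048576 count="$ERASE_MB" \
    seek=$((DISK_MB - ERASE_MB)) conv=notrunc 2>/dev/null
```

Because only 2 x ERASE_MB megabytes are written regardless of disk size, this completes in seconds even on a 4TB drive, at the cost of missing any metadata stored outside those two areas.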

#4 Updated by Alexander Motin about 5 years ago

  • Status changed from Screened to 19
  • Target version set to Unspecified

In the nightly branch I've increased the metadata erase areas from 1+4MB to 32+32MB, hoping that will be enough. Unfortunately I was unable to find out what is enough for this RAID. I'll test the next build and then merge for the next SU.

#5 Updated by Aleksey Svirikin about 5 years ago

Alexander Motin wrote:

I agree with Aleksey's diagnosis. This system has the HP B120i software "RAID", which is not supported by FreeBSD. My guess is that for some reason the disks kept residual RAID metadata, which caused the RAID BIOS to do unexpected things during an OS reboot. If a full disk erase helped with the problem -- this is my best guess.

Previously, these drives were part of a mirror on an Adaptec 3405. The HP B120i was disabled.
I can run tests on other disks, but with this controller.

#6 Updated by Alexander Motin about 5 years ago

It is always great when a problem is reproducible. At the least, it lets us know when the problem is fixed.

#7 Updated by Alexander Motin almost 5 years ago

  • Status changed from 19 to Ready For Release
  • Target version changed from Unspecified to 261
  • ChangeLog Entry updated (diff)

This is probably all we can do without more info.

#8 Updated by Suraj Ravichandran almost 5 years ago

  • Status changed from Ready For Release to Resolved

#9 Updated by Kris Moore about 4 years ago

  • Target version changed from 261 to N/A

#10 Updated by Dru Lavigne almost 3 years ago

  • File deleted (debug-freenas-20150822101130.tgz)

#11 Updated by Dru Lavigne almost 3 years ago

  • File deleted (debug-freenas-20150820164952.tgz)
