Bug #47668

mmap segfaulting on some files

Added by Louis Letourneau about 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: No priority
Assignee: Alexander Motin
Category: OS
Target version:
Seen in:
Severity: New
Reason for Closing: Cannot Reproduce
Reason for Blocked:
Needs QA: No
Needs Doc: No
Needs Merging: No
Needs Automation: No
Support Suite Ticket: n/a
Hardware Configuration:
ChangeLog Required: No

Description

We have some files on our zfs pool that cannot be copied.

After running cp under truss, I see that the write from cp's mmap'ed buffer fails with a 'Bad address' error.

This also affects Apache WebDav. Apache segfaults on these files because of mmap.

cat, rsync, anything that doesn't use mmap works fine.

If I run

cat OFFENDING_FILE > /tmp/bak
cp /tmp/bak OFFENDING_FILE

everything falls back into place.

Not all files have this issue, only a few.
It doesn't seem to be size related.

I wrote some Python code to reproduce the problem seen with Apache and cp, with fs being the whole file size:

mm = mmap.mmap(f.fileno(), 0)
a = mm.read(fs)
mm.close()

This crashes if fs is the exact file size.
I can shrink fs until it starts working.
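
For anyone trying to reproduce this, the fragment above can be fleshed out into a standalone script along the following lines. This is only a sketch: the function name `read_via_mmap` and the choice of `ACCESS_READ` are assumptions, not what the original test necessarily used, and `mmap` refuses to map an empty file.

```python
import mmap
import os
import sys


def read_via_mmap(path, length=None):
    """Map `path` read-only and read `length` bytes (default: the whole file)."""
    size = os.path.getsize(path)
    if length is None:
        length = size  # "fs" in the fragment above
    with open(path, "rb") as f:
        # A length of 0 maps the whole file. ACCESS_READ is an assumption made
        # here so the file can be opened read-only.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            data = mm.read(length)
        finally:
            mm.close()
    return data


if __name__ == "__main__":
    if len(sys.argv) > 1:
        data = read_via_mmap(sys.argv[1])
        print(f"read {len(data)} bytes via mmap")
```

Passing a smaller `length` corresponds to shrinking fs.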

help please

History

#1 Updated by Louis Letourneau about 2 years ago

  • File debug-nas-460-002-20180921204314.txz added
  • Private changed from No to Yes

#2 Updated by Dru Lavigne about 2 years ago

  • Assignee changed from Release Council to Alexander Motin

#3 Updated by Alexander Motin about 2 years ago

  • Status changed from Unscreened to Screened

I don't think I have seen any reports like this before, and I have no quick ideas what it could be. I'd recommend reporting this issue to the FreeBSD mailing lists in case somebody else has hit it or has some idea, since this problem does not sound FreeNAS-specific. It would also be good to try the latest FreeNAS 11.2 BETA build to make sure the problem still exists there.

#4 Updated by Louis Letourneau about 2 years ago

I'm at a point where I need to debug mmap to see why or where it fails.

Would you have any other ideas, or things to log, to help pinpoint it? Could ZFS store a file that would be non-contiguous once mmapped?

#5 Updated by Alexander Motin about 2 years ago

My first suspicion was that some files have corrupted checksums. That could explain read problems, and I am not sure how those are supposed to be handled in the case of mmap, possibly indeed as panics. But I see no checksum errors in the debug file, and a checksum error would not allow the files to be copied by other means either.

My second guess is that it is not related to a specific file, but to OS state. I'd try the same file again after a FreeNAS reboot, for example, to make sure it is not a problem of lack of RAM or something similar.

ZFS has no problems with non-contiguous files, since with compression enabled any significantly large range of zeroes looks, after compression, almost the same as a never-written hole (aside from the birth time of the blocks). I cannot say it is unrelated, but that would not be my first suspicion.

Mostly out of curiosity, why do you need mmap() to copy files? Unlike UFS, where mmap() directly accesses pages from the file system cache, with ZFS the kernel just allocates separate buffers and fills them with data explicitly read from ZFS. It may still make sense for some complicated software, but for things like copying and web serving I'd guess the usual read/write could be more efficient.
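
To illustrate the read/write alternative mentioned above, here is a minimal sketch of a copy loop that never maps the file. The function name `copy_with_read_write` and the 128 KiB chunk size are arbitrary choices for illustration, not taken from any FreeNAS or FreeBSD code.

```python
CHUNK_SIZE = 128 * 1024  # arbitrary chunk size chosen for this sketch


def copy_with_read_write(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst using plain read()/write() calls, without mmap."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            fout.write(chunk)
            copied += len(chunk)
    return copied
```

This is the same access pattern as cat or rsync, which the reporter says work on the affected files.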

#6 Updated by Louis Letourneau about 2 years ago

I do not need mmap at all.

The problem is that Apache/WebDAV uses it unless I set EnableMMAP off.
And I can't do that through the UI; I had to change the '/conf/base/etc/ix.rc.d/ix-apache' file directly.
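
For reference, the directive in question is a one-line setting in the Apache configuration. The fragment below shows only the directive itself. Where the ix-apache script places it in the generated configuration is not stated in this thread, so the surrounding context is left out.

```apacheconf
# Read file contents with read() during delivery instead of memory-mapping them.
EnableMMAP Off
```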

cp and Samba also use mmap.

For example, here is the truss output of cp.
Source: /mnt/pool/s3_video/hmediasl20182019/15568/15568-4_0006.ts
Destination: /tmp/g


[...]
fstatat(AT_FDCWD,"/mnt/pool/s3_video/hmediasl20182019/15568/15568-4_0006.ts",{ mode=-rw-rw-r-- ,inode=11853732,size=786968,blksize=131072 },0x0) = 0 (0x0)
stat("/tmp/g",{ mode=-rw-rw-r-- ,inode=369,size=786968,blksize=4096 }) = 0 (0x0)
openat(AT_FDCWD,"/mnt/pool/s3_video/hmediasl20182019/15568/15568-4_0006.ts",O_RDONLY,00) = 3 (0x3)
openat(AT_FDCWD,"/tmp/g",O_WRONLY|O_TRUNC,00) = 4 (0x4)
mmap(0x0,786968,PROT_READ,MAP_SHARED,3,0x0) = 34366312448 (0x800645000)
write(4,0x800645000,786968) ERR#14 'Bad address'
cp: write(2,"cp: ",4) = 4 (0x4)
/tmp/gwrite(2,"/tmp/g",6) = 6 (0x6)
: write(2,": ",2) = 2 (0x2)
Bad address
write(2,"Bad address\n",12) = 12 (0xc)
munmap(0x800645000,786968) = 0 (0x0)
close(4) = 0 (0x0)
close(3) = 0 (0x0)
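
The traced sequence (map the source with PROT_READ and MAP_SHARED, then hand the mapping to a single write()) can be imitated from Python for testing. This is a sketch under assumptions: the function name `copy_like_cp_mmap` is made up, `O_CREAT` is added so the destination need not exist, and it is only a guess that the failure would surface here the same way, as an `OSError` with `errno.EFAULT` ('Bad address').

```python
import mmap
import os


def copy_like_cp_mmap(src, dst):
    """Mimic the traced calls: mmap the source, then write() the mapping."""
    src_fd = os.open(src, os.O_RDONLY)
    try:
        # The traced cp used O_WRONLY|O_TRUNC; O_CREAT is added for convenience.
        dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            size = os.fstat(src_fd).st_size
            # Same protection and flags as the traced mmap() call, offset 0.
            mm = mmap.mmap(src_fd, size, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ)
            try:
                # One write() straight from the mapped pages, as in the trace.
                # A kernel-side fault would raise OSError here.
                return os.write(dst_fd, mm)
            finally:
                mm.close()
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

The return value is the number of bytes the single write() reported. A short write is not retried, which matches the single call in the trace.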

There are many system programs that use mmap. When they use it on the "bad" files they usually segfault.

I will try the reboot as soon as I can.

#7 Updated by Alexander Motin about 2 years ago

  • Status changed from Screened to Blocked
  • Reason for Blocked set to Need additional information from Author

Since you are saying even a simple `cp` is affected, it sounds even more odd to me, since somebody else would have noticed it too if it were that simple. Please try to give us something we could reproduce. Also, the FreeNAS 11.2 release should be out soon, and you can already try it as a BETA; it may fix this simply through the OS update.

#8 Updated by Louis Letourneau about 2 years ago

I forgot to say, after the reboot all of the files that had the issues worked fine; which is super scary to us.

I haven't been able to reproduce the issue since, but it was clearly not something written to disk. Maybe something with locks or cache leaks... I just don't know.

#9 Updated by Alexander Motin almost 2 years ago

  • Status changed from Blocked to Closed
  • Target version changed from Backlog to N/A
  • Reason for Closing set to Cannot Reproduce
  • Reason for Blocked deleted (Need additional information from Author)
  • Needs QA changed from Yes to No
  • Needs Doc changed from Yes to No
  • Needs Merging changed from Yes to No

I'm sorry, but without a reproduction I don't see where to start. A problem affecting even a simple `cp` would probably be noticed more often if it were reproducible.

#10 Updated by Dru Lavigne almost 2 years ago

  • File deleted (debug-nas-460-002-20180921204314.txz)

#11 Updated by Dru Lavigne almost 2 years ago

  • Private changed from Yes to No
