Discussion: Can't open a file on NFS that was just written by another machine
e***@gmail.com
2008-08-08 12:52:31 UTC
My problem is as follows:

I have my application running on an LSF cluster (128+ nodes) that writes
files to an NFS disk.
As soon as a job has finished, the files are visible with ls, but they
have a file size of 0 for some time afterwards. A management process
running on another machine tries to open the files as soon as LSF
reports that all jobs have finished, but the files cannot be opened by
this process yet. It sometimes takes 10 seconds before the files can be
opened.
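
A minimal sketch of a polling workaround on the management side, assuming it is acceptable to simply wait and retry; the function name, path handling, timeout, and interval are illustrative placeholders, not part of the original setup:

    import os
    import time

    def wait_for_file(path, timeout=30.0, poll_interval=0.5):
        """Poll until 'path' exists with a non-zero size, then open it.

        Workaround sketch for the delayed visibility described above; it
        does not fix the underlying problem, it just waits it out.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                if os.path.getsize(path) > 0:
                    return open(path, "rb")
            except OSError:
                pass  # file not visible (or not stat-able) on this client yet
            time.sleep(poll_interval)
        raise TimeoutError("gave up waiting for %s" % path)

A loop like this only papers over the delay; reducing the client-side attribute cache timeout, as discussed later in the thread, addresses the cause.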
Michael D. Ober
2008-08-10 16:47:40 UTC
Post by e***@gmail.com
I have my application running on an LSF cluster (128+ nodes) that writes
files to an NFS disk.
As soon as a job has finished, the files are visible with ls, but they
have a file size of 0 for some time afterwards. A management process
running on another machine tries to open the files as soon as LSF
reports that all jobs have finished, but the files cannot be opened by
this process yet. It sometimes takes 10 seconds before the files can be
opened.
This is a result of the NFS protocol. NFS servers will keep files locked
for a period of time after the file write ends. In your case it sounds like
this period is 10 seconds.

Mike.
bcwalrus
2008-09-16 17:46:31 UTC
Post by e***@gmail.com
I have my application running on an LSF cluster (128+ nodes) that writes
files to an NFS disk.
As soon as a job has finished, the files are visible with ls, but they
have a file size of 0 for some time afterwards. A management process
running on another machine tries to open the files as soon as LSF
reports that all jobs have finished, but the files cannot be opened by
this process yet. It sometimes takes 10 seconds before the files can be
opened.
Post by Michael D. Ober
This is a result of the NFS protocol. NFS servers will keep files locked
for a period of time after the file write ends. In your case it sounds like
this period is 10 seconds.
Really? The NFS protocol doesn't say that a server can lock the file
without the client telling it to. First, there is no reason to do so.
Second, it would seriously kill performance.

Eric, I think you might have a client caching problem. The client caches
file attributes for <n> seconds, which is typically configurable with the
"actimeo=<n>" mount option. The caching is there to improve local access
latency and so that the client doesn't generate too much traffic.

The "file size of 0" suggests that the management process is still using
the cached file attributes, even though the file has already changed. So
you may want to look into reducing the cache timeout value.

Cheers,
bc
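
To make the actimeo suggestion concrete, here is a sketch of possible /etc/fstab entries on the management host, assuming a Linux NFS client; the server, export, and mount-point names are placeholders:

    # 1-second attribute cache instead of the default (commonly 3-60 seconds)
    server:/export  /mnt/results  nfs  actimeo=1  0  0

    # or disable attribute caching entirely (noac), at the cost of extra
    # GETATTR traffic to the server
    server:/export  /mnt/results  nfs  noac  0  0

A small actimeo value is usually a better trade-off than noac on a busy cluster, since noac forces the client to revalidate attributes on every access.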
