lampShadesDrifter
lampShadesDrifter

Reputation: 4139

Input / Output error when using HDFS NFS Gateway

Getting "Input / output error" when trying work with files in mounted HDFS NFS Gateway. This is despite having set dfs.namenode.accesstime.precision=3600000 in Ambari. For example, doing something like...

$ hdfs dfs -cat /hdfs/path/to/some/tsv/file | sed -e "s/$NULL_WITH_TAB/$TAB/g" | hadoop fs -put -f - /hdfs/path/to/some/tsv/file
$ echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" /nfs/hdfs/path/to/some/tsv/file)"

when trying to remove nulls from a tsv then inspect for nulls in that tsv based on the NFS location throws the error, but I am seeing it in many other places (again, already have dfs.namenode.accesstime.precision=3600000). Anyone have any ideas why this may be happening or debugging suggestions? Can anyone explain what exactly "access time" is in this context?

Upvotes: 0

Views: 542

Answers (1)

lampShadesDrifter
lampShadesDrifter

Reputation: 4139

From discussion on the apache hadoop mailing list:

I think access time refers to the POSIX atime attribute for files, the “time of last access” as described here for instance (https://www.unixtutorial.org/atime-ctime-mtime-in-unix-filesystems). While HDFS keeps a correct modification time (mtime), which is important, easy and cheap, it only keeps a very low-resolution sense of last access time, which is less important, and expensive to monitor and record, as described here (https://issues.apache.org/jira/browse/HADOOP-1869) and here (https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time).

However, to have a conforming NFS api, you must present atime, and so the HDFS NFS implementation does. But first you have to configure it on. [...] many sites have been advised to turn it off entirely by setting it to zero, to improve HDFS overall performance. See for example here ( https://community.hortonworks.com/articles/43861/scaling-the-hdfs-namenode-part-4-avoiding-performa.html, section "Don’t let Reads become Writes”). So if your site has turned off atime in HDFS, you will need to turn it back on to fully enable NFS. Alternatively, you can maintain optimum efficiency by mounting NFS with the “noatime” option, as described in the document you reference.

[...] check under /var/log, eg with find /var/log -name ‘*nfs3*’ -print

Upvotes: 0

Related Questions