Reputation: 457
My Hadoop version is 2.6.0-cdh5.10.0 and I am using a Cloudera VM.
I am trying to access the HDFS file system from my code so that I can read files and add them as job input or as a cache file.
When I access HDFS through the command line, I am able to list the files.
Command:
[cloudera@quickstart java]$ hadoop fs -ls hdfs://localhost:8020/user/cloudera
Found 5 items
-rw-r--r-- 1 cloudera cloudera 106 2017-02-19 15:48 hdfs://localhost:8020/user/cloudera/test
drwxr-xr-x - cloudera cloudera 0 2017-02-19 15:42 hdfs://localhost:8020/user/cloudera/test_op
drwxr-xr-x - cloudera cloudera 0 2017-02-19 15:49 hdfs://localhost:8020/user/cloudera/test_op1
drwxr-xr-x - cloudera cloudera 0 2017-02-19 15:12 hdfs://localhost:8020/user/cloudera/wc_output
drwxr-xr-x - cloudera cloudera 0 2017-02-19 15:16 hdfs://localhost:8020/user/cloudera/wc_output1
When I try to access the same path through my MapReduce program, I receive a FileNotFoundException. My MapReduce driver configuration code is:
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    if (args.length != 2) {
        System.err.println("Usage: test <in> <out>");
        System.exit(2);
    }
    ConfigurationUtil.dumpConfigurations(conf, System.out);
    LOG.info("input: " + args[0] + " output: " + args[1]);

    Job job = Job.getInstance(conf);
    job.setJobName("test");
    job.setJarByClass(Driver.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    job.addCacheFile(new Path("hdfs://localhost:8020/user/cloudera/test/test.tsv").toUri());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    boolean result = job.waitForCompletion(true);
    return (result) ? 0 : 1;
}
The job.addCacheFile line in the above snippet is what leads to the FileNotFoundException.
2) My second question is:
My entry in core-site.xml points to localhost:9000 as the default HDFS file system URI. But at the command prompt I can access the default HDFS file system only at port 8020, not at 9000. When I tried using port 9000, I ended up with a ConnectionRefused exception. I am not sure where the configuration is actually read from.
My core-site.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!--
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/student/tmp/hadoop-local/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>Default file system URI. URI: scheme://authority/path; scheme: method of access; authority: host, port, etc.</description>
  </property>
</configuration>
My hdfs-site.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hdfs/name</value>
    <description>Determines where on the local filesystem the DFS name
    node should store the name table (fsimage).</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hdfs/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. Usually 3, 1 in our case.</description>
  </property>
</configuration>
I am receiving the following exception:
java.io.FileNotFoundException: hdfs:/localhost:8020/user/cloudera/test/ (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at java.io.FileReader.<init>(FileReader.java:58)
at hadoop.TestDriver$ActorWeightReducer.setup(TestDriver.java:104)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any help will be useful!
Upvotes: 0
Views: 1261
Reputation: 1006
You are not required to give the full URI as the argument when accessing a file in HDFS. The client will add the hdfs://host_address prefix on its own, taken from core-site.xml. You just need to mention the file you want to access along with its directory structure, which in your case should be /user/cloudera/test.
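As a minimal sketch of that idea (the reducer class name, the symlink name, and the exact HDFS path are my assumptions, not your actual code; use whichever path the file really lives at): register the cache file with a path relative to the default file system, then read it in the reducer's setup() through the local symlink the framework creates in the task's working directory.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Driver side (inside run(), which already throws Exception):
// no scheme or host needed, fs.default.name supplies the hdfs:// prefix;
// the "#test.tsv" fragment names the local symlink created in the task directory.
job.addCacheFile(new URI("/user/cloudera/test/test.tsv#test.tsv"));

// Reducer side (hypothetical class): open the cached file via the local symlink,
// not via an hdfs:// URI, because java.io.FileReader only understands local paths.
public static class SampleReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try (BufferedReader reader = new BufferedReader(new FileReader("test.tsv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // parse each line of the cached TSV here
            }
        }
    }
}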
Coming to your second question: port 8020 is the default NameNode port for HDFS. That is why you are able to access HDFS at port 8020 even though you did not configure it anywhere. The reason for the ConnectionRefused exception is that HDFS is actually running on 8020, so nothing is listening on port 9000 and the connection is refused.
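If you want to see which configuration your client code is actually picking up, a small sketch like the following (the class name is hypothetical) prints the effective default file system as resolved from the core-site.xml on the classpath:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        // new Configuration() loads core-default.xml plus any core-site.xml found on the classpath
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("fs.default.name = " + conf.get("fs.default.name")); // deprecated alias
        System.out.println("FileSystem URI  = " + FileSystem.get(conf).getUri());
    }
}
If this prints 8020 while your edited file says 9000, your edited core-site.xml is simply not the one on the classpath that the client reads.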
Refer here for more details about the default ports.
Upvotes: 0