Pitty

Reputation: 329

Is there a command for downloading a web resource to HDFS?

I want to write a program that pushes a web resource onto Hadoop. I am using Hadoop 2.2.0 and find that 'put' does not work like this:

hadoop fs -put http://localhost/test.log hdfs://localhost:9000/log/ 

Is there any way to put the file into HDFS without downloading it first?

PS: Suppose I have no permissions on the Hadoop server or the web resource server.

Upvotes: 2

Views: 3361

Answers (3)

Svend

Reputation: 7180

Piping the file as Jigar suggests works:

curl http://stackoverflow.com/questions/22188311/is-there-a-command-for-downloading-a-web-resource-to-hdfs | hadoop fs -appendToFile - question.html

Technically, this use case requires a single client that connects to the remote URL as one stream and pumps its content into HDFS. The command could be executed directly from one of the HDFS data nodes to avoid making the bytes transit through an extra client host. Network traffic among the HDFS nodes cannot be avoided anyway, since the file will physically be stored on several nodes.

Upvotes: 6

scalauser

Reputation: 1327

Besides curl, you can stream the data into HDFS directly from Java. Take a look at the following example (IOUtils is from Apache Commons IO; args[0] names the target HDFS directory):

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UrlToHdfs {
    public static void main(String[] args) throws IOException {
        // Open an HTTP stream to the web resource
        URL url = new URL("http://example.com/feed/csv/month");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.connect();
        InputStream connStream = conn.getInputStream();

        // Create the target file on HDFS; args[0] is the destination directory
        FileSystem hdfs = FileSystem.get(new Configuration());
        FSDataOutputStream outStream = hdfs.create(new Path(args[0], "month.txt"));
        // Stream the HTTP response straight into HDFS, no local copy needed
        IOUtils.copy(connStream, outStream);
        outStream.close();
        connStream.close();
        conn.disconnect();
    }
}
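
To run it, package the class into a jar and launch it through the hadoop command so the HDFS configuration and libraries are on the classpath; the jar and class names here are hypothetical:

hadoop jar urltohdfs.jar UrlToHdfs /log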

Upvotes: 0

Jigar Parekh

Reputation: 6273

I think you can use Linux piping along with curl to download the file and store it in HDFS, as sketched below.
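
A minimal sketch of that pipe, reusing the placeholder URL and HDFS path from the question (-put reads from stdin when the source is given as -):

curl -s http://localhost/test.log | hadoop fs -put - hdfs://localhost:9000/log/test.log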

Upvotes: 0
