Reputation: 329
I want to write a program that pushes a web resource onto Hadoop. I am using Hadoop 2.2.0 and find that 'put' does not work like this:
hadoop fs -put http://localhost/test.log hdfs://localhost:9000/log/
Is there any way to put the file into HDFS without downloading it first?
PS: suppose I have no permissions on the Hadoop server or the web resource server.
Upvotes: 2
Views: 3361
Reputation: 7180
Piping the file as Jigar suggests works:
curl http://stackoverflow.com/questions/22188311/is-there-a-command-for-downloading-a-web-resource-to-hdfs | hadoop fs -appendToFile - question.html
Technically, this use case requires a single "client" that connects to the remote URL as one stream and pumps its content into HDFS. The command could be executed directly on one of the HDFS data nodes to avoid routing the bytes through an extra client host. Network traffic between HDFS nodes during the download cannot be avoided anyway, since the file will physically be stored on several nodes.
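That last point is easy to check after the upload: assuming the file ended up at the destination path from the question, an fsck report lists the blocks and the data nodes that hold each of them:
hdfs fsck /log/test.log -files -blocks -locations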
Upvotes: 6
Reputation: 1327
Just as with curl, we can stream the data into HDFS directly. Take a look at the following example in Java:
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UrlToHdfs {
    public static void main(String[] args) throws IOException {
        // Open a stream to the remote web resource
        URL url = new URL("http://example.com/feed/csv/month");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.connect();
        InputStream connStream = conn.getInputStream();
        // Create the target file in HDFS; args[0] is the destination directory
        FileSystem hdfs = FileSystem.get(new Configuration());
        FSDataOutputStream outStream = hdfs.create(new Path(args[0], "month.txt"));
        IOUtils.copy(connStream, outStream);
        outStream.close();
        connStream.close();
        conn.disconnect();
    }
}
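To run it, the Hadoop and Commons IO jars have to be on the classpath; assuming the class above is packaged into a jar named app.jar (the jar name is just a placeholder), something like this writes the file under /log/:
hadoop jar app.jar UrlToHdfs /log/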
Upvotes: 0
Reputation: 6273
I think you can use Linux piping along with curl to download the file and store it in HDFS.
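For example, with the paths from the question, something like this should work, since -put reads from stdin when the source is '-':
curl -L http://localhost/test.log | hadoop fs -put - hdfs://localhost:9000/log/test.log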
Upvotes: 0