Reputation: 503
I have all the pieces of a Hadoop implementation ready: a running cluster and a client writer that is pushing activity data into HDFS. I have a question about what happens next. I understand that we run jobs against the data that has been dumped into HDFS, but my questions are:
1) First off, I am writing into the stream and flushing periodically. I write the files from a thread using the HDFS Java client, and I don't see the files appear in HDFS until I kill my server. If I write enough data to fill a block, will it automatically appear in the file system? How do I get to a point where I have files that are ready to be processed by M/R jobs?
2) When do we run M/R jobs? As I said, the writer thread in the HDFS Java client holds a write lock on the file. At what point should I release that file? How does this interaction work? At what point is it 'safe' to run a job against that data, and what happens to the data in HDFS when it's done?
Upvotes: 0
Views: 166
Reputation: 8088
I would try to avoid "hard" synchronization between inserting data into Hadoop and processing the results. In many cases it is most practical to have two asynchronous processes:
a) One process puts files into HDFS. In many cases it is useful to build the directory structure by date.
b) Another process runs jobs over all but the most recent data.
You can run a job on the most recent data too, but the application should not rely on up-to-the-minute results. A job usually takes more than a few minutes anyway.
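A minimal sketch of the date-based layout described above, using only the Java standard library. The class name `DatedPartitions`, the `yyyy/MM/dd` layout, and the `/activity` root are assumptions made for illustration; they are not part of Hadoop. The writer would put each record under the directory for its day, and the job driver would pass only the closed (earlier) days as job input.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps days to directories and lists the days that are
// no longer being written to.
final class DatedPartitions {
    // Assumed layout: <root>/yyyy/MM/dd
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Directory the writer should use for data produced on the given day.
    static String directoryFor(String root, LocalDate day) {
        return root + "/" + day.format(DAY);
    }

    // Directories for every day from 'first' up to, but not including, 'today'.
    // Today's directory is skipped because the writer may still be adding files to it.
    static List<String> closedDirectories(String root, LocalDate first, LocalDate today) {
        List<String> dirs = new ArrayList<>();
        for (LocalDate d = first; d.isBefore(today); d = d.plusDays(1)) {
            dirs.add(directoryFor(root, d));
        }
        return dirs;
    }
}
```

For example, `DatedPartitions.closedDirectories("/activity", firstDay, LocalDate.now())` would give the list of input paths for a job that ignores today's partially written data.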
Another point: append is not 100% mainstream; it is an advanced feature that was built for HBase. If you build your application without it, you will also be able to work with other DFSs, such as Amazon S3, that do not support append. We collect data in the local file system and copy each file to HDFS once it is big enough.
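The "collect locally, copy when big enough" approach can be sketched roughly as follows. This is only an illustration built on the Java standard library: `RollingLocalWriter`, the file naming, and the `onClosed` callback are hypothetical names, and the actual upload is left to the callback. With the Hadoop client that callback might call something like `FileSystem.copyFromLocalFile`, but that wiring is not shown here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical writer: appends records to a local file and, once the file
// reaches maxBytes, closes it and hands it to a callback for upload.
final class RollingLocalWriter implements AutoCloseable {
    private final Path dir;
    private final long maxBytes;
    private final Consumer<Path> onClosed; // assumed upload hook, e.g. copy to HDFS
    private OutputStream out;
    private Path current;
    private long written;
    private int sequence;

    RollingLocalWriter(Path dir, long maxBytes, Consumer<Path> onClosed) {
        this.dir = dir;
        this.maxBytes = maxBytes;
        this.onClosed = onClosed;
    }

    // Appends one record as a line; rolls the file when the size threshold is reached.
    synchronized void write(String record) throws IOException {
        if (out == null) {
            open();
        }
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        written += bytes.length;
        if (written >= maxBytes) {
            roll();
        }
    }

    private void open() throws IOException {
        current = dir.resolve("activity-" + (sequence++) + ".log");
        out = Files.newOutputStream(current);
        written = 0;
    }

    // Closes the current file first, so the callback only ever sees complete files.
    private void roll() throws IOException {
        out.close();
        out = null;
        onClosed.accept(current);
    }

    // Hands over any partially filled file on shutdown.
    @Override
    public synchronized void close() throws IOException {
        if (out != null) {
            roll();
        }
    }
}
```

Because a file is handed over only after it has been closed, every file that reaches HDFS this way is complete, which sidesteps the question of when it is safe to run a job against a file that is still open for writing.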
Upvotes: 1
Reputation: 8881
Write enough data to fill a block, and you will see the file in the file system.
An M/R job is submitted to the scheduler, which takes care of running it against the data; we need not worry about that.
Upvotes: 1