razor

Reputation:

Getting data in and out of Hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day, and it seems perfect for my needs. My question revolves around getting data into Hadoop:

Is it possible to have the nodes on my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and upload that file once it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs against that same file at the same time?

Upvotes: 2

Views: 4241

Answers (4)

Kazuki Ohta

Reputation: 1441

The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS as it arrives. It's easy to install and manage.


Of course, you can also import data directly from your applications. Here's a Java example for posting logs to Fluentd.
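
A minimal sketch of what that might look like, assuming the fluent-logger-java client library and a Fluentd daemon on its default port; the tag prefix, label, and record fields below are illustrative:

    // Sketch only: assumes fluent-logger-java is on the classpath and a Fluentd
    // daemon (with the WebHDFS output configured) is running on localhost:24224.
    import java.util.HashMap;
    import java.util.Map;

    import org.fluentd.logger.FluentLogger;

    public class FluentdExample {
        // "app" becomes the tag prefix; host and port are the Fluentd defaults.
        private static final FluentLogger LOG =
                FluentLogger.getLogger("app", "localhost", 24224);

        public static void main(String[] args) {
            Map<String, Object> record = new HashMap<String, Object>();
            record.put("host", "webserver01");
            record.put("message", "GET /index.html 200");

            // Sends the record under tag "app.access"; Fluentd buffers it and
            // streams it on to HDFS via the WebHDFS plugin.
            LOG.log("access", record);

            LOG.close();
        }
    }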

Upvotes: 2

Jeff Hammerbacher

Reputation: 4236

I'd recommend using Flume to collect the log files from your servers into HDFS.
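
As a rough sketch (assuming a Flume NG agent; the agent, source, sink, and path names are illustrative), a flume.conf along these lines tails a local log file and writes it to HDFS:

    # One agent: tail a local log file into a memory channel, then an HDFS sink.
    agent1.sources = tailsrc
    agent1.channels = memch
    agent1.sinks = hdfssink

    agent1.sources.tailsrc.type = exec
    agent1.sources.tailsrc.command = tail -F /var/log/app/access.log
    agent1.sources.tailsrc.channels = memch

    agent1.channels.memch.type = memory
    agent1.channels.memch.capacity = 10000

    agent1.sinks.hdfssink.type = hdfs
    agent1.sinks.hdfssink.channel = memch
    agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/logs
    agent1.sinks.hdfssink.hdfs.fileType = DataStream
    agent1.sinks.hdfssink.hdfs.rollInterval = 300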

Upvotes: 0

toluju

Reputation: 4107

A Hadoop job can run over multiple input files, so there's really no need to keep all your data as one file. You won't be able to process a file until its file handle is properly closed, however.
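
For illustration, a minimal driver sketch that feeds several closed files/directories to one job (class names and paths are made up, and the mapper/reducer setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiInputDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "log-analysis");
            job.setJarByClass(MultiInputDriver.class);
            // Mapper/reducer classes omitted here; add your own with
            // job.setMapperClass(...) / job.setReducerClass(...).

            // Each closed file (or a whole directory of them) is just another input.
            FileInputFormat.addInputPath(job, new Path("/logs/node1"));
            FileInputFormat.addInputPath(job, new Path("/logs/node2"));
            FileOutputFormat.setOutputPath(job, new Path("/logs/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }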

Upvotes: 1

Eran Kampf

Reputation: 8986

HDFS does not support appends (yet?)

What I do is run the MapReduce job periodically and write the results to a 'processed_logs_#{timestamp}' folder. Another job can later take these processed logs and push them to a database, etc., so they can be queried online.
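
A tiny sketch of how such an output folder name could be built per run (the prefix and date format are illustrative, not taken from the answer above):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class OutputFolderName {
        public static void main(String[] args) {
            // e.g. /logs/processed_logs_20121105143000, which would then be passed
            // to FileOutputFormat.setOutputPath(...) in the job driver.
            String stamp = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
            System.out.println("/logs/processed_logs_" + stamp);
        }
    }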

Upvotes: 0
