razor

Reputation:

Getting data in and out of Hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day, and it seems perfect for my needs. My question revolves around getting data into Hadoop:

Is it possible to have the nodes on my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and upload that file once it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs against that same file at the same time?

Upvotes: 2

Views: 4241

Answers (4)

Kazuki Ohta

Reputation: 1441

The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS as it arrives. It's easy to install and manage.


Of course, you can also import data directly from your applications. Here's a Java example for posting logs to Fluentd.
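
A minimal sketch of what that might look like, assuming the fluent-logger-java client library and a Fluentd daemon on its default port; the tag prefix, label, and record fields below are illustrative:

    // Sketch only: assumes fluent-logger-java is on the classpath and a Fluentd
    // daemon (with the WebHDFS output configured) is running on localhost:24224.
    import java.util.HashMap;
    import java.util.Map;

    import org.fluentd.logger.FluentLogger;

    public class FluentdExample {
        // "app" becomes the tag prefix; host and port are the Fluentd defaults.
        private static final FluentLogger LOG =
                FluentLogger.getLogger("app", "localhost", 24224);

        public static void main(String[] args) {
            Map<String, Object> record = new HashMap<String, Object>();
            record.put("host", "webserver01");
            record.put("message", "GET /index.html 200");

            // Sends the record under tag "app.access"; Fluentd buffers it and
            // streams it on to HDFS via the WebHDFS plugin.
            LOG.log("access", record);

            LOG.close();
        }
    }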

Upvotes: 2

Jeff Hammerbacher

Reputation: 4236

I'd recommend using Flume to collect the log files from your servers into HDFS.
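
As a rough sketch (assuming a Flume NG agent; the agent, source, sink, and path names are illustrative), a flume.conf along these lines tails a local log file and writes it to HDFS:

    # One agent: tail a local log file into a memory channel, then an HDFS sink.
    agent1.sources = tailsrc
    agent1.channels = memch
    agent1.sinks = hdfssink

    agent1.sources.tailsrc.type = exec
    agent1.sources.tailsrc.command = tail -F /var/log/app/access.log
    agent1.sources.tailsrc.channels = memch

    agent1.channels.memch.type = memory
    agent1.channels.memch.capacity = 10000

    agent1.sinks.hdfssink.type = hdfs
    agent1.sinks.hdfssink.channel = memch
    agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/flume/logs
    agent1.sinks.hdfssink.hdfs.fileType = DataStream
    agent1.sinks.hdfssink.hdfs.rollInterval = 300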

Upvotes: 0

toluju

Reputation: 4107

A Hadoop job can run over multiple input files, so there's really no need to keep all your data as one file. You won't be able to process a file until its file handle is properly closed, however.
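
For illustration, a minimal driver sketch that feeds several closed files/directories to one job (class names and paths are made up, and the mapper/reducer setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiInputDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "log-analysis");
            job.setJarByClass(MultiInputDriver.class);
            // Mapper/reducer classes omitted here; add your own with
            // job.setMapperClass(...) / job.setReducerClass(...).

            // Each closed file (or a whole directory of them) is just another input.
            FileInputFormat.addInputPath(job, new Path("/logs/node1"));
            FileInputFormat.addInputPath(job, new Path("/logs/node2"));
            FileOutputFormat.setOutputPath(job, new Path("/logs/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }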

Upvotes: 1

Eran Kampf

Reputation: 8986

HDFS does not support appends (yet?)

What I do is run the MapReduce job periodically and write the results to a 'processed_logs_#{timestamp}' folder. Another job can later take these processed logs and push them to a database, etc., so they can be queried online.
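
A tiny sketch of how such an output folder name could be built per run (the prefix and date format are illustrative, not taken from the answer above):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class OutputFolderName {
        public static void main(String[] args) {
            // e.g. /logs/processed_logs_20121105143000, which would then be passed
            // to FileOutputFormat.setOutputPath(...) in the job driver.
            String stamp = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
            System.out.println("/logs/processed_logs_" + stamp);
        }
    }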

Upvotes: 0
