Reputation: 71
In a nutshell, I have a customer who wants the data contained in a bunch of ASCII text files (a.k.a. "input files") ingested into Accumulo.
These files are output from diverse data feed devices and will be generated continuously on non-Hadoop/non-Accumulo node(s) (a.k.a. "feed nodes"). The overall data throughput rate across all feeds is expected to be very high.
For the sake of simplicity, assume that all the data will end up in one forward index table and one inverted [reverse] index table in Accumulo.
I've already written an Accumulo client module using pyaccumulo that can establish a connection to Accumulo through the Thrift Proxy, read and parse the input files from a local filesystem (not HDFS), create the appropriate forward and reverse index mutations in code, and use a BatchWriter to write the mutations to the forward and reverse index tables. So far, so good. But there's more to it.
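For concreteness, here is a minimal sketch of what that client does, assuming the pyaccumulo Thrift-proxy API; the proxy host/port, credentials, table names, input path, and the parse_record() helper are placeholders for illustration, not my actual code.

```python
# Minimal pyaccumulo sketch: connect through the Thrift proxy, parse a local
# input file, and batch-write forward and reverse index mutations.
from pyaccumulo import Accumulo, Mutation

conn = Accumulo(host="proxy-host", port=42424, user="ingest", password="secret")

for table in ("fwd_index", "rev_index"):
    if not conn.table_exists(table):
        conn.create_table(table)

fwd_writer = conn.create_batch_writer("fwd_index")
rev_writer = conn.create_batch_writer("rev_index")

def parse_record(line):
    # Placeholder for the real feed-file parser; assume each line is "doc_id<TAB>term".
    doc_id, term = line.rstrip("\n").split("\t", 1)
    return doc_id, term

with open("/local/feeds/sample_input.txt") as f:
    for line in f:
        doc_id, term = parse_record(line)

        # Forward index: document -> term
        fwd = Mutation(doc_id)
        fwd.put(cf="term", cq=term, val="")
        fwd_writer.add_mutation(fwd)

        # Reverse (inverted) index: term -> document
        rev = Mutation(term)
        rev.put(cf="doc", cq=doc_id, val="")
        rev_writer.add_mutation(rev)

fwd_writer.close()
rev_writer.close()
conn.close()
```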
From various sources, I've learned that there are at least a few standard approaches to high-speed ingest in Accumulo that might apply to my scenario, and I'm asking for advice about which options make the most sense in terms of resource usage and ease of implementation and maintenance. Here are some options:
Personally, I like option #2 the most, as long as the Accumulo master node can handle the processing load of parsing all the input files on its own (i.e., non-parallel parsing). The variant of #2 in which I run my Accumulo client on each Accumulo node, and send the output of different feed nodes to different Accumulo nodes (or round-robin), still has the disadvantage of sending the forward and reverse index mutations across the cloud network to the Accumulo master, but it does have the advantage of parsing the input files more in parallel.
What I need to know is: Have I missed any viable options? Have I missed any advantages or disadvantages of each option? Are any of these advantages/disadvantages trivial or critically important regardless of my problem context, especially the network bandwidth / CPU cycle / disk I/O tradeoffs? Is MapReduce, with or without RFiles, worth the trouble compared to the BatchWriter? Does anyone have "war stories"?
Thanks!
Upvotes: 7
Views: 1103
Reputation: 271
Even for the same use case, people have personal preferences about how they would implement a solution. I would run Flume agents on the feed nodes to collect the data in HDFS, and then periodically run a MapReduce job over the new data arriving in HDFS, using the RFile approach; a minimal configuration sketch follows.
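To make the Flume leg of that pipeline concrete, here is a minimal agent configuration sketch, assuming a spooling-directory source on each feed node feeding an HDFS sink; the agent name, directory paths, and NameNode address are placeholders, and the periodic MapReduce/RFile bulk-import job is not shown.

```properties
# Hypothetical Flume agent "a1" on a feed node: watch a spool directory for
# finished input files and ship them into HDFS for later MapReduce/RFile processing.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Spooling-directory source: feed devices drop completed files here
a1.sources.r1.type     = spooldir
a1.sources.r1.spoolDir = /var/feeds/incoming
a1.sources.r1.channels = c1

# Durable file channel between source and sink
a1.channels.c1.type = file

# HDFS sink: land raw events under a staging directory for the periodic MR job
a1.sinks.k1.type          = hdfs
a1.sinks.k1.hdfs.path     = hdfs://namenode:8020/feeds/staging
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel       = c1
```

Each agent would be started with something like `flume-ng agent --conf conf --conf-file feed-agent.conf --name a1`, and the periodic MapReduce job would then read the new files under the staging path, write RFiles with AccumuloFileOutputFormat, and bulk-import them into the index tables.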
Upvotes: 1