Reputation: 31576
I am trying to process multiple Avro files with the code below. The idea is to first get a series of Avro files into a list, then open each Avro file and generate a stream of tuples (String, Int), and finally group the stream of tuples by key and sum the Ints.
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroCopyUtil {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Leads Data Analysis").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val fs = FileSystem.get(new Configuration())
    val avroList = GetAvroList(fs, args(0))

    avroList.flatMap(av =>
      sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av)
        .map(r => (r._1.datum.get("field").toString, 1)))
      .reduceByKey(_ + _)
      .foreach(println)
  }

  def GetAvroList(fs: FileSystem, input: String): List[String] = {
    // get all children
    val masterList: List[FileStatus] = fs.listStatus(new Path(input)).toList
    val (allFiles, allDirs) = masterList.partition(x => x.isDirectory == false)
    allFiles.map(_.getPath.toString) ::: allDirs.map(_.getPath.toString).flatMap(x => GetAvroList(fs, x))
  }
}
The compile error I get is:
[error] found : org.apache.spark.rdd.RDD[(org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord], org.apache.hadoop.io.NullWritable)]
[error] required: TraversableOnce[?]
[error] avroRdd.flatMap(av => sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](av))
[error] ^
[error] one error found
Edit: Based on the suggestion below, I tried
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
but I got the error
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: 2015-10-
15-00-1576041136-flumetracker.foo.com-FooAvroEvent.1444867200044.avro,hdfs:
Upvotes: 0
Views: 872
Reputation: 2442
Your function is unnecessary. You are also attempting to create an RDD within a transformation, which doesn't really make sense. A transformation (in this case, flatMap) runs on top of an RDD, and the records within that RDD are what get transformed. In the case of a flatMap, the expected output of the anonymous function is a TraversableOnce object, which is then flattened into multiple records by the transformation. Looking at your code, though, you don't really need a flatMap, as a simple map will suffice. Keep in mind also that, due to the immutability of RDDs, you must always assign the result of a transformation to a new value.
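To make that difference concrete, here is a tiny self-contained sketch (toy data, nothing to do with your Avro records): map produces exactly one output record per input record, while flatMap expects the function to return a TraversableOnce whose elements are flattened into the result, and each transformation is assigned to a new value.

import org.apache.spark.{SparkConf, SparkContext}

object MapVsFlatMapDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-vs-flatMap").setMaster("local[*]"))

    // A toy RDD of lines; stands in for any loaded data set.
    val lines = sc.parallelize(Seq("a b", "c d e"))

    // map: one output per input -> RDD[Array[String]] with 2 records
    val mapped = lines.map(_.split(" "))

    // flatMap: the returned collection is flattened -> RDD[String] with 5 records
    val flatMapped = lines.flatMap(_.split(" "))

    // RDDs are immutable, so each transformation above was assigned to a new value.
    println(mapped.count())      // 2
    println(flatMapped.count())  // 5

    sc.stop()
  }
}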
Try something like:
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](filePath)
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
It seems as though you may need to take some time to grasp some of Spark's basic framework nuances. I would recommend fully reading the Spark Programming Guide. Lastly, if you want to use Avro, please also check out spark-avro, as much of the boilerplate around working with Avro is taken care of there (and DataFrames may be more intuitive and easier to use for your use case).
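For reference, reading the same data through spark-avro would look roughly like the sketch below. This is just an illustration, assuming the com.databricks:spark-avro package is on your classpath and your records contain a field1 column; adjust the path and column name to your setup.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkAvroSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spark-avro-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Load one or more Avro files into a DataFrame; a directory or glob also works.
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load(args(0))

    // The same count-by-key logic, expressed on a DataFrame.
    df.groupBy("field1").count().show()

    sc.stop()
  }
}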
(EDIT:)
It seems like you may have misunderstood how to load data to be processed in Spark. The parallelize() method is used to distribute collections across an RDD, not data within files. To do the latter, you actually only need to provide a comma-separated list of input files to the newAPIHadoopFile() loader. So, assuming your GetAvroList() function works, you can do:
val avroList = GetAvroList(fs, args(0))
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))
val countsRDD = avroRDD.map(av => (av._1.datum.get("field1").toString, 1)).reduceByKey(_ + _)
countsRDD.foreach(println)
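And just to illustrate the parallelize() point from above, here is a short contrast. Treat it as a fragment that reuses the imports, sc, and avroList from the snippet above: parallelize() gives you an RDD whose records are the path strings themselves, while newAPIHadoopFile() reads the contents of the files at those paths.

// An RDD whose records are the path strings -- parallelize() distributes the
// in-memory collection; it does not open the files.
val pathsRDD = sc.parallelize(avroList)

// An RDD whose records are the contents of those files -- the loader takes a
// comma-separated list of paths and reads each file.
val recordsRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]](avroList.mkString(","))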
Upvotes: 1