Reputation: 880
I need to group a set of CSV lines by a certain column and do some processing on each group.
JavaRDD<String> lines = sc.textFile("somefile.csv");
JavaPairRDD<String, String> pairRDD = lines.mapToPair(new SomeParser());
List<String> keys = pairRDD.keys().distinct().collect();
for (String key : keys) {
    List<String> rows = pairRDD.lookup(key);   // lookup launches a separate job for each key
    int noOfVisits = rows.size();
    String country = COMMA.split(rows.get(0))[6];
    long accessDuration = getAccessDuration(rows, timeFormat);
    Map<String, Integer> counts = getCounts(rows);
    int whitepapers = counts.get("whitepapers");
    int tutorials = counts.get("tutorials");
    int workshops = counts.get("workshops");
    int casestudies = counts.get("casestudies");
    int productPages = counts.get("productpages");
}
private static long dateParser(String dateString) throws ParseException {
    SimpleDateFormat format = new SimpleDateFormat("MMM dd yyyy HH:mma");
    Date date = format.parse(dateString);
    return date.getTime();
}
dateParser is called for each row. Then the min and max timestamps for the group are calculated to get the access duration. The other values are simple string matches.
pairRDD.lookup is extremely slow. Is there a better way to do this with Spark?
Upvotes: 1
Views: 3095
Reputation: 37435
I think you could simply use that column as the key and do a groupByKey. There's no mention of the operation on those rows; if it's a function that combines those rows somehow, you could even use reduceByKey.
Something like:
import org.apache.spark.SparkContext._ // implicit pair functions
val pairs = lines.map(parser _)
val grouped = pairs.groupByKey
// here each element of grouped is of the form (key, Iterable[String])
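If the per-group processing can be expressed as an associative merge of per-row summaries, a reduceByKey version could look roughly like this (a sketch only; keyOf and timestampOf are hypothetical helpers that extract the grouping column and the parsed timestamp from a CSV row, they are not part of the question's code):
import org.apache.spark.SparkContext._ // implicit pair functions

// one-row summary that can be merged associatively
case class Summary(visits: Int, minTime: Long, maxTime: Long)

val summaries = lines.map(row => (keyOf(row), Summary(1, timestampOf(row), timestampOf(row))))

// merge per-row summaries per key without materializing whole groups
val reduced = summaries.reduceByKey { (a, b) =>
  Summary(a.visits + b.visits, math.min(a.minTime, b.minTime), math.max(a.maxTime, b.maxTime))
}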
* EDIT *
After looking at the process, I think it would be more efficient to map each row into the data it contributes and then use aggregateByKey to reduce them all to a total.
aggregateByKey takes two functions and a zero value:
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)]
The first function (seqOp) is a partition-level aggregator: it runs efficiently through each local partition, creating a partial aggregate per partition. The second function (combOp) takes those partial aggregates and combines them to obtain the final result.
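As a minimal standalone illustration of those two roles (this assumes a hypothetical nums: RDD[(String, Long)], not the question's data), a per-key sum and count would look like:
// zero value: the (sum, count) accumulator for a key before any values are seen
val sumAndCount = nums.aggregateByKey((0L, 0L))(
  // seqOp: fold one value into the partition-local accumulator
  (acc, v) => (acc._1 + v, acc._2 + 1L),
  // combOp: merge accumulators produced on different partitions
  (a, b) => (a._1 + b._1, a._2 + b._2)
)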
For the problem at hand, something like this:
val lines = sc.textFile("somefile.csv")
// parse returns a key and a decomposed Record of the values tracked: (key, Record("country", timestamp, "whitepaper", ...))
val records = lines.map(parse(_))
val totals = records.aggregateByKey((0, Set.empty[String], Long.MaxValue, Long.MinValue, Map.empty[String, Int]))(
  { case ((count, countrySet, minTime, maxTime, counterMap), record) =>
      (count + 1, countrySet + record.country, math.min(minTime, record.timestamp), math.max(maxTime, record.timestamp), ...) },
  (cumm1, cumm2) => ??? // add each field of the accumulator
)
This is the most efficient method in Spark to do key-based aggregations.
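A more complete version of that sketch, with the two functions written out (Record, its fields country, timestamp and pageType, and parse are assumptions based on the comment above, not code from the question):
case class Record(country: String, timestamp: Long, pageType: String)

// accumulator: (visit count, countries seen, min timestamp, max timestamp, per-page-type counts)
val zero = (0, Set.empty[String], Long.MaxValue, Long.MinValue, Map.empty[String, Int])

val totals = records.aggregateByKey(zero)(
  // seqOp: fold one Record into the partition-local accumulator
  { case ((count, countries, minT, maxT, pageCounts), r) =>
      (count + 1,
       countries + r.country,
       math.min(minT, r.timestamp),
       math.max(maxT, r.timestamp),
       pageCounts + (r.pageType -> (pageCounts.getOrElse(r.pageType, 0) + 1))) },
  // combOp: merge two partial accumulators field by field
  { case ((c1, s1, min1, max1, m1), (c2, s2, min2, max2, m2)) =>
      (c1 + c2,
       s1 ++ s2,
       math.min(min1, min2),
       math.max(max1, max2),
       (m1.keySet ++ m2.keySet).map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap) }
)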
Upvotes: 3