Reputation: 3855
I want to calculate, using Spark and Scala, the h-index for a researcher (https://en.wikipedia.org/wiki/H-index) from a CSV file with data in the format:
R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B
The h-index is an academic indicator for a researcher. It is computed by building, for every researcher, a single list of their publications sorted by count in descending order, e.g. R1: {A:10, B:5, C:1}, and then finding the last position (counting from 1) where the value is at least as large as the position. Here that is position 2, because at position 3 the value is 1 < 3.
I cannot find a solution for Spark using Scala. Can anyone help?
Upvotes: 0
Views: 153
Reputation: 215117
In case you have a file like this:
R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B
R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B
R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B
Here are some thoughts:
// assumes `input` is an RDD[String] holding the lines of the file
input.
  // emit (researcher:paper, 1) for every entry on a line
  flatMap(line => line.split(", ").map(_ -> 1)).
  // sum the counts with researcher:paper as the key
  reduceByKey(_ + _).
  map { case (ra, count) =>
    // split researcher:paper
    val Array(author, article) = ra.split(":")
    // re-key so that the researcher becomes the new key
    author -> (article, count)
  }.
  // group the (paper, count) pairs by researcher
  groupByKey.collect
// res15: Array[(String, Iterable[(String, Int)])] = Array((R2,CompactBuffer((B,6), (A,3), (C,3))), (R1,CompactBuffer((A,6), (B,12), (D,6))))
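The grouped counts still need to be turned into an h-index per researcher. Below is a minimal sketch of that last step in plain Scala, assuming the usual definition (the largest h such that h papers each have a count of at least h). The names `hIndex`, `grouped` and `hByResearcher` are hypothetical and only for illustration; `grouped` is just the result shown above copied into a local collection.

```scala
// Hypothetical helper: h-index from a researcher's per-paper counts.
// Sorted in descending order, the test `count >= position` is true for a
// prefix of the list only, so counting the matches gives the last position.
def hIndex(counts: Iterable[Int]): Int =
  counts.toSeq
    .sorted(Ordering[Int].reverse)          // largest count first
    .zipWithIndex                           // attach 0-based positions
    .count { case (c, i) => c >= i + 1 }    // compare with 1-based position

// The grouped result from above, copied into a local Seq for illustration.
val grouped: Seq[(String, Seq[(String, Int)])] = Seq(
  "R2" -> Seq("B" -> 6, "A" -> 3, "C" -> 3),
  "R1" -> Seq("A" -> 6, "B" -> 12, "D" -> 6)
)

// Drop the paper names and keep one h-index per researcher.
val hByResearcher: Map[String, Int] =
  grouped.map { case (researcher, papers) =>
    researcher -> hIndex(papers.map(_._2))
  }.toMap
```

On the RDD itself, the same helper could probably be applied before collecting, e.g. `groupByKey.mapValues(papers => hIndex(papers.map(_._2)))`, so that only one number per researcher is brought back to the driver.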
Upvotes: 1