Reputation: 605
We have multiple files with a data structure like:
file1.txt
idUser: 34
Name: User1
Activity: 34
Comments: I like this
idUser: 45
Name: User43
Activity: 12
Comments: I don't like this activity
file2.txt
idUser: 45
Name: User43
Activity: 678
Comments: I like this activity but not much
We can have thousands of files and millions of records. We are planning to do data analysis in Spark with those files.
I have loaded my files like:
JavaPairRDD<String, String> files = context.wholeTextFiles(inputPath);
I would like to transform this data structure into a JavaPairRDD<Integer, List<UserActivity>>,
where the key is idUser and each UserActivity is one record from the files. Does anyone know how to do this transformation? And is there a way to do it faster using partitioning, since I have more than 500 million records?
Upvotes: 0
Views: 154
Reputation: 334
If you need to convert a JavaPairRDD into a JavaPairRDD with a different key/value structure, you can do it with the .mapToPair() method.
For example:
JavaPairRDD<Integer, List<UserActivity>> newStruct = files.mapToPair(new MyConverter());
public class MyConverter implements PairFunction<Tuple2<Tuple2<String, String>, Long>, Integer, List<UserActivity>> {
    public Tuple2<Integer, List<UserActivity>> call(Tuple2<Tuple2<String, String>, Long> val) {
        return ...
    }
}
Additional examples:
Update:
The question was updated, so I'm updating my answer. With the current structure, it would look like:
JavaPairRDD<Integer, List<UserActivity>> newStruct = files.mapToPair(new MyConverter());
public class MyConverter implements PairFunction<Tuple2<String, String>, Integer, List<UserActivity>> {
    public Tuple2<Integer, List<UserActivity>> call(Tuple2<String, String> val) {
        return ...
    }
}
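The converter still has to turn the text of one file into records. Here is a minimal sketch of that parsing step in plain Java, without Spark, so it can be tried on its own. The UserActivity fields and the ActivityParser class are hypothetical, since the question does not show them, and the sketch assumes every record ends with its Comments line. Note that one file can hold several users, so in Spark this would probably fit better inside flatMapToPair (emitting one pair per record) than inside mapToPair, which emits exactly one pair per file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value class: the question does not show UserActivity's fields.
// A class used inside a Spark job would also need to be Serializable.
class UserActivity {
    final int idUser;
    final String name;
    final int activity;
    final String comments;

    UserActivity(int idUser, String name, int activity, String comments) {
        this.idUser = idUser;
        this.name = name;
        this.activity = activity;
        this.comments = comments;
    }
}

// Hypothetical helper: parses the text of ONE file (the value of a
// wholeTextFiles pair) into records grouped by idUser.
class ActivityParser {
    static Map<Integer, List<UserActivity>> parse(String content) {
        Map<Integer, List<UserActivity>> byUser = new LinkedHashMap<>();
        Integer id = null;
        String name = null;
        Integer activity = null;

        for (String line : content.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // skip blank or malformed lines
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();

            switch (key) {
                case "idUser":
                    id = Integer.valueOf(value);
                    break;
                case "Name":
                    name = value;
                    break;
                case "Activity":
                    activity = Integer.valueOf(value);
                    break;
                case "Comments":
                    // Assumption: Comments is always the last field of a record.
                    if (id != null && activity != null) {
                        byUser.computeIfAbsent(id, k -> new ArrayList<>())
                              .add(new UserActivity(id, name, activity, value));
                    }
                    id = null;
                    name = null;
                    activity = null;
                    break;
                default:
                    break; // ignore unknown fields
            }
        }
        return byUser;
    }
}
```

This only groups records within a single file. The same idUser can appear in several files (45 is in both file1.txt and file2.txt in the question), so the per-file results would still need to be merged by key across the whole RDD, for example with groupByKey or reduceByKey.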
Upvotes: 1
Reputation: 38
Why do you want a JavaPairRDD<Integer, List<UserActivity>>? Don't you think that a JavaPairRDD<Integer, UserActivity> would be enough? I think it will let you avoid many problems later on.
If you want to transform a JavaPairRDD into another JavaPairRDD, you can use a map; see this post
Upvotes: 1