ypriverol

Reputation: 605

Parsing multiple files into a Spark RDD

We have multiple files with a data structure like:

file1.txt

idUser: 34 
Name: User1
Activity: 34 
Comments: I like this 

idUser: 45
Name: User43
Activity: 12 
Comments: I don't like this activity

file2.txt

idUser: 45
Name: User43
Activity: 678
Comments: I like this activity but not much 

We can have thousands of files and millions of records. We are planning to do data analysis in Spark with those files.

I have loaded my files like this:

 JavaPairRDD<String, String> files = context.wholeTextFiles(inputPath); 

I would like to transform this data structure into JavaPairRDD<Integer, List<UserActivity>>,

where each UserActivity is one of the entries in a file. Does anyone know how to do this transformation? And is there a way to make it faster using partitioning? I have more than 500 million records.

Upvotes: 0

Views: 154

Answers (2)

AlexM

Reputation: 334

If you need to convert a JavaPairRDD into a JavaPairRDD with a different data structure, you can do it with the .mapToPair() method.

For example:

JavaPairRDD<Integer, List<UserActivity>> newStruct = files.mapToPair(new MyConverter());

public class MyConverter implements PairFunction<Tuple2<Tuple2<String, String>, Long>, Integer, List<UserActivity>> {
    public Tuple2<Integer, List<UserActivity>> call(Tuple2<Tuple2<String, String>, Long> val) {
        return ...
    }
}

Additional examples:

https://www.programcreek.com/java-api-examples/index.php?class=org.apache.spark.api.java.JavaRDD&method=mapToPair

Update:

The question was updated, so I'm updating my answer. With the current structure, it would look like:

JavaPairRDD<Integer, List<UserActivity>> newStruct = files.mapToPair(new MyConverter());

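// Needs: org.apache.spark.api.java.function.PairFunction, scala.Tuple2, java.util.List
public class MyConverter implements PairFunction<Tuple2<String, String>, Integer, List<UserActivity>> {
    // With wholeTextFiles, val._1() is the file path and val._2() is the whole file content
    public Tuple2<Integer, List<UserActivity>> call(Tuple2<String, String> val) {
        return ...
    }
}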
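
The body of call() is left open above. As a rough sketch of the parsing step only, here is plain Java (no Spark dependency) that turns the content of one file into records. The UserActivity fields, the class name UserActivityParser and the helper names are assumptions, since the question does not show them; the format is assumed to be "Key: value" lines with a blank line between records, as in the sample files.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UserActivityParser {

    /** Hypothetical record type; the question does not show its fields. */
    public static class UserActivity implements Serializable {
        public final int idUser;
        public final String name;
        public final int activity;
        public final String comments;

        public UserActivity(int idUser, String name, int activity, String comments) {
            this.idUser = idUser;
            this.name = name;
            this.activity = activity;
            this.comments = comments;
        }
    }

    /** Parses the full text of one file into a list of records. */
    public static List<UserActivity> parse(String content) {
        List<UserActivity> records = new ArrayList<>();
        Map<String, String> fields = new HashMap<>();
        for (String rawLine : content.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                // A blank line closes the current record, if there is one.
                flush(fields, records);
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Key: value" line; skipped in this sketch
            }
            fields.put(line.substring(0, colon).trim(),
                       line.substring(colon + 1).trim());
        }
        flush(fields, records); // the last record may not end with a blank line
        return records;
    }

    /** Builds a record from the collected fields, then resets them. */
    private static void flush(Map<String, String> fields, List<UserActivity> out) {
        if (fields.containsKey("idUser")) {
            out.add(new UserActivity(
                Integer.parseInt(fields.get("idUser")),
                fields.getOrDefault("Name", ""),
                Integer.parseInt(fields.getOrDefault("Activity", "0")),
                fields.getOrDefault("Comments", "")));
        }
        fields.clear();
    }
}
```

Note that one file can hold records for several users, so a one-to-one mapToPair over files cannot yield one key per user. One option would be files.values().flatMap(...) over parse(content), then mapToPair keyed by idUser, then groupByKey. The flatMap function returns an Iterator in Spark 2.x and an Iterable in 1.x, so the exact code depends on your version.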

Upvotes: 1

Arthur

Reputation: 38

Why do you want a JavaPairRDD<Integer, List<UserActivity>>? Don't you think that a JavaPairRDD<Integer, UserActivity> would be enough? I think it will let you avoid many problems later on.

If you want to transform a JavaPairRDD into another JavaPairRDD, you can use a map; see this post
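
To illustrate the difference between the two shapes, here is a small plain-Java sketch (no Spark; the class and method names are made up for illustration). With flat (idUser, activity) pairs, values can be combined per user as they arrive, which is roughly what reduceByKey does on a pair RDD. Building a list per user is roughly what groupByKey does, and it has to hold every record for a key at once. This is one possible reading of the "problems" mentioned above, not something the answer states.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FlatVersusGrouped {

    /** Flat (idUser, activity) pairs taken from the sample files in the question. */
    public static List<Map.Entry<Integer, Integer>> samplePairs() {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>(34, 34));
        pairs.add(new SimpleEntry<>(45, 12));
        pairs.add(new SimpleEntry<>(45, 678));
        return pairs;
    }

    /** Similar in spirit to groupByKey: one list of values per user. */
    public static Map<Integer, List<Integer>> groupPerUser(List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Similar in spirit to reduceByKey: values are combined directly, no lists are kept. */
    public static Map<Integer, Integer> sumPerUser(List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, Integer> sums = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(groupPerUser(samplePairs()));
        System.out.println(sumPerUser(samplePairs()));
    }
}
```

If the analysis only needs per-user aggregates, the flat form is usually enough; the list form is only needed when all records of a user must be processed together.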

Upvotes: 1
