Praveen Mandadi

Reputation: 381

How to iterate over a JavaRDD using foreach and find a particular element in each line using Spark Java

I have these lines in my text file:

Some different lines....

Name : Praveen  
Age : 24  
Contact : 1234567890  
Location : India  

Some different lines....

Name : John  
Contact : 1234567890  
Location : UK  

Some different lines....  

Name : Joe  
Age : 54  
Contact : 1234567890  
Location : US  

"Some different lines" indicates that there is other information in between.

Now I need to read the file and extract the person info. If a key is missing, it should be read as an empty string (for example, Age is missing for the second person).

JavaRDD<String> data = jsc.textFile("person.report");

List<String> name = data.filter(f -> f.contains("Name")).collect();
List<String> age = data.filter(f -> f.contains("Age")).collect();
List<String> contact = data.filter(f -> f.contains("Contact")).collect();
List<String> location = data.filter(f -> f.contains("Location")).collect();

When I do it this way and iterate over the lists in a for loop, the age of the 3rd person gets assigned to the 2nd person.

Upvotes: 3

Views: 2226

Answers (1)

Oli

Reputation: 10406

First, you are collecting everything on the driver. Are you sure that is what you want to do? It will not work with a big dataset.

Basically, your problem is that what you consider to be a record is not on a single line. By default, Spark treats each line as a separate record, yet here each record spans several lines (name, age, location...). To overcome this, you need a different record delimiter. If "Some different lines" contains a specific string, use it and set this property:

sc.hadoopConfiguration.set("textinputformat.record.delimiter","specific string")

Then, in Scala, you could write something like:

val cols = Seq("Name","Age", "Contact", "Location")
sc.textFile("...")
  .map( _.split("\n"))
  .map(x => cols
       .map( col => x.find(_.startsWith(col)).getOrElse(col+" :") ) )

All the lines belonging to one person will then be in the same record, for you to process as you wish. If you cannot find a suitable delimiter, every record presumably has a Name, so you could probably use "Name : " as the delimiter.
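One caveat with using "Name : " as the delimiter: the delimiter itself is removed from the records, so each person record starts with the name value rather than with "Name : ", and whatever precedes the first "Name : " comes out as a record of its own. The following is a minimal plain-Java sketch of that behaviour, with no Spark dependency. The class and method names are hypothetical, and `String.split` merely stands in for the record reader:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DelimiterSketch {
    static final String DELIMITER = "Name : ";

    // Hypothetical helper: emulates reading `text` with "Name : " as the
    // record delimiter, then repairs the records so they are usable.
    static List<String> personRecords(String text) {
        // Pattern.quote makes split treat the delimiter as a literal string.
        String[] chunks = text.split(Pattern.quote(DELIMITER));
        List<String> records = new ArrayList<>();
        // Chunk 0 is whatever came before the first "Name : "; it holds no person.
        for (int i = 1; i < chunks.length; i++) {
            // The delimiter was consumed by the split, so put the key back.
            records.add(DELIMITER + chunks[i]);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "Some different lines....\n"
                + "Name : Praveen\nAge : 24\nLocation : India\n"
                + "Some different lines....\n"
                + "Name : John\nLocation : UK\n";
        for (String record : personRecords(text)) {
            System.out.println(record.trim().replace("\n", " | "));
        }
    }
}
```

With Spark, the equivalent would presumably be to drop the first record and prepend "Name : " to the others before looking the keys up with `startsWith`.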

In Java 8 you can use streams to implement the same logic as the Scala snippet. This is a bit more verbose, but since the question asked for Java, there you go:

String[] array = {"Name", "Age", "Contact", "Location"};
List<String> list = Arrays.asList(array);
sc.textFile("...")
    .map(x -> Arrays.asList(x.split("\n")))
    .map(x -> list.stream()
                  .map(col -> x.stream()
                               .filter(line -> line.startsWith(col))
                               .findAny()
                               .orElse(col+" :"))
                  .collect(Collectors.toList()) );
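The snippets above keep the whole line (for example "Age : 24") and fall back to "Age :" when a key is missing. If you want just the values, with an empty string for a missing key as the question asks, the per-record step could look like the sketch below. It is plain Java with no Spark dependency; `PersonRecordParser` and `parse` are hypothetical names, and it assumes every line of interest has the form "Key : value":

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PersonRecordParser {
    static final List<String> KEYS = Arrays.asList("Name", "Age", "Contact", "Location");

    // Hypothetical helper: turns one multi-line record into key -> value,
    // leaving "" for any key that has no line in the record.
    static Map<String, String> parse(String record) {
        Map<String, String> person = new LinkedHashMap<>();
        for (String key : KEYS) {
            person.put(key, ""); // default for missing keys
        }
        for (String line : record.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Key : value" line
            }
            String key = line.substring(0, colon).trim();
            if (person.containsKey(key)) {
                person.put(key, line.substring(colon + 1).trim());
            }
        }
        return person;
    }

    public static void main(String[] args) {
        String record = "Name : John  \nContact : 1234567890  \nLocation : UK  \n\nSome different lines....";
        // prints {Name=John, Age=, Contact=1234567890, Location=UK}
        System.out.println(parse(record));
    }
}
```

Assuming the multi-line records are in a `JavaRDD<String>`, a method like this could be passed to `map` in place of the stream pipeline.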

Upvotes: 3
