Reputation: 946
My file contains rows with different structures. Each column is identified by its position, which depends on the row type.
For example, we could have a file like this:
row_type1 first_name1 last_name1 info1 info2
row_type2 last_name1 first_name1 info3 info2
row_type3info4info1last_name1first_name1
We know the position of every column for every row type, so we can use substring to extract them.
The target dataframe will be (first_name1, last_name1, info1, info2, info3, info4), with no duplicate (first_name1, last_name1) pairs.
info1, for example, appears in both the 1st and the 3rd row, so I also need to choose which one to keep. For example, if the info1 of the 1st row is empty or contains only 2 characters, I will choose the info1 of the 3rd row.
I'm using Spark 2.2 + Scala 2.10.
I hope my question is clear enough. Thank you for your time.
Upvotes: 0
Views: 71
Reputation: 1181
Use RDD.map to transform each record into a standard format. Then write an aggregation function that merges the info columns; put your selection logic for the info columns there. Finally, aggregate the records by key (first_name, last_name), calling that aggregation function on the info columns.
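A minimal Scala sketch of that approach. The fixed-width offsets for row_type3, the input path, and the tie-breaking rule for the info columns are assumptions inferred from the question; adapt them to your real layout:

```scala
case class Person(first: String, last: String,
                  info1: String, info2: String, info3: String, info4: String)

// Keep `a` unless it is empty or has at most 2 characters, otherwise fall
// back to `b` -- the selection rule the question describes for info1.
def pickInfo(a: String, b: String): String =
  if (a == null || a.trim.length <= 2) b else a

// Normalize one raw line into ((first, last), Person); info columns a
// row type does not carry are left empty so the merge step can fill them.
def parse(line: String): ((String, String), Person) =
  if (line.startsWith("row_type1")) {
    val Array(_, first, last, i1, i2) = line.split(" ")
    ((first, last), Person(first, last, i1, i2, "", ""))
  } else if (line.startsWith("row_type2")) {
    val Array(_, last, first, i3, i2) = line.split(" ")
    ((first, last), Person(first, last, "", i2, i3, ""))
  } else {
    // row_type3 is fixed-width; these offsets match the sample line in the
    // question but are placeholders -- substitute the real column positions.
    val i4    = line.substring(9, 14)
    val i1    = line.substring(14, 19)
    val last  = line.substring(19, 29)
    val first = line.substring(29)
    ((first, last), Person(first, last, i1, "", "", i4))
  }

// Combine two records that share the same (first, last) key.
def merge(a: Person, b: Person): Person = Person(
  a.first, a.last,
  pickInfo(a.info1, b.info1), pickInfo(a.info2, b.info2),
  pickInfo(a.info3, b.info3), pickInfo(a.info4, b.info4))

// With these pieces the Spark job (given an existing SparkSession `spark`)
// would be roughly:
//   spark.sparkContext.textFile("input.txt")
//     .map(parse)
//     .reduceByKey(merge)
//     .values
//     .toDF()   // needs: import spark.implicits._
```

Because `merge` is commutative here (each info slot is either empty on one side or resolved by `pickInfo`), `reduceByKey` does the aggregation without pulling all rows for a key to one place first.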
Upvotes: 1