Reputation: 946
My file contains rows with different structures. Each column is identified by its position, which depends on the row type.
For example, we could have a file like this:
row_type1 first_name1 last_name1 info1 info2
row_type2 last_name1 first_name1 info3 info2
row_type3info4info1last_name1first_name1
We know the position of every column for every row type, so we can use substring to extract them.
The target dataframe will be (first_name1, last_name1, info1, info2, info3, info4), with no duplicate (first_name1, last_name1) pairs.
info1, for example, appears in both the 1st and the 3rd row, so I also need to choose which one to keep. For example, if the info1 of the 1st row is empty or contains only 2 characters, I will choose the info1 of the 3rd row.
I'm using Spark 2.2 + Scala 2.10.
I hope my question is clear enough. Thank you for your time.
Upvotes: 0
Views: 71
Reputation: 1181
Use RDD.map to transform each record into a standard format. Then write an aggregation function that merges the info columns; put your selection logic for the info columns there. Finally, aggregate the records by key (first_name, last_name), calling that aggregation function on the info columns.
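A minimal Scala sketch of that approach. The fixed-width offsets for row_type3, the input path, and the tie-breaking rule for the info columns are assumptions inferred from the question; adapt them to your real layout:

```scala
case class Person(first: String, last: String,
                  info1: String, info2: String, info3: String, info4: String)

// Keep `a` unless it is empty or has at most 2 characters, otherwise fall
// back to `b` -- the selection rule the question describes for info1.
def pickInfo(a: String, b: String): String =
  if (a == null || a.trim.length <= 2) b else a

// Normalize one raw line into ((first, last), Person); info columns a
// row type does not carry are left empty so the merge step can fill them.
def parse(line: String): ((String, String), Person) =
  if (line.startsWith("row_type1")) {
    val Array(_, first, last, i1, i2) = line.split(" ")
    ((first, last), Person(first, last, i1, i2, "", ""))
  } else if (line.startsWith("row_type2")) {
    val Array(_, last, first, i3, i2) = line.split(" ")
    ((first, last), Person(first, last, "", i2, i3, ""))
  } else {
    // row_type3 is fixed-width; these offsets match the sample line in the
    // question but are placeholders -- substitute the real column positions.
    val i4    = line.substring(9, 14)
    val i1    = line.substring(14, 19)
    val last  = line.substring(19, 29)
    val first = line.substring(29)
    ((first, last), Person(first, last, i1, "", "", i4))
  }

// Combine two records that share the same (first, last) key.
def merge(a: Person, b: Person): Person = Person(
  a.first, a.last,
  pickInfo(a.info1, b.info1), pickInfo(a.info2, b.info2),
  pickInfo(a.info3, b.info3), pickInfo(a.info4, b.info4))

// With these pieces the Spark job (given an existing SparkSession `spark`)
// would be roughly:
//   spark.sparkContext.textFile("input.txt")
//     .map(parse)
//     .reduceByKey(merge)
//     .values
//     .toDF()   // needs: import spark.implicits._
```

Because `merge` is commutative here (each info slot is either empty on one side or resolved by `pickInfo`), `reduceByKey` does the aggregation without pulling all rows for a key to one place first.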
Upvotes: 1