qubiter

Reputation: 245

How to filter a dataframe column containing Array/Struct

Spark Version: 2.1

Scala Version: 2.11

I have a dataframe with the following structure before writing it out to a parquet file. It has a lot of other columns, but I cut it down to only 2 columns for clarity:

+---+--------------------+
|day|   table_row        |
+---+--------------------+
|  8|[,129,,,,,J,WENDI...|
|  8|[_DELETE_THIS_,_D...|
|  8|[_DELETE_THIS_,_D...|

...and the schema looks like this:

     root 
     |-- day: long (nullable = true)
     |-- table_row: struct (nullable = true)
     |    |-- DATE: string (nullable = true)
     |    |-- ADMISSION_NUM: string (nullable = true)
     |    |-- SOURCE_CODE: string (nullable = true)
etc.

'table_row' has over 100 data elements; I only posted a snippet. During processing I had to create a couple of dummy rows with every field populated with "_DELETE_THIS_". For every normal row there are 2 dummy rows. Now I am trying to filter these dummy rows out of the dataframe and write only the valid rows, but I have not been able to do that. I tried a couple of ways but couldn't find a proper solution. Can someone help me with this?
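For reference, here is a minimal sketch that reproduces the shape of the data (only 3 of the 100+ struct fields are shown, and the values are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Only a few of the struct fields, named after the schema above
case class TableRow(DATE: String, ADMISSION_NUM: String, SOURCE_CODE: String)

val df = Seq(
  (8L, TableRow(null, "129", "J")),                                  // valid row
  (8L, TableRow("_DELETE_THIS_", "_DELETE_THIS_", "_DELETE_THIS_")), // dummy row
  (8L, TableRow("_DELETE_THIS_", "_DELETE_THIS_", "_DELETE_THIS_"))  // dummy row
).toDF("day", "table_row")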

Thanks, Qubiter

Upvotes: 0

Views: 2980

Answers (1)

Ramesh Maharjan

Reputation: 41957

You can use the filter function. You can take any field of table_row, since you said that each field in the dummy rows is populated with _DELETE_THIS_:

import spark.implicits._  // brings the $"..." column syntax into scope (spark is your SparkSession)

val finalDF = df.filter($"table_row.DATE" =!= "_DELETE_THIS_")

Here $"table_row.DATE" is how you call DATE element of the struct column.

I hope the answer is helpful.

Upvotes: 1
