Reputation: 245
Spark Version: 2.1
Scala Version: 2.11
I have a DataFrame with the following structure before writing it out to a Parquet file. It has a lot of other columns, but I cut it down to only 2 columns for clarity:
+---+--------------------+
|day|           table_row|
+---+--------------------+
| 8|[,129,,,,,J,WENDI...|
| 8|[_DELETE_THIS_,_D...|
| 8|[_DELETE_THIS_,_D...|
...and the schema looks like this:
root
 |-- day: long (nullable = true)
 |-- table_row: struct (nullable = true)
 |    |-- DATE: string (nullable = true)
 |    |-- ADMISSION_NUM: string (nullable = true)
 |    |-- SOURCE_CODE: string (nullable = true)
etc..
'table_row' has over 100 data elements and I only posted a snippet. During processing I had to create a couple of dummy rows with every field populated with "_DELETE_THIS_". For every normal row I have 2 dummy rows. Now I am trying to filter these dummy rows out of the DataFrame and write only the valid rows, but I have not been able to do so by any means. I tried a couple of approaches but couldn't find a proper solution. Can someone help me with this?
Thanks Qubiter
Upvotes: 0
Views: 2980
Reputation: 41957
You can use the filter function. You can filter on any field of table_row since, as you said, every field in the dummy rows is populated with _DELETE_THIS_:

val finalDF = df.filter($"table_row.DATE" =!= "_DELETE_THIS_")

Here $"table_row.DATE" is how you refer to the DATE element of the struct column.
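
Not part of the original answer, but here is a minimal, self-contained sketch of the same approach, assuming a local SparkSession and a table_row struct reduced to the three fields shown in the schema snippet; the sample values are made up for illustration:

import org.apache.spark.sql.SparkSession

object FilterDummyRows {
  // Simplified struct with only the fields shown in the question's schema snippet
  case class TableRow(DATE: String, ADMISSION_NUM: String, SOURCE_CODE: String)
  case class Record(day: Long, table_row: TableRow)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-dummy-rows")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Record(8, TableRow("2017-08-01", "129", "J")),                          // valid row (illustrative values)
      Record(8, TableRow("_DELETE_THIS_", "_DELETE_THIS_", "_DELETE_THIS_")), // dummy row
      Record(8, TableRow("_DELETE_THIS_", "_DELETE_THIS_", "_DELETE_THIS_"))  // dummy row
    ).toDF()

    // Keep only rows whose struct field DATE is not the dummy marker
    val finalDF = df.filter($"table_row.DATE" =!= "_DELETE_THIS_")

    finalDF.show(false)
    spark.stop()
  }
}

One thing to keep in mind: =!= evaluates to null (and the row is therefore dropped) when DATE itself is null, so if valid rows can have a null DATE, filter on a field that is always populated or add an explicit isNull check.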
I hope the answer is helpful.
Upvotes: 1