data_addict

Reputation: 894

Filter valid and invalid records in Spark

I have a PySpark DataFrame with 'n' rows, each row having a single column named result.

The content of the result column is a JSON string, for example:

{"crawlDate": "2019-07-03 20:03:44", "Code": "200", "c1": "XYZ", "desc": "desc",  "attributes": {"abc":123, "def":456}}
{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1"}
{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1", "attributes": {"abc":456, "def":123}}

df.show():

(screenshot of the df.show() output)

Now I want to check how many records (rows) have the attributes element and how many don't.

I tried the array_contains, filter and explode functions in Spark, but I didn't get the expected results.

Any suggestions please?

Upvotes: 0

Views: 1098

Answers (2)

data_addict

Reputation: 894

The simple logic below worked after a lot of struggle:

total_count = old_df.count()
# contains() is a substring match on the raw JSON string in the result column
new_df = old_df.filter(old_df.result.contains("attributes"))
success_count = new_df.count()
failure_count = total_count - success_count
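Note that `contains` is a plain substring match on the raw JSON string, so it counts any row whose text mentions "attributes" anywhere, not only rows where it is a top-level key. A quick plain-Python sanity check of the same counting logic on the three sample records from the question (a sketch, not Spark code; a parsed key check is the stricter equivalent):

```python
import json

# The three sample records from the question, as raw JSON strings
# (the same shape as the `result` column).
records = [
    '{"crawlDate": "2019-07-03 20:03:44", "Code": "200", "c1": "XYZ", "desc": "desc", "attributes": {"abc": 123, "def": 456}}',
    '{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1"}',
    '{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1", "attributes": {"abc": 456, "def": 123}}',
]

total_count = len(records)
# Parse each record and check for a top-level "attributes" key,
# the strict version of what contains("attributes") approximates.
success_count = sum(1 for r in records if "attributes" in json.loads(r))
failure_count = total_count - success_count

print(success_count, failure_count)  # 2 records have attributes, 1 does not
```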

Upvotes: 0

Ravi

Reputation: 470

import org.apache.spark.sql.functions._

df.select(get_json_object($"result", "$.attributes").alias("attributes"))
  .filter(col("attributes").isNotNull)
  .count()

With this logic, we can get the count of records where attributes exists.

For reference, see https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html

Another solution: if your input is a JSON file, then

val df = spark.read.json("path of json file")
df.filter(col("attributes").isNotNull).count()

A similar API is available in Python (get_json_object lives in pyspark.sql.functions).
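If Spark itself is not at hand, what get_json_object($"result", "$.attributes") followed by the isNotNull filter computes can be sketched in plain Python with the standard json module, using the sample records from the question (the helper name here is my own, not a Spark API):

```python
import json

def extract_attributes(raw):
    """Mimic get_json_object(col, "$.attributes"): return the value
    under the top-level "attributes" key, or None if it is absent."""
    return json.loads(raw).get("attributes")

# The sample rows from the question.
rows = [
    '{"crawlDate": "2019-07-03 20:03:44", "Code": "200", "c1": "XYZ", "desc": "desc", "attributes": {"abc": 123, "def": 456}}',
    '{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1"}',
    '{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1", "attributes": {"abc": 456, "def": 123}}',
]

# select(...).filter(col(...).isNotNull).count() becomes:
# keep the non-None extractions and count them.
with_attributes = sum(1 for r in rows if extract_attributes(r) is not None)
print(with_attributes)  # 2
```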

Upvotes: 1

Related Questions