Reputation: 894
I have a PySpark DataFrame with 'n' rows, and each row has a single column, result.
The content of the result column is a JSON string, for example:
{"crawlDate": "2019-07-03 20:03:44", "Code": "200", "c1": "XYZ", "desc": "desc", "attributes": {"abc":123, "def":456}}
{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1"}
{"crawlDate": "2019-07-04 20:03:44", "Code": "200", "c1": "ABC", "desc": "desc1", "attributes": {"abc":456, "def":123}}
Now I want to check how many records (rows) have the attributes element and how many don't.
I tried the array_contains, filter, and explode functions in Spark, but didn't get the expected results.
Any suggestions please?
Upvotes: 0
Views: 1098
Reputation: 894
After a lot of struggle, the simple logic below worked:
total_count = old_df.count()                                  # all rows
new_df = old_df.filter(old_df.result.contains("attributes"))  # rows whose JSON text contains "attributes"
success_count = new_df.count()                                # rows with the attributes element
failure_count = total_count - success_count                   # rows without it
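One caveat: contains("attributes") does a plain substring match on the JSON text, so it would also count a row where the word "attributes" merely appears inside a value. If that matters, parsing the field with get_json_object avoids it, and both counts can be computed in a single pass. A minimal sketch, reusing old_df and the result column from above:

from pyspark.sql import functions as F

# get_json_object returns null when "$.attributes" is absent,
# so counting non-nulls vs. nulls separates the two groups in one job
counts = old_df.agg(
    F.count(F.when(F.get_json_object("result", "$.attributes").isNotNull(), 1)).alias("success_count"),
    F.count(F.when(F.get_json_object("result", "$.attributes").isNull(), 1)).alias("failure_count"),
).first()

print(counts["success_count"], counts["failure_count"])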
Upvotes: 0
Reputation: 470
import org.apache.spark.sql.functions._

// get_json_object returns null when "$.attributes" is absent from the JSON
df.select(get_json_object($"result", "$.attributes").alias("attributes"))
  .filter(col("attributes").isNotNull)
  .count()
With this logic, we can get the count of records where attributes exists.
For reference, see https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html
Another solution: if your input is already in JSON format, then
// schema is inferred from the file; rows without "attributes" get null in that column
val df = spark.read.json("path of json file")
df.filter(col("attributes").isNotNull).count()
A similar API is available in Python.
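For completeness, a rough PySpark equivalent of the two snippets above (same column names assumed; the file path is a placeholder):

from pyspark.sql.functions import col, get_json_object

# Variant 1: result holds a JSON string
count_with_attr = (
    df.select(get_json_object(col("result"), "$.attributes").alias("attributes"))
    .filter(col("attributes").isNotNull())
    .count()
)

# Variant 2: read the JSON file directly; absent keys surface as null columns
json_df = spark.read.json("path of json file")
json_df.filter(col("attributes").isNotNull()).count()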
Upvotes: 1