Reputation: 319
I'm trying to find missing and null values in my dataframe, but I'm getting an exception. I have included only the first few fields of the schema below:
root
|-- created_at: string (nullable = true)
|-- id: long (nullable = true)
|-- id_str: string (nullable = true)
|-- text: string (nullable = true)
|-- display_text_range: string (nullable = true)
|-- source: string (nullable = true)
|-- truncated: boolean (nullable = true)
|-- in_reply_to_status_id: double (nullable = true)
|-- in_reply_to_status_id_str: string (nullable = true)
|-- in_reply_to_user_id: double (nullable = true)
|-- in_reply_to_user_id_str: string (nullable = true)
|-- in_reply_to_screen_name: string (nullable = true)
|-- geo: double (nullable = true)
|-- coordinates: double (nullable = true)
|-- place: double (nullable = true)
|-- contributors: string (nullable = true)
Here is the code that throws the exception while counting the missing and null values:
from pyspark.sql.functions import col, count, isnan, when

df_mis = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])
df_mis.show()
Here are the AnalysisException details:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-20-6ccaacbbcc7f> in <module>()
----> 1 df_mis = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])
2 df_mis.show()
/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/dataframe.py in select(self, *cols)
1683 [Row(name='Alice', age=12), Row(name='Bob', age=15)]
1684 """
-> 1685 jdf = self._jdf.select(self._jcols(*cols))
1686 return DataFrame(jdf, self.sql_ctx)
1687
/content/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
1308 answer = self.gateway_client.send_command(command)
1309 return_value = get_return_value(
-> 1310 answer, self.gateway_client, self.target_id, self.name)
1311
1312 for temp_arg in temp_args:
/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
AnalysisException: Can't extract value from place#14: need struct type but got double
Upvotes: 3
Views: 21179
Reputation: 319
I solved this issue by replacing the dots (".") in the column names with underscores. I found the following Stack Overflow post very helpful. To quote from it, "The error is there because (.)dot is used to access a struct field":
Extracting value from data frame thorws error because of the . in the column name in spark
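Below is a minimal sketch of the fix, assuming the DataFrame is named df as in the question; the helper names df_clean and double_cols are mine. Renaming the columns stops col(...) from parsing a dotted name as a struct-field access. I also apply isnan() only to float/double columns, since it is not defined for booleans and the other non-numeric types in the schema:

from pyspark.sql.functions import col, count, isnan, when
from pyspark.sql.types import DoubleType, FloatType

# Replace dots with underscores so col(...) treats each name as a
# plain column reference rather than a struct-field access.
df_clean = df.toDF(*[c.replace(".", "_") for c in df.columns])

# isnan() only makes sense on floating-point columns; count plain
# nulls everywhere else.
double_cols = {f.name for f in df_clean.schema.fields
               if isinstance(f.dataType, (DoubleType, FloatType))}

df_mis = df_clean.select([
    count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    if c in double_cols
    else count(when(col(c).isNull(), c)).alias(c)
    for c in df_clean.columns
])
df_mis.show()

An alternative, if you want to keep the original names, is to escape the dotted name with backticks, e.g. col("`a.b`"), which Spark also accepts; renaming just keeps the rest of the pipeline simpler.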
Upvotes: 7