Reputation: 366
I have two pyspark.sql.dataframe.DataFrame objects that are initialized like so (generalized example):
df = spark.table(path)
filtered_df = df.filter(df[col1].contains(value))
I can run the following code to determine the number of rows in df:
print(df.count())
This is successful.
I can attempt to do the same for filtered_df:
print(filtered_df.count())
However, this is NOT successful. I receive the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure [...], most recent failure: [...]
com.databricks.sql.io.FileReadException: Error while reading file [...]
The problem I have with understanding the error is that the filtering done on df should result in a smaller dataset. Thus, what could possibly be the problem such that we can apply count() on df but not on filtered_df? An error while reading the file? Then how was df read, which contains the same data as filtered_df?
Upvotes: 1
Views: 1922
Reputation: 1631
NOTE: The code above works for me on Databricks, under the assumption that value is a single string or number.
You can try the routes below to filter the DataFrame on a specific column.
Replace this section of your code:
filtered_df = df.filter(df[col1].contains(value))
with the code below, and it should give you the desired count:
from pyspark.sql.functions import col
filtered_df = df.filter(col("col1").contains(value))
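Note that col("col1") refers to the column by its literal name; if col1 in your snippet is a Python variable holding the column name, pass the variable through instead, i.e. col(col1).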
"You can also use below code to filter:"
filtered_df = df.filter(df.col1.isin(value))
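Keep in mind that isin(value) checks for exact equality against value, whereas contains(value) performs a substring match, so the two filters can return different counts on the same data.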
You can also use the where function to filter, SQL-style:
import pyspark.sql.functions as f
filtered_df = df.where(f.col("col1").contains(value))
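For reference, here is a minimal, self-contained sketch contrasting the three variants on a toy DataFrame; the column name col1 and the value "apple" are hypothetical placeholders, not taken from the question:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data for illustration only
df = spark.createDataFrame(
    [("apple pie",), ("banana",), ("apple tart",)],
    ["col1"],
)
value = "apple"

# Substring match: keeps "apple pie" and "apple tart"
print(df.filter(f.col("col1").contains(value)).count())  # 2

# Exact match: no row equals "apple" exactly
print(df.filter(df.col1.isin(value)).count())  # 0

# where() is an alias for filter(), so this also prints 2
print(df.where(f.col("col1").contains(value)).count())  # 2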
Upvotes: 1