Machetes0602

Reputation: 366

Cannot get count() of PySpark dataframe after filtering

I have two pyspark.sql.dataframe.DataFrame objects that are initialized like so (generalized example):

df = spark.table(path)
filtered_df = df.filter(df["col1"].contains(value))

I can run the following code to determine the number of rows in df:

print(df.count())

This is successful.

I can attempt to do the same for filtered_df:

print(filtered_df.count())

However, this is NOT successful. I receive the following error message:

org.apache.spark.SparkException: Job aborted due to stage failure [...], most recent failure: [...]
com.databricks.sql.io.FileReadException: Error while reading file [...]

What I don't understand about the error is that filtering df should only produce a smaller dataset. So how can count() succeed on df but fail on filtered_df? An error while reading the file? Then how was df read in the first place, given that filtered_df is derived from the same data?
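For what it's worth, the filter call itself returns without error; Spark evaluates transformations lazily, so the underlying files are only read once an action such as count() forces execution. A minimal sketch of where the error actually surfaces:

filtered_df = df.filter(df["col1"].contains(value))  # lazy transformation: no files read, no error here
print(filtered_df.count())  # action: files are read now, and the FileReadException surfaces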

Upvotes: 1

Views: 1922

Answers (1)

DataFramed

Reputation: 1631

Answer:

NOTE: The code in your question works for me on Databricks, assuming value is a single string or number. That said, you can try the approaches below to filter the DataFrame on a specific column.

Solution 1:

Replace this section of your code:

filtered_df = df.filter(df["col1"].contains(value))

with the code below, and it should give you the desired count:

from pyspark.sql.functions import col
filtered_df = df.filter(col("col1").contains(value))
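As a quick sanity check, here is a minimal, self-contained version of the same pattern on toy data (the sample rows and the search string are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for spark.table(path)
df = spark.createDataFrame(
    [("apple pie",), ("banana bread",), ("cherry tart",)],
    ["col1"],
)

value = "an"  # substring to search for
filtered_df = df.filter(col("col1").contains(value))
print(filtered_df.count())  # 1, since only "banana bread" contains "an"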

Solution 2:

"You can also use below code to filter:"

filtered_df = df.filter(df.col1.isin(value))
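One caveat: isin tests for exact equality against the listed value(s), not substring containment, so it only matches rows whose col1 equals value exactly. Reusing the toy DataFrame from the sketch above:

df.filter(df.col1.isin("banana bread")).count()  # 1: exact match on the full value
df.filter(df.col1.isin("an")).count()  # 0: "an" is only a substring, so isin does not match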

Solution 3:

You can also use the where function to filter SQL-style:

import pyspark.sql.functions as f

filtered_df = df.where(f.col("col1").contains(value))
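Note that where is just an alias for filter, so the two are interchangeable. Both also accept a SQL expression string, which reads like a WHERE clause; a small sketch with a hard-coded pattern (substitute your own search string):

filtered_df = df.where("col1 LIKE '%an%'")  # SQL-expression form of the same contains-style filter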

Upvotes: 1
