auxdx

Reputation: 2493

How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that?

PS: I want to check if it's empty so that I only save the DataFrame if it's not empty

Upvotes: 175

Views: 236068

Answers (18)

Alex Raj Kaliamoorthy

Reputation: 2095

My case was a bit different and I want to share it with you all. My DataFrame was delivered as empty, yet it contained a single record with null values: it should be considered empty, but the standard checks said it wasn't. Therefore I wrote the code below as a solution to my problem.

My Problem: When I issue df.count() I don't get 0 but 1, because of that record with null values. If I issue df.rdd.isEmpty() I get False.

The Solution:

from pyspark.sql.functions import col, when

def isDfEmpty(df):
    if df.count() == 1:  # when df has only one record
        # Treat empty strings as nulls, then drop rows where every column is null
        _df_ = df.select([when(col(c) == "", None).otherwise(col(c)).alias(c) for c in df.columns]).na.drop("all")
        return _df_.rdd.isEmpty()
    else:
        return False

isDfEmpty(df) #Replace df with your respective dataframe variable

Note: In my case I got only one record in the empty DataFrame. If that is not the case, please reconsider the if condition.
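
For reference, a minimal generalisation of the same idea that is not limited to the one-record case (a sketch: the helper name is mine, and it assumes "empty" means no rows remain once empty strings are treated as nulls):

from pyspark.sql.functions import col, when

def is_effectively_empty(df):
    # Normalize: treat empty strings as nulls in every column
    normalized = df.select(
        [when(col(c) == "", None).otherwise(col(c)).alias(c) for c in df.columns]
    )
    # Drop rows where all columns are null, then test for emptiness
    return normalized.na.drop("all").rdd.isEmpty()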

Upvotes: 0

Yasin Uygun

Reputation: 59

If you only want to find out whether the DataFrame is empty, then df.isEmpty, df.head(1).isEmpty, or df.rdd.isEmpty() should work; each of these takes a limit(1) if you examine its physical plan:

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#52L])
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#60L])
   +- *(2) GlobalLimit 1
      +- Exchange SinglePartition
         +- *(1) LocalLimit 1
            ... // the rest of the plan related to your computation

But if you are doing some other computation that requires a lot of memory and you don't want to cache your DataFrame just to check whether it is empty, then you can use an accumulator:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.util.LongAccumulator

def accumulateRows(acc: LongAccumulator)(df: DataFrame): DataFrame =
  df.map { row => // map each row to itself, counting during this map
    acc.add(1)
    row
  }(RowEncoder(df.schema))

val rowAccumulator = spark.sparkContext.longAccumulator("Row Accumulator")
val countedDF = df.transform(accumulateRows(rowAccumulator))
countedDF.write.saveAsTable(...) // main action
val isEmpty = rowAccumulator.isZero

Note that to see the row count, you have to perform the action first. If we swap the order of the last two lines, isEmpty will be true regardless of the computation.
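
For PySpark, a rough equivalent of the accumulator trick might look like this (a sketch, not a drop-in: the detour through the RDD API and spark.createDataFrame is my assumption, and the table name is a placeholder):

row_acc = spark.sparkContext.accumulator(0)

def accumulate_rows(df):
    def tally(row):
        row_acc.add(1)  # count rows as they stream past
        return row
    return spark.createDataFrame(df.rdd.map(tally), df.schema)

counted_df = accumulate_rows(df)
counted_df.write.saveAsTable("some_table")  # main action first (hypothetical table name)
is_empty = row_acc.value == 0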

Upvotes: 1

user3370741

Reputation: 272

PySpark 3.3.0+ / Spark 2.4.0+ (Scala)

df.isEmpty()
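
Tying this back to the question's goal of saving only when non-empty, a short PySpark sketch (the output path is a placeholder):

if not df.isEmpty():  # PySpark 3.3.0+
    df.write.parquet("/tmp/output")  # hypothetical path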

Upvotes: 21

Adelholzener

Reputation: 71

If you are using PySpark, you could also do:

len(df.head(1)) > 0

Upvotes: 7

Joy Jedidja Ndjama

Reputation: 582

Let's suppose we have the following empty dataframe:

df = spark.sql("show tables").limit(0)

If you are using Spark 2.1 with PySpark, you can check whether this DataFrame is empty with:

df.count() > 0

Or

bool(df.head(1))

Upvotes: -2

aName

Reputation: 3063

I had the same question, and I tested 3 main solutions:

  1. (df != null) && (df.count > 0)
  2. df.head(1).isEmpty as @hulin003 suggests
  3. df.rdd.isEmpty() as @Justin Pihony suggests

Of course all 3 work; however, in terms of performance, here is what I found when executing these methods on the same DF on my machine, measured by execution time:

  1. it takes ~9366ms
  2. it takes ~5607ms
  3. it takes ~1921ms

Therefore I think that the best solution is df.rdd.isEmpty(), as @Justin Pihony suggests.
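
If you want to reproduce such a comparison yourself, here is a minimal PySpark timing sketch (timings depend heavily on the data, caching, and the cluster, so treat the figures above as one data point):

import time

def time_check(label, fn):
    start = time.time()
    result = fn()
    print(f"{label}: {result} ({(time.time() - start) * 1000:.0f} ms)")

time_check("count > 0", lambda: df.count() > 0)
time_check("head(1) empty", lambda: len(df.head(1)) == 0)
time_check("rdd.isEmpty", lambda: df.rdd.isEmpty())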

Upvotes: 41

Jordan Morris

Reputation: 2301

dataframe.limit(1).count > 0

This also triggers a job, but since we are selecting only a single record, the time consumption can be much lower even at billion-record scale.

From: https://medium.com/checking-emptiness-in-distributed-objects/count-vs-isempty-surprised-to-see-the-impact-fa70c0246ee0
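
The PySpark spelling of the same check would simply be (note the parentheses on count):

df.limit(1).count() > 0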

Upvotes: -2

Shaido

Reputation: 28392

In Scala you can use implicits to add the methods isEmpty() and nonEmpty() to the DataFrame API, which will make the code a bit nicer to read.

import org.apache.spark.sql.DataFrame

object DataFrameExtensions {
  implicit def extendedDataFrame(dataFrame: DataFrame): ExtendedDataFrame =
    new ExtendedDataFrame(dataFrame)

  class ExtendedDataFrame(dataFrame: DataFrame) {
    def isEmpty(): Boolean = dataFrame.head(1).isEmpty // Any implementation can be used
    def nonEmpty(): Boolean = !isEmpty()
  }
}

Here, other methods can be added as well. To use the implicit conversion, use import DataFrameExtensions._ in the file where you want the extended functionality. Afterwards, the methods can be used directly, like so:

val df: DataFrame = ...
if (df.isEmpty) {
  // Do something
}

Upvotes: 4

Bose

Reputation: 41

In PySpark, you can also use bool(df.head(1)) to obtain a True or False value.

It returns False if the dataframe contains no rows

Upvotes: 4

Beryllium

Reputation: 13008

Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:

def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

Note that a DataFrame is no longer a class in Scala; it's just a type alias (this probably changed with Spark 2.0):

type DataFrame = Dataset[Row]

Upvotes: 22

Nandakishore

Reputation: 1011

If you do df.count > 0, it takes the counts from all partitions across all executors and adds them up at the driver. This takes a while when you are dealing with millions of rows.

A cheaper way is to perform df.take(1) and check whether the returned array is empty. Be aware that df.head() and df.first() throw a java.util.NoSuchElementException on an empty DataFrame instead of returning an empty row, so it is better to put a try around those calls.

Upvotes: 11

Abdennacer Lachiheb

Reputation: 4888

For Java users, you can use this on a Dataset:

public boolean isDatasetEmpty(Dataset<Row> ds) {
    boolean isEmpty;
    try {
        isEmpty = ((Row[]) ds.head(1)).length == 0;
    } catch (Exception e) {
        return true;
    }
    return isEmpty;
}

This checks all possible scenarios (empty, null).

Upvotes: 6

Shekhar Koirala

Reputation: 186

I found that in some cases:

>>> print(type(df))
<class 'pyspark.sql.dataframe.DataFrame'>

>>> df.take(1).isEmpty
AttributeError: 'list' object has no attribute 'isEmpty'

The same happens with "length", or if you replace take() with head().

[Solution] For the issue, we can use:

>>> df.limit(1).count() > 0
False

Upvotes: 0

hulin003

Reputation: 2685

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with Python equivalent:

len(df.head(1)) == 0  # or bool(df.head(1))
len(df.take(1)) == 0  # or bool(df.take(1))

Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

def first(): T = head()
def head(): T = head(1).head

head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.

take(n) is also equivalent to head(n)...

def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException exception when the DataFrame is empty.

df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty

I know this is an older question so hopefully it will help someone using a newer version of Spark.

Upvotes: 220

Justin Pihony

Reputation: 67135

I would say to just grab the underlying RDD. In Scala:

df.rdd.isEmpty

in Python:

df.rdd.isEmpty()

That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?

Upvotes: 64

sYer Wang

Reputation: 1

You can do it like this:

val df = sqlContext.emptyDataFrame
// Note: eq compares object references, so this only matches the very same
// emptyDataFrame instance; it will not detect other empty DataFrames.
if (df.eq(sqlContext.emptyDataFrame))
  println("empty df")
else
  println("normal df")

Upvotes: -2

Gopi A

Reputation: 9

df1.take(1).length > 0

The take method returns an array of rows, so if the array size is zero, there are no records in df.

Upvotes: -1

Rohan Aletty

Reputation: 2442

You can take advantage of the head() (or first()) functions to see if the DataFrame has at least one row. If so, it is not empty.
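
A minimal PySpark sketch of that idea (the helper name is mine; note that PySpark's head(1) returns a list, while head() with no argument returns None on an empty DataFrame rather than raising):

def has_rows(df):
    # head(1) returns a list with at most one Row
    return len(df.head(1)) == 1

print(has_rows(df))  # True when the DataFrame has at least one row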

Upvotes: 16
