Vaquez Vincent

Reputation: 55

Quick way to delete empty column [PySpark]

Is there an easy way to drop empty columns of a huge dataset (300+ columns, >100k rows) in PySpark? Something like df.dropna(axis=1, how='all') in pandas.

Upvotes: 1

Views: 5181

Answers (2)

Eda

Reputation: 1

Here is an extended version of @pissall's function:

import pyspark.sql.functions as F

def drop_null_columns(df, threshold=-1):
    """
    Drop columns that contain null values.
    If threshold is negative (default), drop columns that contain only null values.
    If threshold is >= 0, drop columns whose null count exceeds the threshold.
    This may be very computationally expensive!
    Returns a PySpark DataFrame.
    """
    if threshold < 0:
        # max() ignores nulls, so the max is null only for columns that are entirely null
        max_per_column = df.select([F.max(c).alias(c) for c in df.columns]).collect()[0].asDict()
        to_drop = [k for k, v in max_per_column.items() if v is None]
    else:
        # Count nulls per column in one pass and drop those above the threshold
        null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
        to_drop = [k for k, v in null_counts.items() if v > threshold]
    return df.drop(*to_drop)
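
For example, a minimal usage sketch (assuming the sample DataFrame df from @pissall's answer below):

# Default: drop only the columns that are entirely null
df_clean = drop_null_columns(df)

# threshold=0 drops any column containing at least one null
# (same behavior as @pissall's original function)
df_clean = drop_null_columns(df, threshold=0)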

Upvotes: 0

pissall

Reputation: 7419

Yes, you can simply use the answer from here. I've added a threshold parameter to it:

import pandas as pd
import pyspark.sql.functions as F

# Sample data: built in pandas, then converted to a Spark DataFrame
# (assumes an existing SQLContext named sqlContext)
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(df)
df.show()

def drop_null_columns(df, threshold=0):
    """
    Drop all columns whose null count exceeds the threshold.
    :param df: A PySpark DataFrame
    :param threshold: Maximum number of nulls a column may contain before it is dropped
    """
    # Count nulls per column in a single pass, then drop the offenders
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > threshold]
    return df.drop(*to_drop)

# Drops column x2, because it contains a null value
drop_null_columns(df).show()

Output

+---+---+
| x1| x3|
+---+---+
|  a|  c|
|  1|  0|
|  2|  3|
+---+---+

Column x2 has been dropped.

To drop only columns that are entirely null, pass threshold=df.count() - 1 (a column is dropped only when its null count exceeds the threshold).
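
A quick sketch of that call (assuming the same df as above):

# Drop only columns where every row is null; with the sample data,
# x2 has 1 null out of 3 rows, so nothing is dropped here
drop_null_columns(df, threshold=df.count() - 1).show()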

Upvotes: 1
