jason_1093

Reputation: 687

Drop rows containing specific value in PySpark dataframe

I have a pyspark dataframe like:

A    B    C
1    NA   9
4    2    5
6    4    2
5    1    NA

I want to delete rows which contain the value "NA", in this case the first and the last row. How can I implement this using Python and Spark?


Update based on comment: I am looking for a solution that removes rows that have the string "NA" in any of the (many) columns.

Upvotes: 9

Views: 51788

Answers (3)

Ghias Ali

Reputation: 327

In case you want to remove the rows (keep only rows where none of the checked columns equal 'NA'):

df = df.filter((df.A != 'NA') & (df.B != 'NA'))

But sometimes we need to replace the value with the mean (for a numeric column) or the most frequent value (for a categorical column). For that you overwrite the column with a new column of the same name, i.e. "A":

from pyspark.sql.functions import mean, col, when
# mean() is an aggregate, so compute it first and use the result as a literal
mean_A = df.filter(df.A != 'NA').select(mean(col('A'))).first()[0]
df = df.withColumn("A", when(df.A == "NA", mean_A).otherwise(df.A))
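
The answer mentions the most-frequent-value case for categorical columns but does not show it; here is a minimal sketch (my addition, not part of the original answer), assuming a categorical string column B with 'NA' as the missing-value marker:

from pyspark.sql.functions import when

# Sketch: replace 'NA' in categorical column B with its most frequent value.
# Column name B and the 'NA' marker are assumptions taken from the question's example.
mode_B = (df.filter(df.B != 'NA')
            .groupBy('B').count()
            .orderBy('count', ascending=False)
            .first()[0])
df = df.withColumn("B", when(df.B == "NA", mode_B).otherwise(df.B))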

Upvotes: 0

Ged

Reputation: 18013

In Scala I would do this differently, but I arrived at the following using pyspark. It is not my favourite answer, owing to my more limited pyspark knowledge; things seem easier in Scala. Unlike with an array, there is no global match against all columns that can stop as soon as one match is found. The approach is dynamic in terms of the number of columns.

This assumes the data does not contain ~~ as part of any value; I could have split to an array but decided not to here. It uses None instead of NA.

from pyspark.sql import functions as f

data = [(1,    None,    4,    None),
        (2,    'c',     3,    'd'),
        (None, None,    None, None),
        (3,    None,    None, 'z')]
df = spark.createDataFrame(data, ['k', 'v1', 'v2', 'v3'])

columns = df.columns
columns_Count = len(df.columns)

# concat_ws skips nulls, so colCompare (a String) has fewer '~~'-separated
# parts for rows that contain a null in any column.
df2 = df.select(df['*'], f.concat_ws('~~', *columns).alias('colCompare'))
# Keep only rows where the number of parts equals the number of columns,
# i.e. rows with no nulls at all.
df3 = df2.filter(f.size(f.split(f.col("colCompare"), r"~~")) == columns_Count).drop("colCompare")
df3.show()

returns:

+---+---+---+---+
|  k| v1| v2| v3|
+---+---+---+---+
|  2|  c|  3|  d|
+---+---+---+---+
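
As an aside (my addition, not part of the original answer): since this example uses real nulls rather than the string "NA", Spark's built-in null-dropping helper gives the same result for this data; a minimal sketch:

# Drops every row that contains a null in any column (equivalent to df3 above
# for this example data); it does not help with the literal string 'NA'.
df.na.drop().show()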

Upvotes: 0

cronoik

Reputation: 19365

Just use a dataframe filter expression:

l = [('1','NA','9')
    ,('4','2', '5')
    ,('6','4','2')
    ,('5','NA','1')]
df = spark.createDataFrame(l,['A','B','C'])
#The following command requires that the checked columns are strings!
df = df.filter((df.A != 'NA') & (df.B != 'NA') & (df.C != 'NA'))
df.show()

+---+---+---+ 
|  A|  B|  C| 
+---+---+---+ 
|  4|  2|  5| 
|  6|  4|  2| 
+---+---+---+

@bluephantom: In case you have hundreds of columns, just generate a string expression via a list comprehension:

#In my example all columns need to be checked
listOfRelevantStringColumns = df.columns
expr = ' and '.join('(%s != "NA")' % col_name for col_name in listOfRelevantStringColumns)
df.filter(expr).show()
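
An equivalent approach (my addition, not part of the original answer) builds the same filter from Column objects instead of a SQL string, for example with functools.reduce:

from functools import reduce
from operator import and_

# AND together one (col != 'NA') condition per column to be checked.
condition = reduce(and_, [(df[c] != 'NA') for c in listOfRelevantStringColumns])
df.filter(condition).show()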

Upvotes: 13
