mommomonthewind

Reputation: 4640

pyspark: drop columns that have same values in all rows

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?

So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.

However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?

Thanks

Upvotes: 5

Views: 3172

Answers (2)

cph_sto

Reputation: 7597

You can use the approx_count_distinct function (link) to count the number of distinct elements in a column. If a column has just one distinct value, drop it.

Creating the DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import approx_count_distinct

spark = SparkSession.builder.getOrCreate()
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = spark.createDataFrame(myValues, ['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
|     1|     2|     2|     0|
|     2|     2|     2|     0|
|     3|     2|     2|     0|
|     4|     2|     2|     0|
|     3|     1|     2|     0|
+------+------+------+------+

Counting the number of distinct elements per column and converting the result into a dictionary.

count_distinct_df = df.select([approx_count_distinct(x).alias(x) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
|     4|     2|     1|     1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
    {'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}

# Store the columns that have just 1 distinct value.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
    ['value3', 'value4']

Drop the columns having just one distinct value

df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
|     1|     2|
|     2|     2|
|     3|     2|
|     4|     2|
|     3|     1|
+------+------+
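As a side note, the pandas round-trip in the middle step isn't strictly necessary: the counts DataFrame has a single row, so it can be collected straight into a dict with `count_distinct_df.collect()[0].asDict()`. A minimal sketch of the filtering step (the `counts` dict below is hard-coded from the example output above so it runs without a Spark session):

```python
# Distinct counts per column, as produced by approx_count_distinct above.
# With a live Spark session: counts = count_distinct_df.collect()[0].asDict()
counts = {'value1': 4, 'value2': 2, 'value3': 1, 'value4': 1}

# Keep only the columns whose distinct count is exactly 1.
constant_cols = [c for c, n in counts.items() if n == 1]
print(constant_cols)  # ['value3', 'value4']
```

The columns in `constant_cols` can then be dropped with `df.drop(*constant_cols)` exactly as in the answer.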

Upvotes: 1

Manoj Singh

Reputation: 1737

You can apply the countDistinct() aggregation function to each column to get the count of distinct values per column. A column with count = 1 has the same value in all rows.

from pyspark.sql.functions import col, countDistinct

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

# select the cols with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

# drop the selected columns
df.drop(*cols_to_drop).show()
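The column-selection step above is plain Python once the counts are collected, so it can be factored into a small reusable helper. A sketch (`constant_columns` is a hypothetical name, and the Spark calls are shown in comments since they need a live session):

```python
from typing import Dict, List

def constant_columns(distinct_counts: Dict[str, int]) -> List[str]:
    """Return the names of columns that have exactly one distinct value."""
    return [c for c, n in distinct_counts.items() if n == 1]

# With a Spark DataFrame df (sketch):
#   from pyspark.sql.functions import col, countDistinct
#   counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
#   df = df.drop(*constant_columns(counts))

print(constant_columns({'value1': 4, 'value2': 2, 'value3': 1, 'value4': 1}))
```

Note that countDistinct is exact but shuffles more data than approx_count_distinct; for the "is it constant?" check either works, since a count of 1 is reported exactly by both.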

Upvotes: 9
