Reputation: 576
I have to get rid of columns that don't add information to my dataset, i.e. columns with the same values in all the entries.
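I devised two ways of doing this:
from pyspark.sql import functions as F

# Option 1: a column is constant if its minimum equals its maximum
for col in df.columns:
    if df.agg(F.min(col)).collect()[0][0] == df.agg(F.max(col)).collect()[0][0]:
        df = df.drop(col)

# Option 2: a column is constant if it has exactly one distinct value
for col in df.columns:
    if df.select(col).distinct().count() == 1:
        df = df.drop(col)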
Is there a better, faster, or more straightforward way to do this?
Upvotes: 2
Views: 4228
Reputation: 7207
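I prefer to use the subtract method to check whether two DataFrames contain the same rows.
df1 = ...  # Sample DataFrame #1
df2 = ...  # Sample DataFrame #2
# subtract works like EXCEPT DISTINCT, so duplicate rows are ignored
assert 0 == df1.subtract(df2).count()
assert 0 == df2.subtract(df1).count()
Another way is to check the union. Note that union keeps duplicates, so the result has to be de-duplicated before counting.
assert df1.distinct().count() == df1.union(df2).distinct().count()
assert df2.distinct().count() == df1.union(df2).distinct().count()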
Upvotes: 0
Reputation: 179
df = df.drop(*(col for col in df.columns if df.select(col).distinct().count() == 1))
Upvotes: 5