Reputation: 1711

How to delete columns in pyspark dataframe

>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint, id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]

There are two id: bigint and I want to delete one. How can I do?

Upvotes: 170

Answers (9)

Patrick C.

Reputation: 2411

Reading the Spark documentation I found an easier solution.

Since version 1.4 of spark there is a function drop(col) which can be used in pyspark on a dataframe.

You can use it in two ways

df.drop('age')
df.drop(df.age)

Pyspark Documentation - Drop

Upvotes: 206

kyramichel

Reputation: 491

Yes, it is possible to drop/select columns by slicing like this:

slice = data.columns[a:b]

data.select(slice).show()

Example:

newDF = spark.createDataFrame([
                           (1, "a", "4", 0), 
                            (2, "b", "10", 3), 
                            (7, "b", "4", 1), 
                            (7, "d", "4", 9)],
                            ("id", "x1", "x2", "y"))


slice = newDF.columns[1:3]
newDF.select(slice).show()

Use select method to get features column:

features = newDF.columns[:-1]
newDF.select(features).show()

Use drop method to get last column:

last_col= newDF.drop(*features)
last_col.show()

Upvotes: 2

techgeek

Reputation: 43

You can delete column like this:

df.drop("column Name).columns

In your case :

df.drop("id").columns

If you want to drop more than one column you can do:

dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")

Upvotes: 0

New Contributer

Reputation: 509

Consider 2 dataFrames:

>>> aDF.show()
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

and

>>> bDF.show()
+---+----+
| id|datB|
+---+----+
|  2|  b2|
|  3|  b3|
|  4|  b4|
+---+----+

To accomplish what you are looking for, there are 2 ways:

1. Different joining condition. Instead of saying aDF.id == bDF.id

aDF.join(bDF, aDF.id == bDF.id, "outer")

Write this:

aDF.join(bDF, "id", "outer").show()
+---+----+----+
| id|datA|datB|
+---+----+----+
|  1|  a1|null|
|  3|  a3|  b3|
|  2|  a2|  b2|
|  4|null|  b4|
+---+----+----+

This will automatically get rid of the extra the dropping process.

2. Use Aliasing: You will lose data related to B Specific Id's in this.

>>> from pyspark.sql.functions import col
>>> aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()

+----+----+----+
|  id|datA|datB|
+----+----+----+
|   1|  a1|null|
|   3|  a3|  b3|
|   2|  a2|  b2|
|null|null|  b4|
+----+----+----+

Upvotes: -1

Clock Slave

Reputation: 7977

Adding to @Patrick's answer, you can use the following to drop multiple columns

columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)

Upvotes: 170

Aron Asztalos

Reputation: 854

You can use two way:

1: You just keep the necessary columns:

drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])

2: This is the more elegant way.

df = df.drop("col_name")

You should avoid the collect() version, because it will send to the master the complete dataset, it will take a big computing effort!

Upvotes: 25

ev.per.baryon

Reputation: 361

An easy way to do this is to user "select" and realize you can get a list of all columns for the dataframe, df, with df.columns

drop_list = ['a column', 'another column', ...]

df.select([column for column in df.columns if column not in drop_list])

Upvotes: 36

Yuri Brovman

Reputation: 1113

Maybe a little bit off topic, but here is the solution using Scala. Make an Array of column names from your oldDataFrame and delete the columns that you want to drop ("colExclude"). Then pass the Array[Column] to select and unpack it.

val columnsToKeep: Array[Column] = oldDataFrame.columns.diff(Array("colExclude"))
                                               .map(x => oldDataFrame.col(x))
val newDataFrame: DataFrame = oldDataFrame.select(columnsToKeep: _*)

Upvotes: 4

karlson

Reputation: 5433

You could either explicitly name the columns you want to keep, like so:

keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]

Or in a more general approach you'd include all columns except for a specific one via a list comprehension. For example like this (excluding the id column from b):

keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']

Finally you make a selection on your join result:

d = a.join(b, a.id==b.id, 'outer').select(*keep)

Upvotes: 14

How to delete columns in pyspark dataframe

Answers (9)

Related Questions