Reputation: 1711
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint, id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
There are two id: bigint
and I want to delete one. How can I do?
Upvotes: 170
Views: 436640
Reputation: 2411
Reading the Spark documentation I found an easier solution.
Since version 1.4 of spark there is a function drop(col)
which can be used in pyspark on a dataframe.
You can use it in two ways
df.drop('age')
df.drop(df.age)
Upvotes: 206
Reputation: 491
Yes, it is possible to drop/select columns by slicing like this:
slice = data.columns[a:b]
data.select(slice).show()
Example:
newDF = spark.createDataFrame([
(1, "a", "4", 0),
(2, "b", "10", 3),
(7, "b", "4", 1),
(7, "d", "4", 9)],
("id", "x1", "x2", "y"))
slice = newDF.columns[1:3]
newDF.select(slice).show()
Use select method to get features column:
features = newDF.columns[:-1]
newDF.select(features).show()
Use drop method to get last column:
last_col= newDF.drop(*features)
last_col.show()
Upvotes: 2
Reputation: 43
You can delete column like this:
df.drop("column Name).columns
In your case :
df.drop("id").columns
If you want to drop more than one column you can do:
dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")
Upvotes: 0
Reputation: 509
Consider 2 dataFrames:
>>> aDF.show()
+---+----+
| id|datA|
+---+----+
| 1| a1|
| 2| a2|
| 3| a3|
+---+----+
and
>>> bDF.show()
+---+----+
| id|datB|
+---+----+
| 2| b2|
| 3| b3|
| 4| b4|
+---+----+
To accomplish what you are looking for, there are 2 ways:
1. Different joining condition. Instead of saying aDF.id == bDF.id
aDF.join(bDF, aDF.id == bDF.id, "outer")
Write this:
aDF.join(bDF, "id", "outer").show()
+---+----+----+
| id|datA|datB|
+---+----+----+
| 1| a1|null|
| 3| a3| b3|
| 2| a2| b2|
| 4|null| b4|
+---+----+----+
This will automatically get rid of the extra the dropping process.
2. Use Aliasing: You will lose data related to B Specific Id's in this.
>>> from pyspark.sql.functions import col
>>> aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()
+----+----+----+
| id|datA|datB|
+----+----+----+
| 1| a1|null|
| 3| a3| b3|
| 2| a2| b2|
|null|null| b4|
+----+----+----+
Upvotes: -1
Reputation: 7967
Adding to @Patrick's answer, you can use the following to drop multiple columns
columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)
Upvotes: 170
Reputation: 844
You can use two way:
1: You just keep the necessary columns:
drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])
2: This is the more elegant way.
df = df.drop("col_name")
You should avoid the collect() version, because it will send to the master the complete dataset, it will take a big computing effort!
Upvotes: 25
Reputation: 361
An easy way to do this is to user "select
" and realize you can get a list of all columns
for the dataframe
, df
, with df.columns
drop_list = ['a column', 'another column', ...]
df.select([column for column in df.columns if column not in drop_list])
Upvotes: 36
Reputation: 1113
Maybe a little bit off topic, but here is the solution using Scala. Make an Array
of column names from your oldDataFrame
and delete the columns that you want to drop ("colExclude")
. Then pass the Array[Column]
to select
and unpack it.
val columnsToKeep: Array[Column] = oldDataFrame.columns.diff(Array("colExclude"))
.map(x => oldDataFrame.col(x))
val newDataFrame: DataFrame = oldDataFrame.select(columnsToKeep: _*)
Upvotes: 4
Reputation: 5433
You could either explicitly name the columns you want to keep, like so:
keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]
Or in a more general approach you'd include all columns except for a specific one via a list comprehension. For example like this (excluding the id
column from b
):
keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']
Finally you make a selection on your join result:
d = a.join(b, a.id==b.id, 'outer').select(*keep)
Upvotes: 14