Reputation:
I have a dataframe with 432 columns, 24 of which are duplicates.
I want to remove the duplicate columns from df_tickets, so it should end up with 432 - 24 = 408 columns.
I have tried this with the code below, but it's throwing an error.
df_tickets.select([c for c in df_tickets.columns if c not in duplicatecols]).show()
The error is:
An error occurred while calling o1657.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[ms_bvoip_order_extension_id#953, ms_order_id#954...........
Can someone help me with this?
Upvotes: 1
Views: 16009
Reputation: 13998
You might have to rename some of the duplicate columns before you can filter them out; otherwise, the columns in duplicatecols will all be de-selected, while you probably want to keep one column for each name. Below is one way that might help:
# an example dataframe
cols = list('abcaded')
df_ticket = spark.createDataFrame([tuple(i for i in range(len(cols)))], cols)
>>> df_ticket.show()
#+---+---+---+---+---+---+---+
#| a| b| c| a| d| e| d|
#+---+---+---+---+---+---+---+
#| 0| 1| 2| 3| 4| 5| 6|
#+---+---+---+---+---+---+---+
# unless you just want to filter a subset of all duplicate columns
# this list is probably not useful
duplicatecols = list('ad')
# create cols_new so that seen columns will have a suffix '_dup'
cols_new = []
seen = set()
for c in df_ticket.columns:
    cols_new.append('{}_dup'.format(c) if c in seen else c)
    seen.add(c)
>>> cols_new
#['a', 'b', 'c', 'a_dup', 'd', 'e', 'd_dup']
Then filter the result based on the new column names; toDF renames the columns positionally, which is what lets the two copies of a duplicated name be told apart:
>>> df_ticket.toDF(*cols_new).select(*[c for c in cols_new if not c.endswith('_dup')]).show()
#+---+---+---+---+---+
#| a| b| c| d| e|
#+---+---+---+---+---+
#| 0| 1| 2| 4| 5|
#+---+---+---+---+---+
This keeps the first of the columns sharing the same name; it should be an easy fix if you want to keep the last instead. Ideally, you should adjust the column names before creating a dataframe with duplicated column names in the first place.
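For reference, a minimal sketch of the keep-the-last variant (reusing the df_ticket example above): scan the columns right-to-left so that the earlier occurrences are the ones tagged with the '_dup' suffix.
# keep the LAST occurrence instead: walk the columns in reverse,
# so earlier occurrences of a name pick up the '_dup' suffix
cols_new = []
seen = set()
for c in reversed(df_ticket.columns):
    cols_new.append('{}_dup'.format(c) if c in seen else c)
    seen.add(c)
cols_new = cols_new[::-1]   # restore the original left-to-right order
>>> cols_new
#['a_dup', 'b', 'c', 'a', 'd_dup', 'e', 'd']
The same toDF(...).select(...) line as above then returns columns b, c, a, e, d with values 1, 2, 3, 5, 6.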
Upvotes: 7
Reputation: 1530
The solution below should get rid of the duplicates and preserve the column order of the input df.
df_tickets = spark.createDataFrame([(1,2,3,4,5)],['a','b','c','d','e'])
duplicatecols = spark.createDataFrame([(1,3,5)],['a','c','e'])
df_tickets.printSchema()
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: long (nullable = true)
|-- e: long (nullable = true)
duplicatecols.printSchema()
root
|-- a: long (nullable = true)
|-- c: long (nullable = true)
|-- e: long (nullable = true)
dup_set = set(duplicatecols.columns)  # build the lookup set once
outCols = [x for x in df_tickets.columns if x not in dup_set]
df_tickets[outCols].show()
+---+---+
| b| d|
+---+---+
| 2| 4|
+---+---+
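As an aside, df_tickets[outCols] is just indexing shorthand for a select; spelling it out explicitly should produce the same output:
df_tickets.select(outCols).show()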
Upvotes: 0
Reputation: 1464
Try this:
select_list = list(set(df_tickets.columns) - set(duplicatecols))
df_tickets.select(select_list).show()
Let me know whether this works for you. Note that the set difference will not preserve the original column order.
Upvotes: -2