user11568522


Need to remove duplicate columns from a dataframe in pyspark

I have a dataframe with 432 columns, 24 of which are duplicates.

  1. df_tickets --> this has 432 columns
  2. duplicatecols --> this holds the columns of df_tickets that are duplicated

I want to remove the duplicate columns from df_tickets, so it should end up with 432 - 24 = 408 columns.

I tried the code below, but it throws an error.

df_tickets.select([c for c in df_tickets.columns if c not in duplicatecols]).show()

The error is

An error occurred while calling o1657.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[ms_bvoip_order_extension_id#953, ms_order_id#954...........

Can someone help me on this?

Upvotes: 1

Views: 16009

Answers (3)

jxc
jxc

Reputation: 13998

You might have to rename some of the duplicate columns in order to filter them out. Otherwise, every column listed in duplicatecols will be de-selected, while you probably want to keep one column for each name. Below is one way that might help:

# an example dataframe
cols = list('abcaded')
df_ticket = spark.createDataFrame([tuple(i for i in range(len(cols)))], cols)
>>> df_ticket.show()
#+---+---+---+---+---+---+---+
#|  a|  b|  c|  a|  d|  e|  d|
#+---+---+---+---+---+---+---+
#|  0|  1|  2|  3|  4|  5|  6|
#+---+---+---+---+---+---+---+

# unless you want to filter only a subset of all duplicate columns,
# this list is probably not needed
duplicatecols = list('ad')

# create cols_new so that seen columns will have a suffix '_dup'
cols_new = [] 
seen = set()
for c in df_ticket.columns:
    cols_new.append('{}_dup'.format(c) if c in seen else c)
    seen.add(c)

>>> cols_new
#['a', 'b', 'c', 'a_dup', 'd', 'e', 'd_dup']

Then filter the result based on the new column names

>>> df_ticket.toDF(*cols_new).select(*[c for c in cols_new if not c.endswith('_dup')]).show()
#+---+---+---+---+---+
#|  a|  b|  c|  d|  e|
#+---+---+---+---+---+
#|  0|  1|  2|  4|  5|
#+---+---+---+---+---+

This keeps the first of the columns sharing the same name; it should be an easy fix if you want to keep the last instead. Ideally, you should fix the column names before creating a dataframe with duplicated names in the first place.

Upvotes: 7

michalrudko

Reputation: 1530

The solution below should get rid of the duplicates while preserving the column order of the input df.

Load some sample data

df_tickets = spark.createDataFrame([(1,2,3,4,5)],['a','b','c','d','e'])
duplicatecols = spark.createDataFrame([(1,3,5)],['a','c','e'])

Check df schemas

df_tickets.printSchema()
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)
 |-- c: long (nullable = true)
 |-- d: long (nullable = true)
 |-- e: long (nullable = true)

duplicatecols.printSchema()

root
 |-- a: long (nullable = true)
 |-- c: long (nullable = true)
 |-- e: long (nullable = true)

Remove duplicated columns

outCols = [x for x in df_tickets.columns if x not in set(duplicatecols.columns)]

df_tickets[outCols].show()

+---+---+
|  b|  d|
+---+---+
|  2|  4|
+---+---+
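The order-preserving part is the list comprehension itself: iterating over df_tickets.columns keeps the input order, whereas a raw set difference would not guarantee any order. A sketch of the same idea on plain lists:

```python
# column names of the input df and the duplicates to drop
# (stand-ins for df_tickets.columns and duplicatecols.columns)
tickets_cols = ['a', 'b', 'c', 'd', 'e']
dup_cols = set(['a', 'c', 'e'])

# list comprehension over tickets_cols preserves the input order
outCols = [x for x in tickets_cols if x not in dup_cols]
print(outCols)
# ['b', 'd']
```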

Upvotes: 0

Ankit Kumar Namdeo

Reputation: 1464

Try this:

select_list = list(set(df_tickets.columns)-set(duplicatecols))
df_tickets.select(select_list).show()

Let me know whether this works for you.

Upvotes: -2
