Rob

Reputation: 478

How do I batch rename columns in pyspark efficiently?

I am trying to batch rename my columns in PySpark from:

 'collect_list(Target_Met_1)[1]' --> 'AB11'
 'collect_list(Target_Met_1)[2]' --> 'AB12'
 'collect_list(Target_Met_2)[1]' --> 'AB21'
 'collect_list(Target_Met_1)[150]' --> 'AB150'

How do I go about it programmatically? Right now, I can manually change the names using:

df.withColumnRenamed('collect_list(Target_Met_1)[1]', 'AB11')

But if I have 500 columns, it's not efficient. I realize that another way to rename them would be using something like a udf, but I cannot figure out the best possible approach.

I have split the columns and that's not the problem. The problem is around renaming the column.

Upvotes: 2

Views: 1134

Answers (3)

pettinato

Reputation: 1542

Something like this can help too. It's a rename function similar to the Pandas rename functionality.

def rename_cols(map_dict):
  """
  Rename a bunch of columns in a data frame
  :param map_dict: Dictionary of old column names to new column names
  :return: Function for use in transform
  """
  def _rename_cols(df):
    for old, new in map_dict.items():
      df = df.withColumnRenamed(old, new)
    return df
  return _rename_cols

And you can use it like

spark_df.transform(rename_cols(dict(old1='new1', old2='new2', old3='new3')))

Upvotes: 0

Mykola Zotko

Reputation: 17824

To rename all columns you can use the method toDF:

import re

df.toDF(*['AB' + ''.join(re.findall(r'\d+', i)) for i in df.columns])
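The renaming expression itself is plain Python, so the new names can be sanity-checked without a Spark session before calling toDF. A quick sketch, using column names shaped like the ones in the question:

```python
import re

# Sample column names shaped like the ones in the question (assumed).
cols = ['collect_list(Target_Met_1)[1]',
        'collect_list(Target_Met_2)[1]',
        'collect_list(Target_Met_1)[2]']

# Concatenate every digit group in each name and prefix with 'AB'.
new_cols = ['AB' + ''.join(re.findall(r'\d+', c)) for c in cols]
# new_cols == ['AB11', 'AB21', 'AB12']
```

Note that all digit groups are joined, so the suffix index in Target_Met_1 becomes part of the new name too.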

Upvotes: 1

Rob

Reputation: 478

Never mind, figured it out. Essentially I had to use a list comprehension to rename the columns. I was splitting columns as mentioned in the link above. Here is what did the trick:

df = df.select('1', '2', '3', *[df[col][i].alias("AB" + str(i + 1) + col) for col in columns for i in range(max_dict[col])])
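The alias names that the comprehension generates can be previewed without Spark. Here is the same nested loop in plain Python; `columns` and `max_dict` are the names used above, but their contents here are hypothetical:

```python
# Hypothetical inputs standing in for the 'columns' list and 'max_dict'
# used in the select() call above.
columns = ['Target_Met_1', 'Target_Met_2']
max_dict = {'Target_Met_1': 2, 'Target_Met_2': 1}

# Same nested comprehension, collecting only the alias strings.
aliases = ["AB" + str(i + 1) + col
           for col in columns
           for i in range(max_dict[col])]
# aliases == ['AB1Target_Met_1', 'AB2Target_Met_1', 'AB1Target_Met_2']
```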

Upvotes: 1
