Reputation: 478
I am trying to batch rename my columns in PySpark from:
'collect_list(Target_Met_1)[1]' --> 'AB11'
'collect_list(Target_Met_1)[2]' --> 'AB12'
'collect_list(Target_Met_2)[1]' --> 'AB21'
'collect_list(Target_Met_1)[150]' --> 'AB150'
How do I go about it programmatically? Right now, I can rename them manually using:
df.withColumnRenamed('collect_list(Target_Met_1)[1]', 'AB11')
But if I have 500 columns, it's not efficient. I realize that another way to rename them would be to use something like a udf, but I cannot figure out the best possible approach.
I have already split the columns, so that's not the problem. The problem is renaming the columns.
Upvotes: 2
Views: 1134
Reputation: 1542
Something like this can help too. It's a rename function similar to the Pandas rename functionality.
def rename_cols(map_dict):
    """
    Rename a bunch of columns in a data frame
    :param map_dict: Dictionary of old column names to new column names
    :return: Function for use in transform
    """
    def _rename_cols(df):
        for old, new in map_dict.items():
            df = df.withColumnRenamed(old, new)
        return df
    return _rename_cols
And you can use it like
spark_df.transform(rename_cols(dict(old1='new1', old2='new2', old3='new3')))
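Since rename_cols is plain Python, its behavior can be checked without a Spark session by using a minimal stand-in that mimics withColumnRenamed (the FakeDF class below is purely illustrative, not part of PySpark):

```python
def rename_cols(map_dict):
    def _rename_cols(df):
        for old, new in map_dict.items():
            df = df.withColumnRenamed(old, new)
        return df
    return _rename_cols

class FakeDF:
    """Minimal stand-in for a Spark DataFrame (illustration only)."""
    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, old, new):
        # Spark returns a new DataFrame; mimic that here
        return FakeDF([new if c == old else c for c in self.columns])

df = FakeDF(['collect_list(Target_Met_1)[1]', 'other_col'])
out = rename_cols({'collect_list(Target_Met_1)[1]': 'AB11'})(df)
# out.columns == ['AB11', 'other_col']
```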
Upvotes: 0
Reputation: 17824
To rename all columns you can use the method toDF:
import re
df.toDF(*['AB' + ''.join(re.findall(r'\d+', i)) for i in df.columns])
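The name-building part of the comprehension is pure Python, so you can verify what it produces for the question's columns without Spark (sample names taken from the question):

```python
import re

cols = [
    'collect_list(Target_Met_1)[1]',
    'collect_list(Target_Met_1)[2]',
    'collect_list(Target_Met_2)[1]',
]

# Join every digit group in the old name and prefix with 'AB'
new_cols = ['AB' + ''.join(re.findall(r'\d+', c)) for c in cols]
# new_cols == ['AB11', 'AB12', 'AB21']
```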
Upvotes: 1
Reputation: 478
Never mind, I figured it out. Essentially I had to use a list comprehension to rename the columns. I was splitting the columns as mentioned in the link above. Here is what did the trick:
df = df.select('1', '2', '3', *[df[col][i].alias("AB" + str(i + 1) + col) for col in columns for i in range(max_dict[col])])
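The variables columns and max_dict are not defined in the snippet above; assuming columns is the list of exploded column names and max_dict maps each one to its element count, the alias names the comprehension generates can be previewed in plain Python (the sample values below are assumptions, and note these names differ from the AB11-style names in the question, so the expression may need adjusting):

```python
# Assumed shapes for the undefined variables (illustrative values only)
columns = ['Target_Met_1', 'Target_Met_2']
max_dict = {'Target_Met_1': 2, 'Target_Met_2': 1}

# Names the select(...) comprehension would assign via alias
new_names = ["AB" + str(i + 1) + col for col in columns for i in range(max_dict[col])]
# new_names == ['AB1Target_Met_1', 'AB2Target_Met_1', 'AB1Target_Met_2']
```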
Upvotes: 1