Nick.Mc
Nick.Mc

Reputation: 19235

Rename dataframe columns in spark python

I have a CSV with headings that I'd like to save as Parquet (actually a delta table)

The column headings have spaces in them, which parquet can't handle. How do I change spaces to underscores?

This is what I have so far, cobbled together from other SO posts:

from pyspark.sql.functions import *

df = spark.read.option("header", True).option("delimiter","\u0001").option("inferSchema",True).csv("/mnt/landing/MyFile.TXT")

names = df.schema.names
for name in names:
    df2 = df.withColumnRenamed(name,regexp_replace(name, ' ', '_'))

When I run this, the final line gives me this error:

TypeError: Column is not iterable

I thought this would be a common requirement given that parquet can't handle spaces but it's quite difficult to find any examples.

Upvotes: 2

Views: 1954

Answers (3)

s.polam
s.polam

Reputation: 10382

You are getting exception because - function regexp_replace returns of type Column but function withColumnRenamed is excepting of type String.

def regexp_replace(e: org.apache.spark.sql.Column,pattern: String,replacement: String): org.apache.spark.sql.Column
def withColumnRenamed(existingName: String,newName: String): org.apache.spark.sql.DataFrame

Upvotes: 1

notNull
notNull

Reputation: 31540

Use .toDF (or) .select and pass list of columns to create new dataframe.

df.show()
#+---+----+----+
#| id|id a|id b|
#+---+----+----+
#|  1|   a|   b|
#|  2|   c|   d|
#+---+----+----+
new_cols=list(map(lambda x: x.replace(" ", "_"), df.columns))

df.toDF(*new_cols).show()

df.select([col(s).alias(s.replace(' ','_')) for s in df.columns]).show()
#+---+----+----+
#| id|id_a|id_b|
#+---+----+----+
#|  1|   a|   b|
#|  2|   c|   d|
#+---+----+----+

Upvotes: 0

Alex Ott
Alex Ott

Reputation: 87349

You need to use reduce function to iteratively apply renaming to the dataframe, because in your code df2 will have only the last column renamed...

The code would look as following (instead of for loop):

df2 = reduce(lambda data, name: data.withColumnRenamed(name, name.replace('1', '2')), 
             names, df)

Upvotes: 1

Related Questions