Reputation: 19235
I have a CSV with headings that I'd like to save as Parquet (actually a delta table)
The column headings have spaces in them, which parquet can't handle. How do I change spaces to underscores?
This is what I have so far, cobbled together from other SO posts:
from pyspark.sql.functions import *
df = spark.read.option("header", True).option("delimiter","\u0001").option("inferSchema",True).csv("/mnt/landing/MyFile.TXT")
names = df.schema.names
for name in names:
    df2 = df.withColumnRenamed(name, regexp_replace(name, ' ', '_'))
When I run this, the final line gives me this error:
TypeError: Column is not iterable
I thought this would be a common requirement given that Parquet can't handle spaces, but it's quite difficult to find any examples.
Upvotes: 2
Views: 1954
Reputation: 10382
You are getting the exception because the function regexp_replace returns a Column, while withColumnRenamed expects a String:

def regexp_replace(e: org.apache.spark.sql.Column, pattern: String, replacement: String): org.apache.spark.sql.Column
def withColumnRenamed(existingName: String, newName: String): org.apache.spark.sql.DataFrame
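A minimal sketch of a fix along those lines (not part of the original answer): use Python's str.replace so that withColumnRenamed receives a plain string, and reassign df2 on every iteration so all of the renames accumulate:

df2 = df
for name in df.schema.names:
    # str.replace returns a plain Python str, which withColumnRenamed expects
    df2 = df2.withColumnRenamed(name, name.replace(' ', '_'))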
Upvotes: 1
Reputation: 31540
Use .toDF() (or .select()) and pass a list of renamed columns to create a new dataframe.
df.show()
#+---+----+----+
#| id|id a|id b|
#+---+----+----+
#| 1| a| b|
#| 2| c| d|
#+---+----+----+
new_cols = list(map(lambda x: x.replace(" ", "_"), df.columns))
df.toDF(*new_cols).show()
# col comes from pyspark.sql.functions
df.select([col(s).alias(s.replace(' ', '_')) for s in df.columns]).show()
#+---+----+----+
#| id|id_a|id_b|
#+---+----+----+
#| 1| a| b|
#| 2| c| d|
#+---+----+----+
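Once the spaces are gone, the dataframe can be written out as the question intends. A minimal sketch, assuming Delta Lake is available and using an illustrative Databricks-style mount path:

df_clean = df.toDF(*new_cols)
# Parquet (and therefore Delta) rejects spaces in column names, so this write now succeeds
df_clean.write.format("delta").mode("overwrite").save("/mnt/landing/my_table")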
Upvotes: 0
Reputation: 87349
You need to use the reduce function to apply the renaming iteratively, because in your code df2 ends up with only the last column renamed. The code would look like the following (instead of the for loop):

from functools import reduce

df2 = reduce(lambda data, name: data.withColumnRenamed(name, name.replace(' ', '_')),
             names, df)
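Here reduce threads the partially renamed dataframe through each step, so every rename applies to the result of the previous one instead of to the original df (which is what made the loop in the question keep only the last rename).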
Upvotes: 1