Reputation: 551
I have a dataframe that looks like this:
# +----+------+------+
# |col1| col2 | col3 |
# +----+------+------+
# | id | name | val  |
# | 1  | a01  | X    |
# | 2  | a02  | Y    |
# +----+------+------+
I need to create a new dataframe from it, using row[1] (the id, name, val row) as the new column headers and dropping the original col1, col2, etc. headers. The new table should look like this:
# +----+------+------+
# | id | name | val  |
# +----+------+------+
# | 1  | a01  | X    |
# | 2  | a02  | Y    |
# +----+------+------+
The columns can vary, so I can't set the names explicitly in the new dataframe. I am not using pandas DataFrames.
Upvotes: 2
Views: 8408
Reputation: 21
Thanks to @Sai Kiran! Setting header=True works for me:
df = spark.read.csv("TSCAINV_062020.csv",header=True)
Upvotes: 0
Reputation: 37
Did you try setting header=True?
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
df = spark.read.csv("TSCAINV_062020.csv",header=True)
PySpark names the columns _c0, _c1, _c2, ... when header is not set to True, and the header line is pushed down into the data as the first row.
Upvotes: 3
Reputation: 41957
Assuming that there is only one row with id in col1, name in col2 and val in col3, you can use the following logic (commented for clarity and explanation):
# select the row that holds the header names
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))
# select every row except the header row
# (subtract is a set difference, so duplicate data rows are collapsed too)
restDF = df.subtract(header)
# convert the header row into a Row
headerColumn = header.first()
# loop over the columns and rename each one to the value in the header row
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])
restDF.show(truncate=False)
This should give you:
+---+----+---+
|id |name|val|
+---+----+---+
|1 |a01 |X |
|2 |a02 |Y |
+---+----+---+
However, the best option is to set the header option to true when reading the dataframe from the source with sqlContext.
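If re-reading the source is not possible and the column names vary, the same idea can be written without hardcoding col1, col2 and col3. The following is only a sketch: header_names and promote_first_row are hypothetical helper names, and it assumes that df.first() really returns the header row, which Spark does not guarantee in general because row order is not fixed.

```python
from functools import reduce


def header_names(header_row, columns):
    """Return the new column names found in header_row, in column order."""
    names = [str(header_row[c]) for c in columns]
    if len(set(names)) != len(names):
        raise ValueError("header row contains duplicate names: %r" % names)
    return names


def promote_first_row(df):
    """Use the first row of a Spark DataFrame as its column names (hypothetical helper)."""
    # Assumption: df.first() is the header row; Spark does not guarantee
    # row order in general, so check this for your data source.
    header_row = df.first()
    new_names = header_names(header_row, df.columns)
    # A null-safe match on every column identifies rows equal to the header row.
    is_header = reduce(
        lambda a, b: a & b,
        [df[c].eqNullSafe(header_row[c]) for c in df.columns],
    )
    # Drop the header row(s), then rename all columns positionally.
    return df.filter(~is_header).toDF(*new_names)
```

Calling promote_first_row(df).show(truncate=False) would then display the renamed dataframe. Unlike subtract, the filter keeps duplicate data rows, but it removes every row identical to the header row.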
Upvotes: 5