Reputation: 1

PySpark - Read CSV and ignore file header (not using pandas)

I have a problem that I hope you can help me with.
I have a text file that looks like this:

Report Name : 
column1,column2,column3
this is row 1,this is row 2, this is row 3

I am using Synapse Notebooks to read this file into a dataframe. If I read the CSV file using spark.read.csv(), it treats "Report Name : " as the column name, which is obviously incorrect. I know that the pandas CSV reader has a skiprows parameter (e.g. skiprows=1), but unfortunately I cannot read the file directly with pandas, as I am getting some strange networking errors. I can, however, convert a PySpark dataframe to a pandas dataframe via df.toPandas(). I'd like to be able to solve this with plain PySpark dataframes.

Surely someone else has encountered this issue! Help!

I have tried every variation of reading the file, dropping rows and columns, etc., but the schema is already fixed when the first dataframe is created, with a single column (Report Name : ). Not sure what to do now.

Upvotes: 0

Answers (2)

data_engineer_eric

Reputation: 1

Microsoft got back to me with an answer that worked! When you pass the path of the source file to the pandas CSV reader, it requires an endpoint to Blob Storage (not ADLS Gen2). I only had an endpoint with dfs in the URI, not blob. After I added the endpoint to Blob Storage, the pandas reader worked great! Thanks for looking at my thread.

Upvotes: 0

Raid

Reputation: 180

Copied answer from similar question: How to skip lines while reading a CSV file as a dataFrame using PySpark?

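import csv

# Assumes `sc` is the active SparkContext, as in a notebook session.
df = (
    sc.textFile("test.csv")
    # Parse each partition's lines with the csv module (handles quoted fields).
    .mapPartitions(lambda lines: csv.reader(lines, delimiter=",", quotechar='"'))
    # Drop the "Report Name : " line (one field) and the column1,... header row.
    .filter(lambda row: len(row) >= 2 and row[0] != "column1")
    .toDF(["column1", "column2", "column3"])
)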
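
The snippet above hard-codes the column names and relies on the preamble line having fewer than two fields. If you would rather skip a fixed number of leading lines and let Spark take the header from the file, here is a minimal sketch of one alternative. It assumes an active SparkSession is passed in as `spark`; `past_preamble` and `read_csv_skipping` are hypothetical helper names, not part of PySpark. It relies on `zipWithIndex` numbering lines in file order and on `spark.read.csv` accepting an RDD of strings.

```python
def past_preamble(skip):
    """Return a predicate for (line, index) pairs: True once index >= skip."""
    def keep(pair):
        _line, index = pair
        return index >= skip
    return keep


def read_csv_skipping(spark, path, skip=1):
    """Hypothetical helper: drop the first `skip` lines, then parse as CSV.

    Assumes `spark` is an active SparkSession.
    """
    lines = spark.sparkContext.textFile(path)
    body = (
        lines.zipWithIndex()            # pairs of (line, position in file)
        .filter(past_preamble(skip))    # drop the "Report Name : " preamble
        .map(lambda pair: pair[0])      # back to plain strings
    )
    # With header=True, the first remaining line supplies the column names.
    return spark.read.csv(body, header=True)


# Pure-Python illustration of the filtering step on the sample file.
sample = [
    "Report Name : ",
    "column1,column2,column3",
    "this is row 1,this is row 2, this is row 3",
]
pairs = [(line, index) for index, line in enumerate(sample)]
kept = [line for line, _ in filter(past_preamble(1), pairs)]
```

In a notebook this would be called as something like df = read_csv_skipping(spark, "test.csv"), after which df.toPandas() works as usual. All columns come back as strings unless you also pass a schema or inferSchema=True.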

Upvotes: 0
