Reputation: 27
Can we make the code skip unnecessary lines until the header row is found in Excel and CSV files using Python (any module)?
Note: if anyone can help me do this in PySpark, it would be appreciated.
In the above sample Excel file we need to skip the first 3 rows automatically and start reading from the 4th line, which starts with "G/L" and is the main header line of the document.
Upvotes: 0
Views: 662
Reputation: 1148
I would advise using Pandas to read the CSV and XLSX files, as it has the easiest interface, and converting to PySpark later on. I have also added two alternatives that you can try out depending on your setup and preferences.
Direct Pandas
import pandas as pd

pandas_df = pd.read_excel('path/file.xlsx', sheet_name='Table 1', skiprows=3)
skiprows=3 defines how many lines are skipped before Pandas starts reading the file.
Documentation
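If the number of rows to skip is not fixed and you want to detect the header row automatically (the row starting with "G/L" in your sample), a small helper along these lines should work. This is only a sketch: the marker string, the sheet name 'Table 1', and the file paths are placeholders taken from your example.
import pandas as pd

def find_header_row(path, marker='G/L', max_rows=50):
    # Read a small preview of the file without a header and return the
    # index of the first row that contains the marker string.
    if path.lower().endswith('.csv'):
        preview = pd.read_csv(path, header=None, nrows=max_rows, dtype=str)
    else:
        preview = pd.read_excel(path, sheet_name='Table 1', header=None, nrows=max_rows, dtype=str)
    for idx, row in preview.iterrows():
        if row.astype(str).str.contains(marker, na=False).any():
            return idx
    raise ValueError(f'No row containing {marker!r} found in the first {max_rows} rows')

header_idx = find_header_row('path/file.xlsx')
pandas_df = pd.read_excel('path/file.xlsx', sheet_name='Table 1', skiprows=header_idx)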
PySpark
If you want to use PySpark directly, you'll need to install two libraries on your cluster:
com.crealytics:spark-excel_2.12:0.13.7 (from Maven)
xlrd (from PyPI)
You can then use spark.read as follows:
spark_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'Table 1'!A4") \
.load(filePath)
Here you can tell Spark to start reading at cell A4 via the dataAddress option.
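For the CSV file, spark.read.csv has no comparable option to skip leading rows. One workaround (a rough sketch, assuming the first 3 lines are the ones to drop and the path is a placeholder) is to remove those lines at the RDD level and let Spark parse the remainder as CSV:
# Drop the first 3 lines, then parse the rest; the remaining first line
# (the one starting with "G/L") becomes the header.
lines = spark.sparkContext.textFile('path/file.csv')
data = (lines.zipWithIndex()
             .filter(lambda pair: pair[1] >= 3)
             .map(lambda pair: pair[0]))

spark_df = spark.read.csv(data, header=True, inferSchema=True)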
PySpark over Pandas
Using the simple Pandas interface while working directly in PySpark is also possible since PySpark 3.2.0 via the pandas API on Spark (pyspark.pandas). You can use it as follows:
import pyspark.pandas as ps

# Returns a pandas-on-Spark DataFrame.
psdf = ps.read_excel('path/file.xlsx', sheet_name='Table 1', skiprows=3)
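If you need a regular Spark DataFrame afterwards, you can convert it (variable names are just placeholders):
# Convert the pandas-on-Spark DataFrame into a plain Spark DataFrame.
spark_df = psdf.to_spark()
spark_df.show(5)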
Upvotes: 1