Reputation: 23119
I have a large Excel (xlsx and xls) file with multiple sheets, and I need to convert it to an RDD or DataFrame so that it can be joined to another DataFrame later. I was thinking of using Apache POI to save it as a CSV and then reading the CSV into a DataFrame. But if there are any libraries or APIs that can help with this process, that would be easier. Any help is highly appreciated.
Upvotes: 19
Views: 133396
Reputation: 29227
Here are examples of reading from and writing to Excel with the full set of options.
Source: spark-excel from crealytics
Scala API Spark 2.0+:
Create a DataFrame from an Excel file
import org.apache.spark.sql._

val spark: SparkSession = ???
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Daily") // Required
  .option("useHeader", "true") // Required
  .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
  .option("inferSchema", "false") // Optional, default: false
  .option("addColorColumns", "true") // Optional, default: false
  .option("startColumn", 0) // Optional, default: 0
  .option("endColumn", 99) // Optional, default: Int.MaxValue
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
  .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
  .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
  .load("Worktime.xlsx")
Write a DataFrame to an Excel file
df.write
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Daily")
  .option("useHeader", "true")
  .option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
  .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
  .mode("overwrite")
  .save("Worktime2.xlsx")
Note: instead of Sheet1 or Sheet2 you can use the sheets' own names; in the example above, Daily is the sheet name.
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.13.1
groupId: com.crealytics artifactId: spark-excel_2.11 version: 0.13.1
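The `_2.11` suffix in the artifact id is the Scala binary version, and it has to match the Scala version your Spark distribution was built with (the PySpark example further down uses `_2.12` with Spark 3.1.2). As a small sketch, the coordinate can be assembled from its parts so the suffix is not hard-coded. The helper names `spark_excel_coordinate` and `build_session` are made up for illustration and are not part of spark-excel.

```python
def spark_excel_coordinate(scala_binary_version: str, version: str) -> str:
    """Build the Maven coordinate passed to --packages / spark.jars.packages."""
    return f"com.crealytics:spark-excel_{scala_binary_version}:{version}"


def build_session(app_name: str, scala_binary_version: str, version: str):
    """Create a SparkSession that resolves spark-excel at startup."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages",
                spark_excel_coordinate(scala_binary_version, version))
        .getOrCreate()
    )
```

For instance, `build_session("Excel Reader", "2.12", "0.13.5")` would request the same artifact as the PySpark example in this answer.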
Tip: This approach is particularly useful for writing Maven test cases: you can place Excel sheets with sample data in the src/main/resources folder and access them in your unit tests (Scala/Java), which create DataFrames out of the Excel sheets.
A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1. However, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following formats of the HadoopOffice library:
Excel Datasource format: org.zuinnote.spark.office.Excel
Loading and saving of old Excel (.xls) and new Excel (.xlsx). This datasource is available on Spark-packages.org and on Maven Central.
Since PySpark has no Maven-style build, you name the package you want in the Spark session config; Spark downloads it and puts it in the Ivy cache.
Here I am creating a sample DataFrame and saving it as Excel:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def main(output_path):
    spark = SparkSession.builder \
        .appName("Excel Writer") \
        .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
        .getOrCreate()

    schema = StructType([
        StructField("ID", IntegerType(), True),
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Country", StringType(), True)
    ])

    data = [
        (1, "Ram Ghadiyaram", 47, "USA"),
        (2, "Adam", 31, "UK"),
        (3, "Arindam", 25, "Canada"),
        (4, "Rachel Zane", 29, "USA")
    ]

    print("Creating a sample DataFrame...")
    df = spark.createDataFrame(data, schema)
    print("Sample DataFrame:")
    df.show()

    print("Writing DataFrame to Excel file...")
    df.write.format("com.crealytics.spark.excel") \
        .option("dataAddress", "'Sheet1'!A1") \
        .option("header", "true") \
        .option("addColorColumns", "true") \
        .mode("overwrite") \
        .save(output_path)
    print(f"Excel file written to {output_path}")

    spark.stop()

if __name__ == "__main__":
    output_file = "sample_output.xlsx"
    main(output_file)
Log:
C:\Users\ramgh\AppData\Local\Microsoft\WindowsApps\python3.8.exe C:\Users\ramgh\Downloads\spark-3.1.2-bin-hadoop3.2\python_pyspark\Pyspark_excel.py
:: loading settings :: url = jar:file:/C:/Users/ramgh/Downloads/spark-3.1.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: C:\Users\ramgh\.ivy2\cache
The jars for the packages stored in: C:\Users\ramgh\.ivy2\jars
com.crealytics#spark-excel_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dac60d07-afdd-430f-beb2-841d2c79578a;1.0
confs: [default]
found com.crealytics#spark-excel_2.12;0.13.5 in central
found org.apache.poi#poi;4.1.2 in central
found commons-codec#commons-codec;1.13 in central
found org.apache.commons#commons-collections4;4.4 in central
found org.apache.commons#commons-math3;3.6.1 in central
found com.zaxxer#SparseBitSet;1.2 in central
found org.apache.poi#poi-ooxml;4.1.2 in central
found org.apache.poi#poi-ooxml-schemas;4.1.2 in central
found org.apache.xmlbeans#xmlbeans;3.1.0 in central
found com.github.virtuald#curvesapi;1.06 in central
found com.norbitltd#spoiwo_2.12;1.7.0 in central
found org.scala-lang.modules#scala-xml_2.12;1.2.0 in local-m2-cache
found com.github.pjfanning#excel-streaming-reader;2.3.4 in central
found com.github.pjfanning#poi-shared-strings;1.0.4 in central
found com.h2database#h2;1.4.200 in central
found org.apache.commons#commons-text;1.8 in central
found org.apache.commons#commons-lang3;3.9 in local-m2-cache
found xml-apis#xml-apis;1.4.01 in central
found org.slf4j#slf4j-api;1.7.30 in local-m2-cache
found org.apache.commons#commons-compress;1.20 in central
found com.fasterxml.jackson.core#jackson-core;2.8.8 in central
:: resolution report :: resolve 5432ms :: artifacts dl 545ms
:: modules in use:
com.crealytics#spark-excel_2.12;0.13.5 from central in [default]
com.fasterxml.jackson.core#jackson-core;2.8.8 from central in [default]
com.github.pjfanning#excel-streaming-reader;2.3.4 from central in [default]
com.github.pjfanning#poi-shared-strings;1.0.4 from central in [default]
com.github.virtuald#curvesapi;1.06 from central in [default]
com.h2database#h2;1.4.200 from central in [default]
com.norbitltd#spoiwo_2.12;1.7.0 from central in [default]
com.zaxxer#SparseBitSet;1.2 from central in [default]
commons-codec#commons-codec;1.13 from central in [default]
org.apache.commons#commons-collections4;4.4 from central in [default]
org.apache.commons#commons-compress;1.20 from central in [default]
org.apache.commons#commons-lang3;3.9 from local-m2-cache in [default]
org.apache.commons#commons-math3;3.6.1 from central in [default]
org.apache.commons#commons-text;1.8 from central in [default]
org.apache.poi#poi;4.1.2 from central in [default]
org.apache.poi#poi-ooxml;4.1.2 from central in [default]
org.apache.poi#poi-ooxml-schemas;4.1.2 from central in [default]
org.apache.xmlbeans#xmlbeans;3.1.0 from central in [default]
org.scala-lang.modules#scala-xml_2.12;1.2.0 from local-m2-cache in [default]
org.slf4j#slf4j-api;1.7.30 from local-m2-cache in [default]
xml-apis#xml-apis;1.4.01 from central in [default]
:: evicted modules:
org.apache.commons#commons-compress;1.19 by [org.apache.commons#commons-compress;1.20] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 22 | 1 | 1 | 1 || 21 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: ERRORS
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
:: retrieving :: org.apache.spark#spark-submit-parent-dac60d07-afdd-430f-beb2-841d2c79578a
confs: [default]
0 artifacts copied, 21 already retrieved (0kB/73ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/27 14:00:45 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
Creating a sample DataFrame...
Sample DataFrame:
+---+--------------+---+-------+
| ID| Name|Age|Country|
+---+--------------+---+-------+
| 1|Ram Ghadiyaram| 47| USA|
| 2| Adam| 31| UK|
| 3| Arindam| 25| Canada|
| 4| Rachel Zane| 29| USA|
+---+--------------+---+-------+
Writing DataFrame to Excel file...
Excel file written to sample_output.xlsx
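Reading works the same way from PySpark, which is closer to what the question asks for. The sketch below is a hedged counterpart to the writer above: it reads a sheet with the same `dataAddress` and `header` options and joins the result to another DataFrame. The helper names (`excel_read_options`, `read_sheet`, `join_with_sheet`) and the `ID` join key are assumptions for illustration only; `inferSchema` is the option shown in the Scala example.

```python
def excel_read_options(sheet: str, top_left: str = "A1",
                       header: bool = True, infer_schema: bool = True) -> dict:
    """Reader options, using the option names from the 0.13.5 example above."""
    return {
        "dataAddress": f"'{sheet}'!{top_left}",  # e.g. 'Sheet1'!A1
        "header": str(header).lower(),
        "inferSchema": str(infer_schema).lower(),
    }


def read_sheet(spark, path: str, sheet: str):
    """Load one worksheet into a DataFrame."""
    return (
        spark.read.format("com.crealytics.spark.excel")
        .options(**excel_read_options(sheet))
        .load(path)
    )


def join_with_sheet(spark, other_df, path: str, sheet: str, key: str = "ID"):
    """Inner-join an existing DataFrame to the rows of a worksheet on `key`."""
    return other_df.join(read_sheet(spark, path, sheet), on=key, how="inner")
```

With the file written above, `read_sheet(spark, "sample_output.xlsx", "Sheet1")` should give back the four sample rows, provided the session was started with the same package.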
Upvotes: 14
Reputation: 21
Hope this helps.
val df_excel = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .load(file_path)
display(df_excel)
Upvotes: 1
Reputation: 227
I used the com.crealytics.spark.excel 0.11 jar from Spark with Java. It would be the same in Scala; you just need to change javaSparkContext to SparkContext.
tempTable = new SQLContext(javaSparkContext).read()
    .format("com.crealytics.spark.excel")
    .option("sheetName", "sheet1")
    .option("useHeader", "false") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "false") // Required
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .schema(schema)
    .load("hdfs://localhost:8020/user/tester/my.xlsx");
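The `schema` passed to `.schema(...)` above is not shown in the snippet. As a rough illustration in PySpark, `DataFrameReader.schema` accepts either a `StructType` or a DDL string, and the sketch below builds the DDL form. The column list is a placeholder and `ddl_schema` / `read_with_schema` are hypothetical helpers, not part of spark-excel. The `header` option name follows the 0.13.5 PySpark example on this page; the 0.11 jar used above takes `useHeader` instead.

```python
def ddl_schema(columns):
    """Turn [(name, spark_sql_type), ...] into a DDL string for DataFrameReader.schema."""
    # Backticks keep column names with spaces or special characters valid.
    return ", ".join(f"`{name}` {sql_type}" for name, sql_type in columns)


# Placeholder columns; replace with the layout of your own sheet.
EXAMPLE_COLUMNS = [("ID", "INT"), ("Name", "STRING"), ("Age", "INT"), ("Country", "STRING")]


def read_with_schema(spark, path, columns=EXAMPLE_COLUMNS):
    """Read an Excel file with an explicit schema instead of inferring one."""
    return (
        spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")
        .schema(ddl_schema(columns))
        .load(path)
    )
```

Supplying the schema up front avoids the all-strings default mentioned in the option comments and keeps column types stable when the sheet is later joined to other DataFrames.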
Upvotes: 1
Reputation: 186
Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which also supports encrypted Excel documents and linked workbooks, among other features. Spark is supported as well.
Upvotes: 3
Reputation: 41987
The solution to your problem is to use the Spark Excel dependency in your project.
Spark Excel has flexible options to play with.
I have tested the following code to read from Excel and convert it to a DataFrame, and it works perfectly:
def readExcel(file: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "False")
  .load()

val data = readExcel("path to your excel file")

data.show(false)
You can give the sheet name as an option if your Excel file has multiple sheets:
.option("sheetName", "Sheet2")
I hope it's helpful.
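For a workbook with several sheets, one read per sheet is a straightforward pattern. The following is a minimal PySpark sketch under a few assumptions: the sheet names are supplied by the caller, the helper names are invented for illustration, and it uses the `dataAddress` / `header` options seen in the 0.13.5 PySpark example on this page rather than `sheetName` / `useHeader`, so check which set your spark-excel version expects. Sheet names containing an apostrophe are not handled.

```python
from functools import reduce


def sheet_addresses(sheet_names, top_left="A1"):
    """Map each sheet name to a dataAddress string such as 'Sheet2'!A1."""
    return {name: f"'{name}'!{top_left}" for name in sheet_names}


def read_sheets(spark, path, sheet_names):
    """Return {sheet name: DataFrame}, issuing one read per worksheet."""
    frames = {}
    for name, address in sheet_addresses(sheet_names).items():
        frames[name] = (
            spark.read.format("com.crealytics.spark.excel")
            .option("dataAddress", address)
            .option("header", "true")
            .option("inferSchema", "true")
            .load(path)
        )
    return frames


def union_sheets(frames):
    """Stack sheets that share the same columns, tagging rows with their sheet.

    Expects at least one DataFrame in `frames`.
    """
    from pyspark.sql.functions import lit  # lazy: only needed when Spark is used

    tagged = [df.withColumn("sheet", lit(name)) for name, df in frames.items()]
    return reduce(lambda left, right: left.unionByName(right), tagged)
```

If the sheets have different layouts, keep the dictionary from `read_sheets` and join the individual DataFrames on their keys instead of calling `union_sheets`.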
Upvotes: 37