Reputation: 5168
Recently I wanted to do the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.
The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is an xlsx file with five sheets.
To use the data in the lab I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. During the training they use a Databricks notebook, but I was using IntelliJ IDEA with Scala and evaluating the code in the console.
The first step was to save each Excel sheet into a separate xlsx file, named sheet1.xlsx, sheet2.xlsx, etc., and to put them into a sheets directory.
How to read all the Excel files and concatenate them into one Apache Spark DataFrame?
Upvotes: 4
Views: 11945
Reputation: 21
Hope this Spark Scala code helps.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

// List all leaf files under basep that match the glob pattern globp.
// Assumes an active SparkContext `sc` and SparkSession `spark` (e.g. spark-shell or a Databricks notebook).
def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"
val files = listFiles(root, globp)
val paths = files.toVector
Loop over the vector to read the files (a sketch for combining them into one DataFrame follows the loop):
for (path <- paths) {
  print(path.toString)
  val df = spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(path.toString)
}
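The loop above reads each workbook into df but discards it on the next iteration. To get the single DataFrame the question asks for, the per-file DataFrames can be combined with union. A minimal sketch, assuming all files share the same schema and reusing paths and the options above:

// Read every file into a DataFrame and union them into one.
// Assumes `paths` and `spark` from the code above; all files must have the same columns.
val dfs = paths.map { path =>
  spark.read.
    format("com.crealytics.spark.excel").
    option("useHeader", "true").
    option("treatEmptyValuesAsNulls", "false").
    option("inferSchema", "false").
    option("addColorColumns", "false").
    load(path)
}
val combined = dfs.reduce(_ union _)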
Upvotes: 0
Reputation: 2415
We need the spark-excel library for this; it can be obtained from
https://github.com/crealytics/spark-excel#scala-api
spark-shell --driver-class-path ./spark-excel_2.11-0.8.3.jar --master=yarn-client
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new SQLContext(sc)
val document = "path to excel doc"
val dataDF = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet Name")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", document)
  .option("addColorColumns", "false")
  .load(document)
That's all! Now you can perform DataFrame operations on the dataDF object.
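For example, a quick sanity check on the loaded data (assuming the load above succeeded):

dataDF.printSchema()    // column names and types read from the sheet
dataDF.show(5)          // preview the first rows
println(dataDF.count()) // total number of rows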
Upvotes: 1
Reputation: 5168
For this I have used the spark-excel package. It can be added to the build.sbt file as:
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
The code to execute in IntelliJ IDEA Scala Console was:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File
val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val spark = SparkSession.builder().getOrCreate()
// Function to read xlsx file using spark-excel.
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "false").
  load()
val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString) // Array[String]
val dfs = excelFiles.map(f => readExcel(f)) // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_)) // DataFrame
ppdf.count() // res3: Long = 47840
ppdf.show(5)
Console output:
+-----+-----+-------+-----+------+
| AT| V| AP| RH| PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows
Upvotes: 5