Reputation: 465
I have 100 Excel (*.xlsx) files stored in HDFS. The 100 *.xlsx files are organized into 10 directories, as shown below:
/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx
/user/cloudera/raw_data/dataPoint2/dataPoint.xlsx
...
..
.
/user/cloudera/raw_data/dataPoint10/dataPoint.xlsx
Reading in one of the *.xlsx files from above using
rawData = sc.textFile("/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx")
returned gibberish data!
One obvious suggestion I received was to use the Gnumeric spreadsheet application's command-line utility called ssconvert:
$ ssconvert dataPoint.xlsx dataPoint.csv
and then dump it into HDFS, so that the *.csv files can be read directly. But that is not what I am trying to do, nor is it the requirement.
Solutions in Python (preferably) or Java would be appreciated. I am a rookie, so a detailed walkthrough would be really helpful.
Thanks in advance.
Upvotes: 3
Views: 9159
Reputation: 1025
You can use the Spark Excel library to convert xlsx files to DataFrames directly. See this answer for a detailed example.
As of version 0.8.4, the library does not support streaming and loads all the source rows into memory for conversion.
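As a rough Scala sketch for Spark 2 (the format name and the sheetName/useHeader/inferSchema options follow the spark-excel documentation for the 0.8.x line; verify them against the version you actually use, and adjust the sheet name and path for your data):

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")      // assumed sheet name
  .option("useHeader", "true")        // first row holds the column names
  .option("inferSchema", "true")
  .load("/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx")

df.show()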
Upvotes: 1
Reputation: 186
You can try the HadoopOffice library: https://github.com/ZuInnoTe/hadoopoffice/wiki
It works with Spark, and if you can use the Spark2 data source API you can also use Python. If you cannot use the Spark2 data source API, then you can use the standard Spark APIs for reading/writing files with the Hadoop FileFormats provided by the HadoopOffice library.
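With the Spark2 data source API it could look roughly like this in Scala (the format name and the locale option are taken from the HadoopOffice wiki and should be checked against the release you use):

val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "us")   // locale used when formatting cell values
  .load("/user/cloudera/raw_data/dataPoint1/dataPoint.xlsx")

df.show()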
Upvotes: 0
Reputation: 73
Use the following code to read Excel files in Spark directly from HDFS using the Hadoop FileSystem API. However, you have to use the Apache POI API yourself to parse the data (see the sketch after the code).
import java.io.InputStream
import java.net.URI

import scala.collection.JavaConversions._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, Path }
import org.apache.poi.ss.usermodel.{ Cell, Row, Sheet, Workbook, WorkbookFactory }
import org.apache.spark.{ SparkConf, SparkContext }

object Excel {
  def main(arr: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Excel-read-write").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Open the .xlsx file directly from HDFS through the Hadoop FileSystem API
    val fs = FileSystem.get(URI.create("hdfs://localhost:9000/user/files/timetable.xlsx"), new Configuration())
    val path = new Path("hdfs://localhost:9000/user/files/timetable.xlsx")
    val inputStream = fs.open(path)

    read(inputStream)
  }

  // Parse the workbook from the stream with Apache POI (implementation up to you)
  def read(in: InputStream) = {
  }
}
The read(in: InputStream) method is where you implement the Apache POI API to parse the data; a possible sketch is shown below.
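A minimal sketch of such a read() implementation, assuming POI's WorkbookFactory and the imports already shown above (it naively stringifies every cell of the first sheet):

def read(in: InputStream): Unit = {
  val workbook = WorkbookFactory.create(in)    // auto-detects XLSX vs XLS
  val sheet = workbook.getSheetAt(0)           // first sheet only
  for (row <- sheet) {                         // JavaConversions makes the Sheet iterable
    println(row.map(_.toString).mkString(","))
  }
  workbook.close()
}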
Upvotes: 2
Reputation: 848
If you are willing to build a custom XLSX-to-CSV converter yourself, the Apache POI event API would be ideal for this. It is suited to spreadsheets with a large memory footprint, because it streams rows instead of loading the whole workbook into memory. You can read about it here. Here is an example of XLSX processing with the XSSF event code.
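As a rough, self-contained Scala sketch of the event/streaming approach (class names follow POI's XSSF event user model; the CSV output is deliberately naive, with no quoting or handling of empty cells):

import java.io.InputStream
import javax.xml.parsers.SAXParserFactory
import org.apache.poi.openxml4j.opc.OPCPackage
import org.apache.poi.xssf.eventusermodel.{ ReadOnlySharedStringsTable, XSSFReader, XSSFSheetXMLHandler }
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler
import org.apache.poi.xssf.usermodel.XSSFComment
import org.xml.sax.InputSource

object XlsxToCsv {
  // Streams every sheet of the workbook and prints one naive CSV line per row,
  // without loading the whole workbook into memory.
  def convert(in: InputStream): Unit = {
    val pkg = OPCPackage.open(in)
    val reader = new XSSFReader(pkg)
    val styles = reader.getStylesTable
    val strings = new ReadOnlySharedStringsTable(pkg)

    // Collects the formatted cell values of the current row and emits a CSV line at row end
    val handler = new SheetContentsHandler {
      private val row = scala.collection.mutable.ListBuffer.empty[String]
      override def startRow(rowNum: Int): Unit = row.clear()
      override def endRow(rowNum: Int): Unit = println(row.mkString(","))
      override def cell(ref: String, value: String, comment: XSSFComment): Unit = row += value
      override def headerFooter(text: String, isHeader: Boolean, tagName: String): Unit = ()
    }

    val sheets = reader.getSheetsData              // iterator over the raw sheet XML streams
    while (sheets.hasNext) {
      val sheetStream = sheets.next()
      val parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader
      parser.setContentHandler(new XSSFSheetXMLHandler(styles, strings, handler, false))
      parser.parse(new InputSource(sheetStream))   // rows arrive as SAX events
      sheetStream.close()
    }
    pkg.close()
  }
}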
Upvotes: 0