Reputation: 43
I am reading a blob of an Excel file from the database. Each file contains only one sheet. Using Apache POI I am converting it into a workbook like below.
val wb = w.getWorkbook(blob)
Now I want to convert it to a spark dataframe to process it. What is the best way to do it directly without saving it to disk?
I want to do the conversion in Scala, something like:
val df: DataFrame = workbookToDf(org.apache.poi.ss.usermodel.Workbook)
Upvotes: 2
Views: 806
Reputation: 2451
Just create an iterator,
Iterator<Row> rowIterator = w.getWorkbook(blob).getSheetAt(0).iterator();
iterate over the sheet and put the cell values into a 2D array,
Object[][] tab = new Object[rows][cols];
.iterator() iterates over the rows; inside that loop you have to create a cell iterator to iterate over the "columns" too. Putting it together:
import java.util.Iterator;
import org.apache.poi.ss.usermodel.*;

// The file contains only one sheet, so take the sheet at index 0
Sheet sheet = w.getWorkbook(blob).getSheetAt(0);
// Assuming a rectangular sheet: size the array from the last row index and the first row's width
int rows = sheet.getLastRowNum() + 1;
int cols = sheet.getRow(0).getLastCellNum();

Object[][] tab = new Object[rows][cols];
Iterator<Row> rowIterator = sheet.iterator();
while (rowIterator.hasNext()) {
    Row row = rowIterator.next();
    // For each row, iterate through all the columns
    Iterator<Cell> cellIterator = row.cellIterator();
    while (cellIterator.hasNext()) {
        Cell cell = cellIterator.next();
        // Read every cell as a string, whatever its original type
        cell.setCellType(CellType.STRING);
        tab[cell.getRowIndex()][cell.getColumnIndex()] = cell.getStringCellValue();
    }
}
Then convert your 2D array to a Seq/List/Array of Strings, where each String is the elements of one row concatenated with a delimiter. In the next step create an RDD with sc.parallelize, and in a map step split each string and build an RDD[Row]. It will look like this:
val yourRDD = rdd.map(line => RowFactory.create(line.split(yourDelimiter): _*))
and then use spark.createDataFrame(yourRDD, schema).
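Putting the last steps together, a minimal Scala sketch could look like the following. It assumes the tab array and cols count from the loop above are in scope, that a SparkSession named spark (with sc = spark.sparkContext) already exists, that every cell was read as a String, and that the chosen delimiter never appears in the data; the col0, col1, ... names are just placeholders for your real column names.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, RowFactory}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Join each row of the 2D array into one delimited line (nulls become empty strings)
val delimiter = "\u0001"
val lines: Seq[String] = tab.map(_.map(v => if (v == null) "" else v.toString).mkString(delimiter)).toSeq

// Placeholder schema: one string column per Excel column (col0, col1, ...)
val schema = StructType((0 until cols).map(i => StructField(s"col$i", StringType, nullable = true)))

// Parallelize the lines, split them back into fields and wrap each one in a Spark Row
val rdd = sc.parallelize(lines)
val rowRDD: RDD[Row] = rdd.map(line => RowFactory.create(line.split(delimiter, -1): _*))

val df: DataFrame = spark.createDataFrame(rowRDD, schema)
The -1 limit on split keeps trailing empty fields, so every line still yields exactly cols values to match the schema.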
Upvotes: 2