junaedcse

Reputation: 43

How to convert an Apache POI workbook to a Spark DataFrame?

I am reading a blob of an Excel file from the database. Each file contains only one sheet. Using Apache POI, I am converting it into a workbook like below.

val wb = w.getWorkbook(blob)
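(For reference, a minimal sketch of what such a helper might do, assuming the blob is a java.sql.Blob and using POI's WorkbookFactory; the helper body below is an assumption, not the asker's actual code:)

import org.apache.poi.ss.usermodel.{Workbook, WorkbookFactory}

val wb: Workbook = WorkbookFactory.create(blob.getBinaryStream)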

Now I want to convert it to a Spark DataFrame to process it. What is the best way to do this directly, without saving it to disk?

I want to convert it in Scala, like:

val df: DataFrame = workbookToDf(org.apache.poi.ss.usermodel.Workbook)

Upvotes: 2

Views: 806

Answers (1)

chlebek

Reputation: 2451

Just create an iterator,

Iterator<Row> rowIterator = w.getWorkbook(blob).getSheetAt(0).iterator();

iterate over the sheet and put the cell values into a 2D array,

Object[][] tab = new Object[rows][cols];

.iterator() iterates over rows; inside this loop you have to create a cell iterator to iterate over the "columns" too.

import java.util.Iterator;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellType;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

Sheet sheet = w.getWorkbook(blob).getSheetAt(0);

// Size the array from the sheet's dimensions
int rows = sheet.getLastRowNum() + 1;
int cols = sheet.getRow(0).getLastCellNum();
Object[][] tab = new Object[rows][cols];

Iterator<Row> rowIterator = sheet.iterator();

while (rowIterator.hasNext()) {
    Row row = rowIterator.next();
    // For each row, iterate through all the columns
    Iterator<Cell> cellIterator = row.cellIterator();
    while (cellIterator.hasNext()) {
        Cell cell = cellIterator.next();
        // Read every cell as a string, whatever its original type
        cell.setCellType(CellType.STRING);
        tab[cell.getRowIndex()][cell.getColumnIndex()] = cell.getStringCellValue();
    }
}

then convert your 2D array to a Seq/List/Array of Strings, where each String is the concatenated elements of one row. In the next step create an RDD using sc.parallelize, and in the map step split each string and create an RDD[Row]. It will look like this:

val yourRDD = rdd.map(line => RowFactory.create(line.split(yourDelimiter): _*))

and then use spark.createDataFrame(yourRDD, schema).
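For completeness, here is a minimal end-to-end sketch of the same approach in Scala. The workbook value wb, the control-character delimiter, the header-row assumption, and the all-string schema are illustrative assumptions, not part of the answer above:

import scala.collection.JavaConverters._

import org.apache.poi.ss.usermodel.{CellType, Workbook}
import org.apache.spark.sql.{DataFrame, RowFactory, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def workbookToDf(wb: Workbook, spark: SparkSession): DataFrame = {
  val sheet = wb.getSheetAt(0) // each file contains a single sheet

  // Flatten every row into one delimited string; '\u0001' is assumed
  // never to appear in the cell values
  val delimiter = "\u0001"
  val lines = sheet.iterator().asScala.map { row =>
    row.cellIterator().asScala.map { cell =>
      cell.setCellType(CellType.STRING) // read every cell as text
      cell.getStringCellValue
    }.mkString(delimiter)
  }.toList

  // Assumption: the first row holds the column names and every
  // column is typed as String; adjust the schema to your data
  val header = lines.head.split(delimiter)
  val schema = StructType(header.map(StructField(_, StringType, nullable = true)))

  val rdd = spark.sparkContext
    .parallelize(lines.tail)
    .map(line => RowFactory.create(line.split(delimiter): _*))

  spark.createDataFrame(rdd, schema)
}

Note that String.split drops trailing empty fields, so rows whose last cells are blank can come out shorter than the header; padding each split array to the header length is one way to guard against that.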

Upvotes: 2
