Reputation: 21
How can I load an Excel file with multiple columns into a DataFrame using Spark’s Java API?
For example, if I wanted to read a CSV file, I would use:
// Read a CSV file with a header row into a DataFrame
Dataset<Row> df = sparkSession.read()
        .option("header", "true")
        .csv(filePath);
Upvotes: 1
Views: 457
Reputation: 102813
You need an Excel-reading library, such as Apache POI or JXLS. Both projects have API documentation, and plenty of tutorials exist on the web; reading XLS files from Java is a fairly popular pastime.
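If the sheet is small enough to read on the driver, a minimal sketch with Apache POI might look like the following. It assumes a single-sheet .xlsx whose first row holds the column names and whose data rows all have the same width; every cell is rendered as a string, and the method name and all-string schema are my choices, not anything POI or Spark mandates:

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public static Dataset<Row> readExcel(SparkSession spark, String path) throws Exception {
    List<Row> rows = new ArrayList<>();
    List<String> header = new ArrayList<>();
    DataFormatter fmt = new DataFormatter(); // renders each cell the way Excel displays it
    try (Workbook wb = WorkbookFactory.create(new FileInputStream(path))) {
        Sheet sheet = wb.getSheetAt(0);
        // POI's Row clashes with Spark's Row, so qualify it fully
        for (org.apache.poi.ss.usermodel.Row excelRow : sheet) {
            List<String> cells = new ArrayList<>();
            for (int c = 0; c < excelRow.getLastCellNum(); c++) {
                cells.add(fmt.formatCellValue(excelRow.getCell(c))); // null cells become ""
            }
            if (header.isEmpty()) {
                header.addAll(cells); // treat the first row as column names
            } else {
                rows.add(RowFactory.create(cells.toArray()));
            }
        }
    }
    StructType schema = new StructType();
    for (String name : header) {
        schema = schema.add(name, DataTypes.StringType);
    }
    return spark.createDataFrame(rows, schema);
}

Everything lands on the driver as strings, so this only suits files that fit in memory; cast columns afterwards with withColumn if you need real types. And per the caveats below, expect POI to choke on anything fancy (merged cells, formulas, charts).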
Note that Excel is a closed, proprietary format. Microsoft claims that xlsx files are an open standard; this is a fairly disgusting lie, intended to mislead government officials who lack the technical knowledge to evaluate that preposterous claim. Both of these libraries are therefore reverse-engineered attempts, and they will mess up. When they do, it is not their fault.

Excel is not a suitable format for whatever it is you are doing. At best you're applying workarounds to make it work, but the result will unavoidably be less stable than you'd like. The more complicated the Excel sheet, the more likely that JXLS/POI will simply crash or do the wrong thing.
If you're in a position to modify the process that leads to '... and now I need to read an Excel sheet with Java', use that position and change the format to almost anything else: plain CSV files, a database, etc.
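For the database route, Spark can read a table directly over JDBC, so no Excel parsing is involved at all. A sketch, where the connection details are entirely hypothetical:

// Hypothetical connection details; any JDBC-accessible database works
Dataset<Row> df = sparkSession.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")
        .option("dbtable", "my_table")
        .option("user", "reader")
        .option("password", "secret")
        .load();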
Upvotes: 1