gabriel119435
gabriel119435

Reputation: 6832

How ExcelExtractor decides to use comma or dot as thousand separator?

I have this static method that receives a file path from a .xls file and extracts a single String out of it. I'm using POI to retrieve this data. This is the code:

private static String readsIPCASheet() throws Exception {
    String filePath = unzipsIPCAfile();
    InputStream input = new FileInputStream(filePath);
    HSSFWorkbook wb = new HSSFWorkbook(new POIFSFileSystem(input));
    ExcelExtractor extractor = new ExcelExtractor(wb);
    extractor.setFormulasNotResults(true);
    extractor.setIncludeSheetNames(false);
    extractor.setIncludeBlankCells(true);
    String text = extractor.getText();
    extractor.close();
}

I discovered that using my local machine (located in Brazil) I receive the following data:

198.22  40.27   171.24  568.17  98.22   3,035.71

But when I run this code on my VPS (located in US) I receive this:

198,22  40,27   171,24  568,17  98,22   3.035,71

Using unix locale, I got this:

Remote:

wildfly@server:~$ locale
LANG=pt_BR.UTF-8
LANGUAGE=
LC_CTYPE=pt_BR.UTF-8
LC_NUMERIC=pt_BR.UTF-8
LC_TIME=pt_BR.UTF-8
LC_COLLATE="pt_BR.UTF-8"
LC_MONETARY=pt_BR.UTF-8
LC_MESSAGES="pt_BR.UTF-8"
LC_PAPER=pt_BR.UTF-8
LC_NAME=pt_BR.UTF-8
LC_ADDRESS=pt_BR.UTF-8
LC_TELEPHONE=pt_BR.UTF-8
LC_MEASUREMENT=pt_BR.UTF-8
LC_IDENTIFICATION=pt_BR.UTF-8
LC_ALL=

Local:

gabriel ~ $ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=pt_BR.UTF-8
LC_NUMERIC=pt_BR.UTF-8
LC_TIME=pt_BR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=pt_BR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=pt_BR.UTF-8
LC_NAME=pt_BR.UTF-8
LC_ADDRESS=pt_BR.UTF-8
LC_TELEPHONE=pt_BR.UTF-8
LC_MEASUREMENT=pt_BR.UTF-8
LC_IDENTIFICATION=pt_BR.UTF-8
LC_ALL=

Since I dealing with regex after this piece of code, I prefer to set a standard about this than solving two unnecessary situations. How can accomplish this?

Upvotes: 0

Views: 1014

Answers (1)

Axel Richter
Axel Richter

Reputation: 61925

The apache poi Excel extractors (HSSF as well as XSSF) are using DataFormatter for formatting numeric values. So DataFormatter decides to use comma or dot as thousand separator. At least in the last stable apache poi version 3.17, it does this decision dependent of the LocaleUtil.getUserLocale(). See DataFormatter.java:235ff.

So if you do setting LocaleUtil.setUserLocale before creating the extractor, then this locale will be used to decide to use comma or dot as thousand separator.

Example

...
LocaleUtil.setUserLocale(java.util.Locale.US);
ExcelExtractor extractor = new ExcelExtractor(wb);
...

uses US locale in DataFormatter and so uses dot as decimal separator and comma as thousand separator.

Upvotes: 3

Related Questions