JackPGreen

Reputation: 1139

Multibyte Character not displayed correctly when loaded from Excel File using Apache POI in Java

I have an Excel file, which contains a symbol (😘).

When I read this file in, all I get printed to the console is a square symbol with question mark inside it.

When I copy from either the console output or the original file, some applications paste the symbol correctly, while others instead show "ᨅ".

This symbol is a multibyte character.

When I try to write this symbol out to a file, no matter which encoding I use, it is never displayed correctly in Notepad or Notepad++.

In my application I am seeing the symbol displayed as "ᨅ", but when working backwards I cannot seem to do anything with this symbol.
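For context, here is a minimal standalone sketch (no POI needed) of how this symbol behaves in Java strings. The emoji is U+1F618, which lies outside the Basic Multilingual Plane, so Java stores it as a surrogate pair of two char values:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F618 written as its surrogate pair (high surrogate + low surrogate)
        String kiss = "\uD83D\uDE18";
        System.out.println(kiss.length());                         // 2 (char units, not characters)
        System.out.println(kiss.codePointCount(0, kiss.length())); // 1 (one Unicode code point)
        System.out.println(Character.isSurrogatePair(kiss.charAt(0), kiss.charAt(1))); // true
        System.out.println(Integer.toHexString(kiss.codePointAt(0))); // 1f618
    }
}
```

This is why the length and code-point-count printouts below disagree: String.length() counts UTF-16 char units, not user-visible characters.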

The Excel file can be found here.

The code I am using to read this file (using Apache POI) is below.

public static void main(final String[] args) throws Exception
{
    final String desktop = System.getenv("USERPROFILE") + File.separator + "Desktop" + File.separator; //$NON-NLS-1$ //$NON-NLS-2$

    // Read in input file from Desktop
    try (Workbook workbook = WorkbookFactory.create(new File(desktop + "input.xlsx"), null, true)) //$NON-NLS-1$
    {
        // Get the first cell
        final String str = workbook.getSheetAt(0).getRow(0).getCell(0).getStringCellValue();

        // Print some analysis on the String
        final String format = "%-30s%-25s%n"; //$NON-NLS-1$
        System.out.format(format, "Representation of String", str); //$NON-NLS-1$
        System.out.format(format, "Length of String", str.length()); //$NON-NLS-1$
        System.out.format(format, "Code Point Length of String", str.codePointCount(0, str.length())); //$NON-NLS-1$
        System.out.format(format, "Multi-byte Characters", Character.isSurrogatePair(str.charAt(0), str.charAt(1))); //$NON-NLS-1$

        // Write out the String using all available encodings listed in java.nio.charset.StandardCharsets.
        for (final Field field : StandardCharsets.class.getDeclaredFields())
        {
            if (field.getType().equals(Charset.class))
            {
                final Charset charset = (Charset) field.get(null);

                Files.write(Paths.get(desktop + "out_" + charset.name() + ".txt"), str.getBytes(charset), StandardOpenOption.CREATE); //$NON-NLS-1$ //$NON-NLS-2$
            }
        }
    }
}

Upvotes: 2

Views: 784

Answers (1)

Axel Richter

Reputation: 61870

As stated in the comments already, this is not an Apache POI problem but a Windows 7 and Notepad++ problem.

After all, I have started my Windows 7 virtual machine once again (hopefully for the last time now ;-)). The Windows 7 system does not even provide a font having glyphs for that kind of Unicode code point. So it is pretty clear that nothing that relies on the system fonts can show this 😘 emoji properly. Some browsers (Firefox and Chrome, for example) ship their own fonts and so are able to show it. As @Jakg commented, it seems IntelliJ does that as well, while Eclipse does not.

Other systems, such as Ubuntu Linux and Windows 10, are able to show the emoji properly. Windows 10, however, still cannot show it in the CMD command window. But inside IDEs that have their own System.out print writer, even System.out shows it.

And while creating the text files, one can make detecting the encoding easier for text editors by putting a BOM (Byte Order Mark) as the first bytes of the text file. Some charsets (UTF-16, for example) do this by default, but others do not; for those, the program must write the BOM itself.

So with the following code, even Notepad++ should be able to detect the encoding correctly, because all *.txt files will have a BOM.

import org.apache.poi.ss.usermodel.*;

import java.io.FileInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class ExcelReadUnicode {

    public static void main(final String[] args) throws Exception {

        try (Workbook workbook = WorkbookFactory.create(new FileInputStream("input.xlsx"))) {

            String str = workbook.getSheetAt(0).getRow(0).getCell(0).getStringCellValue();

            String format = "%-30s%-25s%n";
            System.out.format(format, "Representation of String", str);
            System.out.format(format, "Length of String", str.length());
            System.out.format(format, "Code Point Length of String", str.codePointCount(0, str.length()));
            System.out.format(format, "Multi-byte Characters", Character.isSurrogatePair(str.charAt(0), str.charAt(1)));

            // These charsets need a BOM added manually:
            Charset[] charsets = new Charset[] {
                StandardCharsets.UTF_16BE,
                StandardCharsets.UTF_16LE,
                StandardCharsets.UTF_8
            };

            for (Charset charset : charsets) {
                Files.write(Paths.get("out_" + charset.name() + ".txt"),
                            ("\uFEFF" + str).getBytes(charset), // write a BOM before the content
                            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
            }

            Files.write(Paths.get("out_" + StandardCharsets.UTF_16.name() + ".txt"),
                        str.getBytes(StandardCharsets.UTF_16), // UTF-16 writes a BOM by default
                        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        }
    }
}
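The encoding detection works because the BOM character U+FEFF maps to a different, recognizable byte sequence under each encoding. A small sketch of the bytes the code above prepends to each file:

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // U+FEFF (the BOM) encoded under each charset used above:
        printHex("UTF-8   ", "\uFEFF".getBytes(StandardCharsets.UTF_8));    // EF BB BF
        printHex("UTF-16BE", "\uFEFF".getBytes(StandardCharsets.UTF_16BE)); // FE FF
        printHex("UTF-16LE", "\uFEFF".getBytes(StandardCharsets.UTF_16LE)); // FF FE
    }

    static void printHex(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ":");
        for (byte b : bytes) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }
}
```

An editor that sees EF BB BF at offset 0 can assume UTF-8; FE FF means UTF-16BE, FF FE means UTF-16LE.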

Btw.: The term "charset" is misused for Unicode encodings. The character set is always Unicode itself and is always the same. The different UTFs (Unicode Transformation Formats) are different encoding formats for that one Unicode character set.
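To illustrate that distinction: the code point U+1F618 stays the same, and only its byte representation changes with the transformation format. A small sketch (the UTF-32BE charset is not in StandardCharsets, but OpenJDK ships it):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfDemo {
    public static void main(String[] args) {
        String kiss = "\uD83D\uDE18"; // the single code point U+1F618
        printHex("UTF-8   ", kiss.getBytes(StandardCharsets.UTF_8));      // F0 9F 98 98
        printHex("UTF-16BE", kiss.getBytes(StandardCharsets.UTF_16BE));   // D8 3D DE 18 (surrogate pair)
        printHex("UTF-32BE", kiss.getBytes(Charset.forName("UTF-32BE"))); // 00 01 F6 18 (raw code point)
    }

    static void printHex(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ":");
        for (byte b : bytes) sb.append(String.format(" %02X", b));
        System.out.println(sb);
    }
}
```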

Upvotes: 1
