Reputation: 1139
I have an Excel file, which contains a symbol (😘).
When I read this file in, all I get printed to the console is a square symbol with question mark inside it.
When I copy from either the console output, or the original file - some applications paste it correctly (the symbol), others instead just show "ᨅ".
This symbol is a multibyte character.
When I try to write this symbol out to a file, no matter which encoding I use, it is never displayed correctly in Notepad or Notepad++.
In my application I am seeing the symbol displayed as "ᨅ", but when working backwards I cannot seem to do anything with this symbol.
The Excel file can be found here.
The code I am using to read this file (using Apache POI) is below.
public static void main(final String[] args) throws Exception
{
    final String desktop = System.getenv("USERPROFILE") + File.separator + "Desktop" + File.separator; //$NON-NLS-1$ //$NON-NLS-2$
    // Read in input file from Desktop
    try (Workbook workbook = WorkbookFactory.create(new File(desktop + "input.xlsx"), null, true)) //$NON-NLS-1$
    {
        // Get the first cell
        final String str = workbook.getSheetAt(0).getRow(0).getCell(0).getStringCellValue();
        // Print some analysis on the String
        final String format = "%-30s%-25s%n"; //$NON-NLS-1$
        System.out.format(format, "Representation of String", str); //$NON-NLS-1$
        System.out.format(format, "Length of String", str.length()); //$NON-NLS-1$
        System.out.format(format, "Code Point Length of String", str.codePointCount(0, str.length())); //$NON-NLS-1$
        System.out.format(format, "Multi-byte Characters", Character.isSurrogatePair(str.charAt(0), str.charAt(1))); //$NON-NLS-1$
        // Write out the String using all available encodings listed in java.nio.charset.StandardCharsets.
        for (final Field field : StandardCharsets.class.getDeclaredFields())
        {
            if (field.getType().equals(Charset.class))
            {
                final Charset charset = (Charset) field.get(null);
                Files.write(Paths.get(desktop + "out_" + charset.name() + ".txt"), str.getBytes(charset), StandardOpenOption.CREATE); //$NON-NLS-1$ //$NON-NLS-2$
            }
        }
    }
}
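For reference, the analysis columns above hinge on how Java stores supplementary characters: 😘 is U+1F618, which lies outside the Basic Multilingual Plane, so a Java String holds it as a surrogate pair of two chars. A minimal standalone sketch (the class name is just for illustration):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F618 (face throwing a kiss) is outside the BMP, so Java's
        // UTF-16 String representation stores it as two chars
        String kiss = new String(Character.toChars(0x1F618));
        System.out.println(kiss.length());                         // 2 chars
        System.out.println(kiss.codePointCount(0, kiss.length())); // 1 code point
        System.out.println(Character.isSurrogatePair(kiss.charAt(0), kiss.charAt(1))); // true
        // The two chars are the high and low surrogates
        System.out.printf("U+%04X U+%04X%n", (int) kiss.charAt(0), (int) kiss.charAt(1)); // U+D83D U+DE18
    }
}
```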
Upvotes: 2
Views: 784
Reputation: 61870
As stated in the comments already, this is not an Apache POI problem but a Windows 7 and Notepad++ problem.
After all I have started my Windows 7 virtual machine once again (hopefully for the last time now ;-)). The Windows 7 system does not even provide a font having glyphs for that kind of Unicode code point. So it is pretty clear that nothing that relies on the system can show this 😘 emoji properly. Some browsers (Firefox and Chrome, for example) provide their own fonts and so are able to show it. As @Jakg commented, IntelliJ seems to do that as well, while Eclipse does not.
Other systems, Ubuntu Linux and Windows 10, are able to show the emoji properly. Even Windows 10 still cannot in the CMD command window, though. But inside IDEs which have their own System.out print writer, even System.out shows it.
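The console side can be sketched like this (a minimal illustration; the default charset and the terminal's code page vary per system, so the actual rendering depends on both):

```java
import java.io.PrintStream;

public class ConsoleEncodingDemo {
    public static void main(String[] args) throws Exception {
        // The JVM's default charset drives what System.out emits;
        // on Windows 7 CMD this is usually a legacy code page, not UTF-8
        System.out.println(System.getProperty("file.encoding"));
        // Forcing UTF-8 output only helps if the console (e.g. an IDE's
        // own System.out writer) actually decodes UTF-8 and has the glyph
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("\uD83D\uDE18"); // the 😘 emoji
    }
}
```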
And while creating the text files, one could make detecting the encoding easier for text editors later by putting a BOM (Byte Order Mark) as the first bytes of the text file. Some Charsets (UTF-16, for example) do this by default, but others do not. For those we must provide it in the program.
So after the following code, even Notepad++ should be able to detect the encoding correctly, because all *.txt files will have a BOM.
import org.apache.poi.ss.usermodel.*;

import java.io.FileInputStream;
import java.nio.file.Paths;
import java.nio.file.Files;
import java.nio.file.StandardOpenOption;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExcelReadUnicode {

    public static void main(final String[] args) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(new FileInputStream("input.xlsx"))) {
            String str = workbook.getSheetAt(0).getRow(0).getCell(0).getStringCellValue();
            String format = "%-30s%-25s%n";
            System.out.format(format, "Representation of String", str);
            System.out.format(format, "Length of String", str.length());
            System.out.format(format, "Code Point Length of String", str.codePointCount(0, str.length()));
            System.out.format(format, "Multi-byte Characters", Character.isSurrogatePair(str.charAt(0), str.charAt(1)));
            // These charsets need a BOM added manually
            Charset[] charsets = new Charset[] {
                StandardCharsets.UTF_16BE,
                StandardCharsets.UTF_16LE,
                StandardCharsets.UTF_8
            };
            for (Charset charset : charsets) {
                Files.write(Paths.get("out_" + charset.name() + ".txt"),
                        ("\uFEFF" + str).getBytes(charset), // write out a BOM before the content
                        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
            }
            // UTF-16 puts a BOM by default
            Files.write(Paths.get("out_" + StandardCharsets.UTF_16.name() + ".txt"),
                    str.getBytes(StandardCharsets.UTF_16),
                    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        }
    }
}
Btw.: The term "Charset" is misused for Unicode encodings. The Unicode character set is always Unicode and is always the same. The different UTFs (Unicode Transformation Formats) are different encoding formats for that one Unicode character set.
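A small sketch of that last point: one and the same code point, several different byte sequences depending on the transformation format (the class name is just for this demo):

```java
import java.nio.charset.StandardCharsets;

public class OneCodePointManyEncodings {
    public static void main(String[] args) {
        String kiss = new String(Character.toChars(0x1F618)); // 😘, one code point
        // Same Unicode character, different encoding formats:
        System.out.println(kiss.getBytes(StandardCharsets.UTF_8).length);    // 4 bytes
        System.out.println(kiss.getBytes(StandardCharsets.UTF_16BE).length); // 4 bytes (surrogate pair)
        System.out.println(kiss.getBytes(StandardCharsets.UTF_16).length);   // 6 bytes (2-byte BOM + pair)
    }
}
```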
Upvotes: 1