Reputation: 20200
I was reading this
Should source code be saved in UTF-8 format
and I am using the Eclipse compiler library, but I need to read some Java source files in order to feed them to that library. From that post, it seems source files can be stored in different encodings.
Is there one Charset I can use to read them in so that it works every time? Charset.forName("UTF-8"), maybe?
Thanks, Dean
Upvotes: 5
Views: 7725
Reputation: 338785
Any tool can write Java source code in any encoding. Even the idea of a .java file is not defined by the Java Language Spec. Any IDE can persist Java source code any way it wants† with any encoding.
The tools are responsible for ultimately providing a Unicode-compliant stream of characters into the compiler toolchain. How they collect and persist the source code is up to the particular tools.
The Java Language Specification states in Chapter 3 Lexical Structure:
Programs are written using the Unicode character set. Information about this character set and its associated character encodings may be found at http://www.unicode.org/.
So presumably a Java source code file would use one of the character encodings commonly used with Unicode, such as UTF-8, UTF-16, or UCS-2.
Section 3.2 Lexical Translations mentions that a Java program could use an encoding such as ASCII by embedding Unicode escapes:
A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx.
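To make the escape rule concrete, here is a minimal sketch of expanding Unicode escapes in text you have already decoded. The class and method names (`UnicodeEscapes`, `expand`) are hypothetical, and the handling of backslash pairs and repeated `u` markers follows my reading of JLS §3.3. A compiler typically performs this step itself, so you would likely need it only if you inspect the source text on your own.

```java
public class UnicodeEscapes {

    /** Returns the value of an ASCII hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /**
     * Expands Unicode escapes (backslash, one or more 'u', four hex digits)
     * into UTF-16 code units. A backslash that follows an odd number of
     * contiguous backslashes does not start an escape.
     */
    public static String expand(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int n = src.length();
        int i = 0;
        while (i < n) {
            char c = src.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
            } else if (i + 1 < n && src.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < n && src.charAt(j) == 'u') j++;   // skip repeated 'u' markers
                if (j + 4 > n) {
                    throw new IllegalArgumentException("Truncated Unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int d = hexValue(src.charAt(k));
                    if (d < 0) {
                        throw new IllegalArgumentException("Bad hex digit at index " + k);
                    }
                    value = value * 16 + d;
                }
                out.append((char) value);                     // one UTF-16 code unit
                i = j + 4;
            } else if (i + 1 < n && src.charAt(i + 1) == '\\') {
                out.append("\\\\");                           // copy the pair; the second backslash cannot start an escape
                i += 2;
            } else {
                out.append(c);                                // lone backslash, e.g. before 'n'
                i++;
            }
        }
        return out.toString();
    }
}
```

Because each escape yields a single UTF-16 code unit, a character outside the Basic Multilingual Plane is written as two consecutive escapes (a surrogate pair), which this sketch simply appends one after the other.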
While UTF-8 is common in my experience, that is not the only possible encoding. You must know or guess the encoding of any particular source file, and you must account for expanding any Unicode escapes.
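One way to avoid silently mis-decoding a file when you have to guess is to decode strictly and fall back through a list of candidate charsets. The sketch below assumes you choose the candidates and their order yourself; the names `SourceReader`, `decodeStrict`, and `readSource` are hypothetical, while the `CharsetDecoder` and `CodingErrorAction` calls are standard `java.nio.charset` API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

public class SourceReader {

    /** Decodes strictly: invalid bytes raise an exception instead of becoming U+FFFD. */
    public static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Tries each candidate charset in order and returns the first clean decode. */
    public static String readSource(Path file, Charset... candidates) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        for (Charset candidate : candidates) {
            try {
                return decodeStrict(bytes, candidate);
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset; try the next guess.
            }
        }
        throw new IOException("No candidate charset could decode " + file);
    }
}
```

A call might look like `SourceReader.readSource(path, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)`. Put ISO-8859-1 last if you use it, since it accepts every byte sequence and therefore never fails. A clean decode only shows that the bytes are valid in that charset, not that the guess was right.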
By the way, note that at least in the Oracle JDK, the optional byte order mark (BOM) at the start of a UTF-8 file is not accepted in Java source, due to a bug (JDK-4508058) that will never be fixed because of backward-compatibility concerns.
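If you decode the file yourself, a leading BOM shows up as the character U+FEFF at the start of the decoded text. A minimal sketch for dropping it before handing the text on, assuming the downstream tool does not want it (`BomStripper` and `stripBom` are hypothetical names):

```java
public class BomStripper {

    /** Removes a single leading U+FEFF (decoded byte order mark), if present. */
    public static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }
}
```

Only the first character is checked, so a U+FEFF appearing later in the file is left untouched.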
Also note that line terminators may vary: the ASCII characters CR (CARRIAGE RETURN), LF (LINE FEED), or the pair CR LF.
White space varies: SPACE (SP), CHARACTER TABULATION (HT) (horizontal tab), FORM FEED (FF), and line terminators.
Read the spec for additional details. For example, regarding the SUBSTITUTE character:
As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.
Be sure you understand the basics of Unicode and of character encoding. Best place to start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.
† Even supposed rules such as “one public class per .java file” may be defined by particular tools rather than by Java itself. The CodeWarrior tools for Java way-back-when supported multiple classes per file.
Upvotes: 7