Reputation: 642
I'm having some difficulty digesting some of the concepts in the Java IO classes. For instance, there are two types of streams, byte and char. Byte streams, as I understand it, read byte by byte.
1. If a char in java is stored as a 16bit (two byte) data type, how is it possible for me to accurately read a char, say 'A', from a file using a byte oriented input stream eg. FileInputStream?
2. Is it that the chars I used (mostly between 0 and 122 on the ascii chart) are stored in one byte of the two bytes allocated?
3. DataInputStream/DataOutputStream allow me to read and write binary data; other streams like FileInputStream/FileOutputStream allow me to read and write what exactly? I basically want to know which stream to use when I wish to output data as text I can read (using a simple text editor like notepad) versus when I want it encoded as raw binary data (text that looks like garbage in notepad).
Struggling to understand the concept of streams in java and which to use when.
Upvotes: 4
Views: 3830
Reputation: 5598
It depends on the format of the file you are reading.
If the file is a stream of ASCII bytes, then do this:
InputStream is = new FileInputStream( filePath );
Reader reader = new InputStreamReader( is, "ISO-8859-1" );
int c = reader.read();   // read() returns an int, or -1 at end of stream
char ch = (char) c;
You always first open the input stream on the byte-oriented file. Then the InputStreamReader converts the bytes to characters. In this case, ISO-8859-1 maps each single byte value to the character with exactly the same value. Other mappings are clearly possible, but ISO-8859-1 happens to be the same as the first 256 characters of the Unicode set, and the first 128 of those happen to be the same as ASCII.
When writing use:
OutputStream os = new FileOutputStream( filePath ) ;
Writer w = new OutputStreamWriter( os, "ISO-8859-1" );
w.write( ch );
Once again, it is the OutputStreamWriter that converts between characters and the byte stream according to the ISO-8859-1 character set. The resulting file will have one byte per character.
Here are a few more examples of proper basic stream patterns.
If using the above you execute this:
w.write("AAAA");
w.flush();
w.close();
The resulting file will contain 4 bytes with the value 65 in each byte. Reading that file back in using the code at the top will result in four "A" characters in memory, but in memory they take up 16 bits for each char.
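As a rough, self-contained sketch of that round trip, the following uses an in-memory byte array in place of the file (that substitution, and the class and method names, are assumptions made purely for illustration). It shows the four one-byte values going out and the four 16-bit chars coming back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: a byte array stands in for the file on disk.
public class Latin1RoundTrip {

    // Encodes text as ISO-8859-1, one byte per character.
    static byte[] encode(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.ISO_8859_1)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before we read the bytes.
        return sink.toByteArray();
    }

    // Decodes ISO-8859-1 bytes back into chars, one read() at a time.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            int c;
            while ((c = r.read()) != -1) {   // -1 signals end of stream
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("AAAA");
        System.out.println(bytes.length);   // 4
        System.out.println(bytes[0]);       // 65
        System.out.println(decode(bytes));  // AAAA
    }
}
```

With a real file, swapping the byte-array streams for FileOutputStream and FileInputStream should behave the same way.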
If the file is encoded in a different character set, including possibly multiple byte characters, then simply use the right encoding in the InputStreamReader/OutputStreamWriter and the proper conversion will take place while reading and writing.
UTF-8 is not a character set but an encoding of the regular Unicode characters into byte sequences. It is quite clever: the first 128 Unicode characters (values 0 through 127) are mapped to the same byte values, as single bytes by themselves. Characters >= 128 use 2 or more byte values in a row, where each of those byte values is >= 128. If you know that the ASCII file only uses "7-bit" ASCII, then UTF-8 will work for you as well. For Java in general, UTF-8 is the best encoding to use for a file because it can encode every Unicode character without loss.
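To make the variable-length behaviour concrete, here is a small sketch that counts the UTF-8 bytes for three sample characters. The class name, helper names, and choice of sample characters are illustrative assumptions; the characters are written as Unicode escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing how many bytes UTF-8 uses per character.
public class Utf8Lengths {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when every UTF-8 byte of the string has its high bit set,
    // i.e. its unsigned value is >= 128.
    static boolean allBytesHighBitSet(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));                  // 1 (U+0041)
        System.out.println(utf8Length("\u00E9"));             // 2 (U+00E9, e with acute)
        System.out.println(utf8Length("\u20AC"));             // 3 (U+20AC, euro sign)
        System.out.println(allBytesHighBitSet("\u20AC"));     // true
        System.out.println(allBytesHighBitSet("A"));          // false
    }
}
```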
Learning this about streams is very important. I recommend you do not try to convert bytes to characters in any other way. It is possible, of course, but it is a waste of effort since the conversions in the streams are very reliable and correct.
(It gets worse ... a Unicode character (a code point) can actually need up to 21 bits. The ones that do not fit in 16 bits are encoded as sequences of two 16-bit char values, using an encoding called UTF-16. I recommend you ignore that for now, but just be aware that even in a Java String, which is composed of 16-bit char values, there are some double-char sequences.)
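For the curious, a minimal sketch of such a double-char sequence follows. The class and helper names are hypothetical, and U+1F600 is just one example of a code point above U+FFFF.

```java
// Hypothetical demo class: one Unicode code point that needs two Java chars.
public class SurrogateDemo {

    // Builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F600);  // a code point above U+FFFF

        System.out.println(s.length());                             // 2 chars
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```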
Upvotes: 0
Reputation: 17923
Before I try to answer your question, there are a few very basic things to understand.
At the lowest level (InputStream/OutputStream), everything is bits and bytes, so the lowest-level streams deal with raw data, which is bits/bytes. Text is that same raw data interpreted through a character encoding (such as UTF-8). Now coming to your questions:
If a char in java is stored as a 16bit (two byte) data type, how is it possible for me to accurately read a char, say 'A', from a file using a byte oriented input stream eg. FileInputStream?
For reading the character data, the raw input streams are wrapped in character oriented streams, example
FileInputStream fis = new FileInputStream("test.txt");
InputStreamReader isr = new InputStreamReader(fis, "UTF8");
As the javadoc says, InputStreamReader is a bridge from byte streams to character streams.
Is it that the chars I used (mostly between 0 and 122 on the ascii chart) are stored in one byte of the two bytes allocated?
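Yes, as far as the file is concerned. ASCII is a subset of the larger Unicode set, and an encoding like UTF-8 stores each ASCII character as a single byte. In memory, a Java char still takes 16 bits.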
DataInputStream/DataOutputStream allows me to read and write binary data, other input streams like FileInputStream/FileOutputStream allows me to read and write what exactly?
DataInputStream/DataOutputStream are for reading and writing Java primitive values (int, double and so on) in a binary format, whereas FileInputStream/FileOutputStream are for raw bytes. Neither does any character encoding for you.
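A small sketch may help show the difference between binary and text output. It writes the same int once with DataOutputStream and once as text through a Writer, using in-memory byte arrays instead of files; the class and method names are assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: the same number written as binary and as text.
public class BinaryVsText {

    // Binary form: writeInt always emits 4 bytes, high byte first.
    static byte[] asBinary(int value) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Text form: the decimal digits, encoded as UTF-8 characters.
    static byte[] asText(int value) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(Integer.toString(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(asBinary(65))); // [0, 0, 0, 65]
        System.out.println(Arrays.toString(asText(65)));   // [54, 53]
    }
}
```

Opened in a text editor, the text form reads as 65, while the binary form shows three unprintable NUL bytes followed by an A.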
I basically want to know which stream to use when I wish to output data as text I can read (using a simple text editor like notepad) versus when I want it encoded as raw binary data (text that looks like garbage in notepad)?
For text use any Readers/Writers (Here is an example)
Upvotes: 1
Reputation: 280054
If a char in java is stored as a 16bit (two byte) data type, how is it possible for me to accurately read a char, say 'A', from a file using a byte oriented input stream eg. FileInputStream?
Try doing
System.out.println(Integer.toBinaryString('A'));
which prints out the binary representation of the character 'A'. This prints
1000001
Since 'A' is a char, it's actually stored with 16 bits
00000000 01000001
So, when the data holds the char as those two bytes, all you have to do is read two sequential bytes and use them appropriately to form a char. See that in action
ByteBuffer buffer = ByteBuffer.wrap(new byte[] {0b00000000, 0b01000001});
System.out.println(buffer.getChar());
which prints
A
What this does is take the first byte in the array and use it as the first 8 bits in the char and the second byte as the last 8 bits.
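The same combination can be written out by hand with shifts and masks. The sketch below is only an illustration (the class and method names are made up); it assumes the high byte comes first, which matches ByteBuffer's default big-endian order.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class: building a 16-bit char from two bytes by hand.
public class BytesToChar {

    // Masks each byte to 0..255 (Java bytes are signed), then places the
    // first byte in the upper 8 bits and the second in the lower 8 bits.
    static char combine(byte high, byte low) {
        return (char) (((high & 0xFF) << 8) | (low & 0xFF));
    }

    public static void main(String[] args) {
        byte high = 0b00000000;
        byte low = 0b01000001;

        System.out.println(combine(high, low));                               // A
        System.out.println(ByteBuffer.wrap(new byte[] {high, low}).getChar()); // A
    }
}
```

This only recovers the right character when the bytes really are a big-endian 16-bit encoding; a file written as ASCII or UTF-8 holds 'A' as a single byte instead.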
DataInputStream/DataOutputStream allows me to read and write binary data, other input streams like FileInputStream/FileOutputStream allows me to read and write what exactly? I basically want to know which stream to use when I wish to output data as text I can read (using a simple text editor like notepad) versus when I want it encoded as raw binary data (text that looks like garbage in notepad)?
Whether you are writing text or anything else, it's all bits and bytes. You can very well do
"someString".getBytes()
and write those. So it doesn't really matter. Use what is most representative of what you are doing. Typically, you could wrap the underlying OutputStream with a PrintWriter and the underlying InputStream with a Scanner or BufferedReader.
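As a minimal sketch of that wrapping, the following writes lines through a PrintWriter and reads them back through a BufferedReader. In-memory byte arrays stand in for real files, and the class and method names are assumptions for illustration; the explicit UTF-8 charset is a choice made here rather than something the wrappers require.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: text I/O layered on top of byte streams.
public class WrappedStreams {

    // Wraps an OutputStream in a PrintWriter and writes one line per entry.
    static byte[] writeLines(List<String> lines) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.println(line);
            }
        }
        return sink.toByteArray();
    }

    // Wraps an InputStream in a BufferedReader and reads it line by line.
    static List<String> readLines(byte[] bytes) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {  // null signals end of stream
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] bytes = writeLines(Arrays.asList("hello", "world"));
        System.out.println(readLines(bytes));  // [hello, world]
    }
}
```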
Upvotes: 1