Reputation: 136
I have a file that contains UTF-8 data. The file has no BOM (byte order mark) and no length/size prefix for each Unicode word or line.
I want to read bytes (yes, bytes!) from a given offset, for a given length. An API with functions such as seek, read bytes, or read bytes from an offset would be really helpful.
Example content: "100° Info". The length of this content is 9, so if I request 9 bytes, it should read everything. Currently it reads only 8. It looks like the API is treating the Unicode character as 2 chars.
How do I read the content correctly, and which API should I use?
Upvotes: 0
Views: 3282
Reputation: 16399
But the Unicode character for degrees actually is two bytes when encoded as UTF-8. A degree symbol is represented by the bytes c2 b0. You can use RandomAccessFile in Java if you really want to read bytes at specific offsets in a file, but I doubt that's what you really want.
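If raw bytes at an offset really are what is needed, a minimal sketch with RandomAccessFile could look like the following. The class name ByteRangeReader and the method readBytes are made up for illustration; seek and readFully are the standard java.io.RandomAccessFile methods.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class ByteRangeReader {
    /**
     * Reads exactly `length` bytes starting at byte position `offset`.
     * Hypothetical helper, not part of any library.
     */
    static byte[] readBytes(String path, long offset, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);              // offset is counted in bytes, not characters
            byte[] buffer = new byte[length];
            file.readFully(buffer);         // throws EOFException if fewer bytes remain
            return buffer;
        }
    }
}
```

Note that an offset or length chosen without regard to character boundaries can land in the middle of a multi-byte sequence such as c2 b0, so the returned bytes are not guaranteed to decode cleanly on their own.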
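Probably the easiest way to do what it seems you want is to use a Reader that decodes UTF-8 explicitly (an InputStreamReader with a named charset, rather than a plain FileReader, which falls back to the platform default encoding unless a charset is passed, something only possible since Java 11) and either read into an array of char of size 9, or read just 9 characters into a larger char array. For example: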
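// Needs java.io.* and java.nio.charset.StandardCharsets
try (Reader reader = new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_8)) {
    char[] buffer = new char[1024];
    // read() may return fewer than 9 chars, or -1 at end of file
    int charsRead = reader.read(buffer, 0, 9);
}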
Upvotes: 2
Reputation: 1155
You can of course read the content into a string and then use String.getBytes("UTF-8") to get the bytes for a given string. In your outlined case this would return all of the bytes: 10 of them for the 9 characters, since the degree symbol takes two.
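A small sketch of that character count versus byte count difference, using the example string from the question. The class and method names are illustrative only, and StandardCharsets.UTF_8 is used in place of the charset name string to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    /** Number of UTF-16 chars in the string (what String.length() reports). */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "100\u00b0 Info";               // "100° Info"
        System.out.println(charCount(text));          // prints 9
        System.out.println(utf8ByteCount(text));      // prints 10
    }
}
```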
Upvotes: 0
Reputation: 4093
I have a feeling you are confusing characters and bytes. The text 100° Info has nine characters, but that would be ten bytes due to the degrees symbol being stored as two bytes. If you read nine bytes you would miss the o from Info, but this would still parse as a string since it's a single-byte character.
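To see that effect without a file, here is a rough sketch that encodes the example text, keeps only the first few bytes, and decodes them again. The helper decodePrefix is a hypothetical name used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncatedRead {
    /**
     * Encodes `text` as UTF-8, keeps at most `byteCount` leading bytes,
     * and decodes that prefix back into a String.
     */
    static String decodePrefix(String text, int byteCount) {
        byte[] all = text.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = Arrays.copyOf(all, Math.min(byteCount, all.length));
        return new String(prefix, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Nine bytes cover everything except the final single-byte 'o'.
        System.out.println(decodePrefix("100\u00b0 Info", 9)); // prints 100° Inf
    }
}
```

Had the cut fallen between the two bytes of the degree symbol instead (for example after 4 bytes), the String constructor would substitute the replacement character U+FFFD for the incomplete sequence rather than throw.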
Upvotes: 0