Reputation: 15
I've been dealing with this problem for over a month now. I've checked almost every related solution here and on Google, but I couldn't find anything that really solved my case. I'm trying to download the HTML source of a website, but in most cases some of the text comes back with "?" characters in it, most likely because the site is in Hebrew. Here's my code:
public static InputStream openHttpGetConnection(String url)
throws Exception {
InputStream inputStream = null;
HttpClient httpClient = new DefaultHttpClient();
HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
inputStream = httpResponse.getEntity().getContent();
return inputStream;
}
public static String downloadSource(String url) {
int BUFFER_SIZE = 1024;
InputStream inputStream = null;
try {
inputStream = openHttpGetConnection(url);
} catch (Exception e) {
// TODO: handle exception
}
int bytesRead;
String str = "";
byte[] inpputBuffer = new byte[BUFFER_SIZE];
try {
while ((bytesRead = inputStream.read(inpputBuffer)) > 0) {
String read = new String(inpputBuffer, 0, bytesRead,"UTF-8");
str +=read;
}
} catch (Exception e) {
// TODO: handle exception
}
return str;
}
Thanks.
Upvotes: 0
Views: 4805
Reputation: 9328
Converting an InputStream to a String entails specifying an encoding, just as you do in new String(inpputBuffer, 0, bytesRead,"UTF-8").
But your approach has several drawbacks.
When retrieving HTTP content, you generally cannot know in advance which encoding the response will use. But HTTP provides a mechanism for specifying it: the Content-Type header.
More specifically, your response object should have a Content-Type header with a parameter called charset. In the response, it should look something like:
Content-Type: text/html; charset=UTF-8
You should use whatever comes after the charset= part to transform your bytes to chars.
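As a minimal sketch of that lookup, assuming you already have the raw Content-Type header value as a String: the class and method names below are hypothetical, and the fallback charset is whatever you decide to pass in. In HTTP the parameter is spelled charset.

```java
import java.nio.charset.Charset;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type header value, or the
    // supplied fallback when none is present or the JVM does not know it.
    // (Hypothetical helper, not part of HttpClient.)
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by ';'
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            // Parameter names are case-insensitive
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }
}
```

You would call it with the header value from your response and pass the resulting Charset to whatever does the decoding.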
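Since you seem to be using Apache HttpClient, its documentation states:
You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.
Note that these method names come from the older Commons HttpClient 3.x API. With the DefaultHttpClient (4.x) in your code, the equivalent is EntityUtils.toString(httpResponse.getEntity()), which likewise uses the charset from the Content-Type header and falls back to ISO-8859-1.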
Alternate way
If there is no Content-Type header, and you know your content is HTML, you can decode the bytes with a provisional encoding (preferably UTF-8 or ISO-8859-1), look for a declaration such as <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.
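A rough sketch of that fallback, assuming the raw response bytes are already in memory. The class name, the 4096-byte scan window, and the regular expression are illustrative choices; this is not a full implementation of HTML encoding sniffing.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Matches charset=... inside a <meta ...> tag. Covers both
    // <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if no declaration is found.
    static String sniff(byte[] rawHtml) {
        // Only look at the beginning of the document (assumed window size)
        int len = Math.min(rawHtml.length, 4096);
        // ISO-8859-1 maps every byte to exactly one char, so this provisional
        // decoding cannot fail and leaves the ASCII markup intact
        String head = new String(rawHtml, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

If sniff returns a name, decode the full byte array again with that charset. If it returns null, keep your default.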
The second drawback is that you read an arbitrary number of bytes from the stream and try to convert that chunk to a String, which may not be possible.
In practice, UTF-8 encodes some characters across several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. So say the response consists of two "é" characters. If your first call to read returns:
[c3, a9, c3]
then new String(bytes, offset, length, charsetName) cannot decode the last byte, because on its own it is not a valid UTF-8 sequence, and substitutes a replacement character for it.
Your next read gets what is left:
[a9]
which on its own is not an "é" either and is also replaced.
Bottom line: with your pattern, you cannot reliably convert even a valid UTF-8 byte sequence to a String.
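To see the effect in isolation, here is a small sketch that decodes the same bytes both ways. The class and method names are made up for the demonstration. The chunk boundaries are hard-coded to mimic an unlucky read.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SplitSequenceDemo {
    // Decodes each chunk on its own, as the loop in the question does.
    static String decodePerChunk(byte[][] chunks) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunk : chunks) {
            sb.append(new String(chunk, 0, chunk.length, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Joins all chunks first, then decodes once.
    static String decodeWhole(byte[][] chunks) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            all.write(chunk, 0, chunk.length);
        }
        return new String(all.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With the chunks { c3, a9, c3 } and { a9 }, decodeWhole gives back the two "é" characters, while decodePerChunk gives one "é" followed by replacement characters (U+FFFD) where the split sequence was.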
Going forward: since you use HttpClient, use its own method for converting an HTTP response to a String. If you wish to do it yourself, the easy way is to copy your input to a byte array and then convert the whole byte array. Something along the lines of (pseudo code):
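ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
// copyAllBytes stands in for a read/write loop over the stream
copyAllBytes(responseInputStream, responseContent);
byte[] rawResponse = responseContent.toByteArray();
// decode once, after all the bytes have arrived
String stringResponse = new String(rawResponse, encoding);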
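Spelled out as compilable Java, the buffering approach might look like the following sketch. The class and method names are hypothetical, and the copy loop is one plain way to fill in the copyAllBytes step.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

class BufferedDecode {
    // Reads the whole stream into memory, then decodes it in one go,
    // so no multi-byte sequence can be cut in half.
    static String readFully(InputStream in, Charset encoding) throws IOException {
        ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int bytesRead;
        // read returns -1 at end of stream
        while ((bytesRead = in.read(chunk)) != -1) {
            responseContent.write(chunk, 0, bytesRead);
        }
        return new String(responseContent.toByteArray(), encoding);
    }
}
```

The trade-off is that the entire response sits in memory twice for a moment (once as bytes, once as chars), which is usually fine for an HTML page.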
But you could also use a CharsetDecoder if you want a fully streamed implementation (one that does not buffer the whole response in memory), or, as @jas answers, wrap your inputStream in a Reader and concatenate the output (preferably into a StringBuilder, which should be faster when many concatenations occur).
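For the CharsetDecoder route, a sketch could look like this. The class name, the minimum buffer size of 16, and the choice to replace malformed input are assumptions made for the example. The key step is compact(), which carries an incomplete trailing sequence over to the next read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

class StreamingDecode {
    static String decode(InputStream in, Charset charset, int bufferSize) throws IOException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Keep the buffers comfortably larger than one encoded character
        int capacity = Math.max(bufferSize, 16);
        ByteBuffer bytes = ByteBuffer.allocate(capacity);
        CharBuffer chars = CharBuffer.allocate(capacity);
        StringBuilder out = new StringBuilder();
        boolean endOfInput = false;
        while (!endOfInput) {
            // Fill the free space behind any bytes carried over from the last round
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            if (n == -1) {
                endOfInput = true;
            } else {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            CoderResult result;
            do {
                result = decoder.decode(bytes, chars, endOfInput);
                chars.flip();
                out.append(chars);
                chars.clear();
            } while (result.isOverflow());
            // Move an incomplete trailing sequence to the front for the next read
            bytes.compact();
        }
        // Let the decoder emit anything it still holds internally
        CoderResult flushed;
        do {
            flushed = decoder.flush(chars);
            chars.flip();
            out.append(chars);
            chars.clear();
        } while (flushed.isOverflow());
        return out.toString();
    }
}
```

This still collects the result in a StringBuilder. For a truly streamed pipeline you would hand each decoded chunk to a consumer instead of appending it.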
Upvotes: 1
Reputation: 10865
To read characters from a byte stream with a given encoding, use a Reader. In your case it would be something like:
InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
char[] inputBuffer = new char[BUFFER_SIZE];
int charsRead;
while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) > 0) {
String read = new String(inputBuffer, 0, charsRead);
str += read;
}
You can see that the bytes are read in directly as characters. It is the reader's job to know whether it needs to read one, two, or more bytes to produce each character in the buffer. It's basically your approach, but decoding as the bytes are being read in, instead of after.
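Put together as a self-contained method, and using a StringBuilder rather than repeated String concatenation, the Reader approach could be sketched like this. The class and method names are hypothetical, and the charset is passed in by the caller rather than hard-coded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

class ReaderDecode {
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        // The reader decodes bytes to chars and keeps track of sequences
        // that span two underlying reads; closing it closes the stream too.
        try (Reader reader = new InputStreamReader(in, charset)) {
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, charsRead);
            }
        }
        return sb.toString();
    }
}
```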
Upvotes: 3