activeDev

Reputation: 795

Dealing with character-encoded Twitter responses

I'm building an application that interacts with the Twitter API.

So far my code handles the responses correctly, and I am happy with the way I am interacting with the search API. However, I am stuck when it comes to the actual content of the Twitter API responses.

Right now, I search for tweets with specific hashtags using the Atom feed, i.e.

$url = 'http://search.twitter.com/search.atom?q='.urlencode($hash_tag) ;
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, TRUE);
$xml = curl_exec ($ch);
curl_close ($ch);

$twelement = new SimpleXMLElement($xml);

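echo "<pre>";
foreach ($twelement->entry as $entry) {
    // Cast to string so the mbstring function receives plain text, not a SimpleXMLElement.
    $name = (string) $entry->author->name;

    echo $name;
    echo '<br />';
    // Report the encoding mbstring guesses for this author name.
    echo mb_detect_encoding($name);
    echo '<br />';
}
echo "</pre>";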

I have been trying different PHP functions to decode/convert to the correct character encoding, but no matter what I do, I always end up with the wrong output.

My output from this code is (names crossed out for privacy):

xxxxxx (xxxxx xxxxxxx)
ASCII

xxxx_xxxxx (Chinny ♥_♥)
UTF-8

kunlemyk ((˘̯˘ ) hardekhunley™)
UTF-8

xxxx_xxxxx (♥ify okwuosa♥)
UTF-8

xxx_xxxx (Call me DRO)
ASCII

Why are some ASCII and some UTF-8? How can I ensure they are consistent? Can I convert them all to ASCII? I'm pretty lost here. I have been stuck on this for ages and would really appreciate some help.

Regards,

Andrew

Upvotes: 2

Views: 1189

Answers (2)

goat

Reputation: 31813

UTF-8 was specifically designed so that ASCII is a proper subset of it. This was done for backwards compatibility.

A function that detects an encoding usually does so by educated guessing after inspecting the byte values. If the string in question contains nothing but ASCII characters, it could be called either ASCII or UTF-8. Again, this is because an ASCII string is a valid UTF-8 string by design.

It makes more sense to call a pure ASCII string "ASCII", because that is the more specific label: when guessing, you only really know for sure that it's ASCII if all you've encountered are ASCII characters. If the string contains at least one non-ASCII (multibyte) UTF-8 character and the rest is ASCII, the function should detect it as UTF-8. But without seeing at least one such character, the detector has no reason to report UTF-8 rather than ASCII.
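As a rough illustration of the point above, here is a small sketch. The helper name describe_encoding is made up for this example, and the sample strings are modelled on the names in the question's output. It only assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the question): returns the detector's guess
// and whether the same bytes also form a valid UTF-8 string.
function describe_encoding(string $s): array
{
    return [
        // Strict detection over an explicit candidate list, narrowest label first.
        'detected'   => mb_detect_encoding($s, ['ASCII', 'UTF-8'], true),
        // Validity check: pure-ASCII input also passes this.
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'),
    ];
}

// Pure-ASCII name: the question's output reports "ASCII" for this one.
$plain = describe_encoding('Call me DRO');

// Name containing U+2665 (a heart), which cannot be ASCII.
$fancy = describe_encoding("Chinny \u{2665}_\u{2665}");

var_dump($plain, $fancy);
```

Both strings pass the UTF-8 validity check, so the differing labels do not mean the feed mixes encodings.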

Edit: as for what to do about it? Again, an ASCII string is a valid UTF-8 string, so you should just use UTF-8, as that will work for both types. Make sure to declare this via a real HTTP header, not a <meta> tag.

header('Content-Type: text/html; charset=utf-8');
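Combined with the loop from the question, this could look roughly like the sketch below. The helper name render_author is hypothetical, and escaping the name with htmlspecialchars is an extra step that goes beyond the advice above; it is included on the assumption that the names end up in an HTML page.

```php
<?php
// Hypothetical helper: treat every name as UTF-8 and escape it for HTML output.
function render_author(string $name): string
{
    // Assumption: the feed is UTF-8 (ASCII-only names are valid UTF-8 too).
    if (!mb_check_encoding($name, 'UTF-8')) {
        // Not expected for this feed; fail loudly rather than guess a conversion.
        throw new InvalidArgumentException('Name is not valid UTF-8');
    }
    return htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
}

// Declare the charset before any output is sent (skipped on the command line).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo render_author("Chinny \u{2665}_\u{2665}"), "<br />\n";
echo render_author('Call me DRO'), "<br />\n";
```

No conversion step is needed here: the same code path handles names that were reported as ASCII and names that were reported as UTF-8.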

Upvotes: 2

Bamdad Dashtban

Reputation: 354

Take a look at this post.

You might want to search for methods to detect encoding.

Upvotes: 0
