watts

Reputation: 95

Rails PG::Error: ERROR: invalid byte sequence for encoding "UTF8": 0xeda0bc

I'm running into an error while trying to write tweets to my PostgreSQL database.

I've searched the internet high and low (perhaps not well enough) for an answer, to no avail. I've looked at the answers here, but the suggestion was to convert the string to UTF-8 (even though the response headers claim it's UTF-8 already).

I did so with this code:

# get the data from twitter
response = RestClient.get "http://search.twitter.com/search.json?rpp=100&since_id=238726971826253824&q=love"

# find the data encoding using CharDet
data = CharDet.detect(response.body)
encoding = data['encoding']

# create a new instance of Iconv with UTF-8 and then convert response.body
ic = Iconv.new('UTF-8//IGNORE', encoding)
converted_response = ic.iconv(response.body + '  ')[0..-2]

# parse the JSON text into a Ruby Hash
response_json = ActiveSupport::JSON.decode(converted_response)


We then parse response_json and create tweets inside our database. However, when doing so, we get the error below.

  SQL (0.1ms)   BEGIN
  SQL (0.0ms)   PG::Error: ERROR: invalid byte sequence for encoding "UTF8": 0xeda0bc
: INSERT INTO "tweets" ("from_user_id", "approved", "from_user", "has_profanity",    "twitter_search_id", "twitter_id", "posted_at", "updated_at", "iso_language_code", "profile_image_url", "text", "created_at", "archived", "geo", "to_user_id", "to_user", "metadata", "source", "event_id") VALUES(573857675, NULL, 'nataliekiro', NULL, 618, 238825898718162944, '2012-08-24 02:31:46.000000', '2012-08-24 02:32:05.166492', 'en', 'http://a0.twimg.com/profile_images/2341785780/image_normal.jpg', 'Happy Birthday @daughternumber1 🎂 Love You 😘', '2012-08-24 02:32:05.166492', 'f', NULL, 0, NULL, 
'--- !map:HashWithIndifferentAccess 
result_type: recent

I've also checked the class of response_json (it returns Hash), even though the end of that error mentions HashWithIndifferentAccess.

Anyone else have similar issues & know of a solution?

Thanks!

Upvotes: 2

Views: 895

Answers (1)

watts

Reputation: 95

I found a solution that worked! Not sure if it was the best of examples, as I'm new to Rails/Ruby - but it seems to have at least worked for the time being!

As you can see in my example above, I was trying to convert the entire response.body to UTF-8. This was proving to be unsuccessful.
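One possible reason (an assumption, not verified against the actual Twitter response): JSON APIs often send characters outside the Basic Multilingual Plane as a pair of \uXXXX escapes. In that case the raw body is plain ASCII, so a byte-level conversion of response.body has nothing to fix, and any invalid bytes can only appear once the JSON is decoded. The sketch below uses a made-up body and Ruby's standard json library (not ActiveSupport::JSON.decode) to show what a correct decode of such a pair looks like.

```ruby
require 'json'

# Made-up body for illustration: the emoji is written as a JSON surrogate-pair
# escape, so every character in the raw body is plain ASCII.
body = '{"text":"Happy Birthday \ud83c\udf82"}'
puts body.ascii_only?

# A decoder that handles surrogate pairs joins the two escapes into a single
# code point and produces valid UTF-8.
text = JSON.parse(body)['text']
puts text.valid_encoding?
puts format('U+%X', text.codepoints.last)
```

If a decoder instead converts each \uXXXX escape on its own, each half of the pair ends up as a separate three-byte sequence, which is not valid UTF-8.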

Looking at the data being retrieved, the only field that could contain invalid UTF-8 bytes is the tweet's status text. Twitter restricts user names to plain ASCII characters (letters, digits, and underscores). Since I'm only storing user names, status texts, and tweet ids, that leaves the status text. Looking at some of the statuses being pulled from Twitter, some users were using emoticons and similar characters within their tweets.
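As a sanity check on that reasoning, the three bytes from the error message can be inspected directly. This is a minimal sketch, not part of the original code: it decodes 0xeda0bc by hand and shows that the result falls in the UTF-16 high-surrogate range, which is consistent with an emoji having been split into surrogate halves somewhere before the INSERT.

```ruby
# The three bytes PostgreSQL complains about, taken from the error message.
bytes = "\xED\xA0\xBC".force_encoding('UTF-8')

# Ruby agrees with PostgreSQL: this is not well-formed UTF-8.
puts bytes.valid_encoding?

# Decode the three-byte pattern (1110xxxx 10xxxxxx 10xxxxxx) by hand.
b1, b2, b3 = bytes.bytes
code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
puts format('U+%04X', code_point)

# UTF-16 high surrogates occupy U+D800..U+DBFF and are not allowed in UTF-8.
high_surrogate = (0xD800..0xDBFF).cover?(code_point)
puts high_surrogate
```

U+D83C is the high surrogate of U+1F382 (the birthday cake emoji), which matches the first emoji in the tweet text shown in the failing INSERT.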

The solution for me was to convert the individual status text to UTF-8, then re-assign it within the Hash.

def parse_response!
  tweets_json = response_json['results'].reverse rescue []
  tweets << tweets_json.collect do |tweet_json|

    # trying to fix encoding issue!
    data = CharDet.detect(tweet_json['text'])
    encoding = data['encoding']
    ic = Iconv.new('UTF-8//IGNORE', encoding)
    converted_response = ic.iconv(tweet_json['text'] + '  ')[0..-2]
    # after converting, put back into value
    tweet_json['text'] = converted_response

    # ... etc
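
Iconv was removed from Ruby's standard library in Ruby 2.0, so the snippet above only runs on older Rubies or with the iconv gem. On Ruby 2.1 or later, a similar per-field cleanup can be sketched with String#scrub. The helper name clean_utf8 and the sample hash are hypothetical, and this version assumes the text is meant to be UTF-8 rather than detecting the encoding with CharDet. Like //IGNORE, it discards the invalid bytes (and with them the emoji) instead of restoring the original character.

```ruby
# Hypothetical helper (not from the original post): returns a copy of `text`
# that is guaranteed to be valid UTF-8.
def clean_utf8(text)
  # Tag a copy as UTF-8; force_encoding relabels the bytes without changing them.
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  # String#scrub replaces every invalid byte sequence; passing '' drops it.
  utf8.scrub('')
end

# Sample data for illustration, with the invalid bytes from the error message.
tweet_json = { 'text' => "Happy Birthday \xED\xA0\xBC Love You" }
tweet_json['text'] = clean_utf8(tweet_json['text'])

puts tweet_json['text']
puts tweet_json['text'].valid_encoding?
```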

Talk about a learning process!

Thanks @CraigRinger for your help!

Upvotes: 1
