Calvin Froedge

Reputation: 16373

Scraping - character encoding

I'm scraping data from large tables on the web to populate a database. Some characters display fine on my screen but come out like this when I scrape: ! Åland Islands

I'm using file_get_contents to grab the raw data. It looks fine right after scraping (i.e., if I just var_dump the raw result): Åland Islands

I then turn the data into an array and write it to a text file or SQL file. What do I need to do to preserve the character encoding?

Upvotes: 1

Views: 474

Answers (1)

Jukka K. Korpela

Reputation: 201896

When “Å” turns into “!¬†√Ö” (five characters), the most probable cause is two or more incorrect character encoding conversions. A single incorrect conversion tends to turn a character into a different character, or into a pair or perhaps a triplet of characters, but hardly into five.
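To make the character-count argument concrete, here is a minimal PHP sketch of how repeated misconversion inflates a single character. It assumes the wrong encoding is Windows-1252, which is only an illustration and not necessarily what happened in the question; `misconvert` is a hypothetical helper name, and the mbstring extension is assumed to be available.

```php
<?php
// "Å" (U+00C5) written as an escape so the result does not depend on
// how this source file is saved. Its UTF-8 bytes are C3 85.
$original = "\u{00C5}";

// Hypothetical helper: treat the UTF-8 bytes as if they were
// Windows-1252 (an assumed wrong encoding) and re-encode as UTF-8.
// Each call is one incorrect conversion.
function misconvert(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

$once  = misconvert($original); // one wrong conversion
$twice = misconvert($once);     // two wrong conversions

echo mb_strlen($original, 'UTF-8'), "\n"; // 1
echo mb_strlen($once, 'UTF-8'), "\n";     // 2
echo mb_strlen($twice, 'UTF-8'), "\n";    // 5

// Undoing the same steps in reverse order restores the original.
$repaired = mb_convert_encoding(
    mb_convert_encoding($twice, 'Windows-1252', 'UTF-8'),
    'Windows-1252',
    'UTF-8'
);
var_dump($repaired === $original); // bool(true)
```

Under this assumption, one wrong conversion yields two characters and a second one yields five, which is why a five-character result points at more than one faulty step.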

If things look OK when you dump the data right after scraping, then you need to find out which character encoding is in use and check how you are writing the data to a file. If the data is UTF-8 encoded, as I suspect (a compilation of geographic names from around the world more or less needs to be), then the write operation must treat the data as UTF-8, and whatever software you use to inspect the resulting file must read it as UTF-8, too.
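As a rough sketch of that check in PHP: the snippet below verifies that a string is valid UTF-8, writes it out unchanged, and reads it back to confirm the bytes survived. The sample string and the temporary file are stand-ins for the scraped data and the real output file, and the mbstring extension is assumed to be available.

```php
<?php
// Stand-in for one scraped value; assumed to already be UTF-8.
$name = "\u{00C5}land Islands";

// Check the assumption before writing, rather than guessing an encoding.
if (!mb_check_encoding($name, 'UTF-8')) {
    throw new RuntimeException('Scraped data is not valid UTF-8');
}

// file_put_contents writes the bytes as they are; it performs no
// character conversion of its own.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, $name . "\n");

// Read the file back and compare byte for byte.
$readBack = rtrim(file_get_contents($path), "\n");
var_dump($readBack === $name); // bool(true)

// The first two bytes should be the UTF-8 encoding of "Å".
echo bin2hex(substr($readBack, 0, 2)), "\n"; // c385

unlink($path);
```

If the bytes on disk are correct but the text still looks wrong, the viewer is probably decoding the file with another encoding. If the data later goes into MySQL through mysqli, the connection character set matters as well, for example via `$mysqli->set_charset('utf8mb4')`.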

Upvotes: 1
