Rocky
Rocky

Reputation: 129

How to load HTML character codes data correctly into My Sql database?

I receive a data file in ETL from the client and we load the data into Mysql database using Load Data file functionality and use CHARACTER SET as utf8.

  LOAD DATA LOCAL INFILE '${filePath}' 
      INTO TABLE test_staging
      CHARACTER SET 'utf8'
      FIELDS TERMINATED BY '|' 
      LINES TERMINATED BY '\n' 
      (${testcolumns}) SET
      first_name = @first_name;

Data from client

1|"test"|"name"|2
2|"asdf"|asdf&test|2
3|fun|value|2

When I load the above data into the database and it is inserting directly as strings instead of converting to html characters

Database Data

id   first_name       last_name
1   "test" "name"
2   "asdf" asdf&test
3      fun                value

I tried changing the CHARACTER SET value from utf8 to latin1 but the result is same.

I also tried replacing the special characters while loading the data into database but the issue is, I receive all types of html characters data in the file. I cannot keep on adding the replace function for all of them.

  LOAD DATA LOCAL INFILE '${filePath}' 
      INTO TABLE test_staging
      CHARACTER SET 'utf8'
      FIELDS TERMINATED BY '|' 
      LINES TERMINATED BY '\n' 
      (${testcolumns}) SET
      first_name = REPLACE(REPLACE(REPLACE(first_name,''','\''),'"','"'),'&','&');

Is there any character set which converts the html data and loads correctly?

Expected Database Data

id  first_name  last_name
1   "test"       "name"
2   "asdf"       asdf&test
3    fun        value

Any help is appreciated... Thanks

Upvotes: 1

Views: 257

Answers (1)

Mikhail Antonov
Mikhail Antonov

Reputation: 1367

The problem you are facing is not about character set. It happens because the software that your client use intentionally converts HTML special characters to their codes.

It is probably possible to convert them back using MySQL though I couldn't find a quick solution, but as you are handling this data with ETL the better option seems to be to use the external tool before you insert the data into the database. One of these for example:

  • cat input-with-specialchars.html | recode html..ascii
  • xmlstarlet unesc
  • perl -MHTML::Entities -pe 'decode_entities($_);'

etc.

or something else depending on what tools you have available in your system or which ones you can afford to install.

Upvotes: 1

Related Questions