Reputation: 747
I have a Perl script that reads from a web service and saves the data in a MySQL table. The table uses latin1. The web service returns some characters that the table cannot store, and I need to remove them before saving; otherwise they are stored as '?'.
I wanted to do something like: $desc=~s///gsi;
but it does not remove them.
The web service that returns the problematic characters is: https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478
I am using a user agent to fetch the data. It seems to arrive as UTF-8, but these characters need to be removed:
my $ua = LWP::UserAgent->new ();
$ua->default_headers->push_header ('Accept' =>
"text/html,application/xhtml" .
"+xml,application/xml");
$ua->default_headers->push_header ('Accept-Charset' => "utf-8");
my $doc = $ua->get ("https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478");
Upvotes: 0
Views: 2024
Reputation: 142316
Change your column definition to CHARACTER SET utf8mb4
so that the naughty character does not need to be removed, and can actually be stored.
Upvotes: 0
Reputation: 23870
If you just want to remove the characters outside the 7-bit ASCII set (which is sufficient to display messages in English), you can do this:
$desc =~ s/[^\x00-\x7f]//g;
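As a rough illustration of the substitution above, here is a minimal sketch that wraps it in a helper and runs it on a sample string. The name `strip_non_ascii` and the sample text are made up for this example; the sample is written with `\x{...}` escapes so the script does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove every
# character whose code point is above 0x7F.
sub strip_non_ascii {
    my ($text) = @_;
    (my $clean = $text) =~ s/[^\x00-\x7f]//g;
    return $clean;
}

# Sample input: "caf" + e-acute, an en dash (U+2013), "10" + euro sign.
my $desc = "caf\x{e9} \x{2013} 10\x{20ac}";

# Only the ASCII characters survive, including both spaces.
print strip_non_ascii($desc), "\n";
```

Note that this also drops characters such as é that latin1 could have stored, so it is the bluntest of the options.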
Edit: If you want something more elaborate that supports the entire latin-1 set, you can do this:
use Encode;
$desc = encode('latin-1', $desc, sub {''});
This will remove exactly the characters that cannot be represented in latin-1. Note that this line expects $desc to be a decoded character string (with the utf-8 flag on), and that the resulting string will have the utf-8 flag off.
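To make the note about decoded input concrete, here is a small sketch that starts from raw UTF-8 bytes, decodes them, and then encodes to Latin-1 with a code reference as the fallback. The helper name `to_latin1_lossy` and the sample bytes are assumptions for illustration; `decode` and `encode` come from the core Encode module, and `iso-8859-1` is used here as the canonical name for latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: encode a character string to Latin-1 bytes,
# silently dropping anything Latin-1 cannot represent.
sub to_latin1_lossy {
    my ($text) = @_;
    # Encode calls the code reference with the ordinal of each
    # unmappable character; returning '' removes that character.
    return encode('iso-8859-1', $text, sub { '' });
}

# Raw UTF-8 bytes as they might arrive over the wire (sample data):
# "caf" + e-acute, space, en dash, space, "10" + euro sign.
my $raw  = "caf\xc3\xa9 \xe2\x80\x93 10\xe2\x82\xac";
my $desc = decode('UTF-8', $raw);      # bytes -> characters

my $bytes = to_latin1_lossy($desc);    # characters -> Latin-1 bytes

# Show the result as hex so the output does not depend on the terminal.
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

The e-acute survives as the single byte 0xE9, while the en dash and the euro sign are dropped because Latin-1 has no code for them.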
Finally, if you want to preserve the euro sign (€), please note that you cannot do that with latin-1 because it is not part of that encoding. You will have to use a different encoding, such as ISO-8859-15.
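The difference between the two encodings can be checked with a short sketch like the following, which encodes the same sample string both ways using the core Encode module. The variable names and the sample price are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample input: "10" followed by the euro sign (U+20AC).
my $price = "10\x{20ac}";

# ISO-8859-1 has no euro sign, so the fallback removes it.
my $latin1 = encode('iso-8859-1', $price, sub { '' });

# ISO-8859-15 maps the euro sign to the single byte 0xA4.
my $latin9 = encode('iso-8859-15', $price, sub { '' });

printf "iso-8859-1:  %d bytes\n", length $latin1;
printf "iso-8859-15: %d bytes, last byte 0x%02X\n",
    length $latin9, ord substr($latin9, -1);
```

Whether the target column interprets those bytes as ISO-8859-15 is a separate question that depends on the database configuration.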
Upvotes: 2
Reputation: 8706
The content sent by the web service is XML that contains HTML in the Description tag. If that is the content you are worried about, an alternative to deleting non-Latin-1 characters is to encode them as HTML numeric character references:
$desc =~ s/([^\x00-\x7f])/sprintf("&#%d;", ord $1)/ge;
Here is an example:
$ echo 'é' | perl -C -pE 's/([^\x00-\x7f])/sprintf("&#%d;", ord $1)/ge'
&#233;
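The same idea as a reusable function might look like the sketch below. The name `to_numeric_entities` is made up for this example, and the output uses the decimal form `&#NNN;` of HTML numeric character references. It assumes the input is already a decoded character string; on raw UTF-8 bytes it would encode each byte separately.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with a
# decimal HTML numeric character reference such as &#233;.
sub to_numeric_entities {
    my ($text) = @_;
    (my $out = $text) =~ s/([^\x00-\x7f])/sprintf('&#%d;', ord $1)/ge;
    return $out;
}

# Sample input: "caf" + e-acute (U+00E9), then "10" + euro sign (U+20AC).
print to_numeric_entities("caf\x{e9} 10\x{20ac}"), "\n";
# prints: caf&#233; 10&#8364;
```

Because the result is pure ASCII, it can be stored in a latin1 column unchanged, and a browser rendering the HTML will display the original characters.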
Upvotes: 0