kandan

Reputation: 747

Perl: remove invalid characters (characters not valid in latin1) from a string

I have a Perl script that reads from a web service and saves the data in a MySQL table. This table uses latin1. The web service returns some characters that cannot be stored in latin1, and I need to remove them before saving to the database; otherwise they are saved as '?'.

I wanted to do something similar to: $desc=~s///gsi;

but it does not remove them.

The web service that returns the wrong characters is: https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478

I am using a user agent to get the data. It seems to arrive as UTF-8, but the characters need to be removed:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new ();

$ua->default_headers->push_header ('Accept' =>
                   "text/html,application/xhtml" .
                   "+xml,application/xml");
$ua->default_headers->push_header ('Accept-Charset' => "utf-8");

my $doc = $ua->get ("https://jobvacancies.services.businesslink.gov.uk:8443/vacancy/26653478");

Upvotes: 0

Views: 2024

Answers (3)

Rick James

Reputation: 142316

Change your column definition to CHARACTER SET utf8mb4 so that the naughty character does not need to be removed, and can actually be stored.

Upvotes: 0

redneb

Reputation: 23870

If you just want to remove the characters outside the 7-bit ASCII set (which is sufficient to display messages in English), you can do this:

$desc=~s/[^\x00-\x7f]//g

Edit: If you want something more elaborate that supports the entire latin-1 set, you can do this:

use Encode;

$desc=encode('latin-1',$desc,sub {''});

This removes exactly the characters that cannot be represented in Latin-1. Note that this line expects $desc to be a decoded character string (not raw UTF-8 bytes), and that the resulting string is a byte string with the UTF-8 flag off.

Finally, if you want to preserve the euro sign (€), please note that you cannot do that with latin-1 because it is not part of that encoding. You will have to use a different encoding, such as ISO-8859-15.
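A minimal sketch of the whole round trip, assuming the input arrives as raw UTF-8 bytes that still need decoding. The helper name strip_non_latin1 and the sample string are hypothetical and only serve as illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: drop every character that Latin-1 cannot represent.
sub strip_non_latin1 {
    my ($bytes) = @_;
    # Turn the raw UTF-8 bytes into a Perl character string first.
    my $chars = decode('UTF-8', $bytes);
    # The coderef is called for each character that cannot be encoded;
    # returning '' removes that character from the output.
    return encode('iso-8859-1', $chars, sub { '' });
}

# UTF-8 bytes for: cafe (with e-acute), a euro sign, and curly quotes.
my $input  = "caf\x{C3}\x{A9} \x{E2}\x{82}\x{AC}5 \x{E2}\x{80}\x{9C}ok\x{E2}\x{80}\x{9D}";
my $latin1 = strip_non_latin1($input);
# The e-acute survives as the single byte 0xE9; the euro sign and the
# curly quotes are removed.
```

If the string has already been decoded (for example by the HTTP client), the decode step should be skipped, since decoding twice would corrupt the text.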

Upvotes: 2

dolmen

Reputation: 8706

The content sent by the web service is XML that contains HTML in the Description tag. If that is the content you are concerned about, an alternative to deleting non-Latin-1 characters is to encode them as HTML numeric character references:

$desc =~ s/([^\x00-\x7f])/sprintf("&#%d;", ord $1)/ge

Here is an example:

$ echo 'é' | perl -C -pE 's/([^\x00-\x7f])/sprintf("&#%d;", ord $1)/ge'
&#233;
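As a variation, here is a small sketch that leaves Latin-1 characters untouched and escapes only what lies above U+00FF. It assumes the input is already a decoded character string; the helper name escape_non_latin1 is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: keep Latin-1 characters as they are and turn every
# character above U+00FF into a decimal numeric character reference.
sub escape_non_latin1 {
    my ($chars) = @_;    # assumed to be a decoded character string
    $chars =~ s/([^\x00-\xFF])/sprintf("&#%d;", ord $1)/ge;
    return $chars;
}

# e-acute (U+00E9) is kept; the euro sign (U+20AC) becomes a reference.
my $escaped = escape_non_latin1("caf\x{E9} \x{20AC}5");
```

The result is still a character string, so it would need to be encoded to Latin-1 before being written to the table. This only makes sense where the stored value is later rendered as HTML.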

Upvotes: 0
