lmocsi

Reputation: 1086

Parsing utf-8 json with Perl

I'm trying to parse a UTF-8 JSON file in Perl. https://jsonlint.com/ says the JSON is valid, but I still get this error message:

malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}{"...") at parse.pl line 15.

The code is:

use strict;
use utf8;
use JSON qw( );

my $filename = 'k2.json';

my $json_text = do {
   open(my $json_fh, $filename) or die("Can't open $filename: $!\n");
   local $/;
   <$json_fh>
};

my $json = JSON->new;
my $data = $json->decode($json_text);

for ( @{$data->{data}} ) {
   print $_->{lng}."\n";
}

The UTF-8 encoded JSON is:

{"data":
[{"lng":"19.03252602",
"lat":"47.49795914",
"display_name":"I. kerület (Attila út)",
"active":"1",
"url":"/hu/kormanyablakok/budapest/i-kerulet/i-kerulet-attila-ut/283"
}]
}

I see that (ef, bb, bf) are the three bytes that indicate that it's a UTF-8 document, so I don't understand what the JSON package is missing here. How can I make it work?
Specifying "<:encoding(UTF-8)" when opening the file did not help either...

Upvotes: 0

Views: 1255

Answers (2)

ikegami

Reputation: 385867

use strict;
use warnings qw( all );
use utf8;
use open ':std', ':encoding(UTF-8)';
use feature qw( say );

use JSON qw( );

my $filename = 'k2.json';

my $json_text = do {
   open(my $json_fh, '<', $filename)
      or die("Can't open $filename: $!\n");

   local $/;
   <$json_fh>
};

$json_text =~ s/^\N{BOM}//;

my $data = JSON->new->decode($json_text);

say $_->{lng} for @{ $data->{data} };

or

use strict;
use warnings qw( all );
use utf8;
use open ':std', ':encoding(UTF-8)';
use feature qw( say );

use File::BOM qw( open_bom );
use JSON      qw( );

my $filename = 'k2.json';

my $json_text = do {
   open_bom(my $json_fh, $filename, ':encoding(UTF-8)')
      or die("Can't open $filename: $!\n");

   local $/;
   <$json_fh>
};

my $data = JSON->new->decode($json_text);

say $_->{lng} for @{ $data->{data} };

Notes:

  • use open ':std', ':encoding(UTF-8)'; causes output printed to STDOUT to be encoded as UTF-8. This is required to print the display_name in your example.

    It also sets the default encoding that's used to decode the JSON file in the first snippet.

  • I left in use utf8;, but it doesn't do anything since the source code is entirely ASCII.
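
To illustrate why the first snippet matches a single \N{BOM} character rather than three separate bytes, here is a minimal sketch. It assumes an in-memory byte string as a stand-in for k2.json, and it uses the core JSON::PP module in place of JSON so that it runs without extra installs; both are assumptions for illustration, not part of the code above.

```perl
use strict;
use warnings;
use Encode   qw( decode );
use JSON::PP qw( );

# Hypothetical stand-in for the raw bytes of k2.json: UTF-8 BOM + JSON.
my $bytes = "\xEF\xBB\xBF" . '{"data":[{"lng":"19.03252602"}]}';

# Roughly what reading through an :encoding(UTF-8) layer does:
# the bytes EF BB BF are decoded into the single character U+FEFF.
my $json_text = decode('UTF-8', $bytes);

# Strip that one character; s/// returns true if it removed something.
my $had_bom = $json_text =~ s/^\x{FEFF}//;

# Without ->utf8, decode() expects already-decoded text.
my $data = JSON::PP->new->decode($json_text);
my $lng  = $data->{data}[0]{lng};

print "BOM removed: ", ($had_bom ? "yes" : "no"), "\n";
print "$lng\n";
```

Decoding alone does not drop the mark, which would explain why adding the encoding layer by itself did not make the error go away.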

Upvotes: 1

mob

Reputation: 118605

JSON does not expect input to have the byte order mark. Strip it before you run the JSON decoder.

$json_text =~ s/^[^\x00-\x7f]+//;
my $data = $json->decode($json_text);

The byte-order mark was not pasted to JSONlint, so JSONlint was not evaluating the same document that you have.
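
A narrower variant, as a rough sketch: the pattern above removes any run of leading non-ASCII bytes, while the one below removes only the exact UTF-8 BOM sequence. It assumes the file was read as raw bytes (no encoding layer), uses an in-memory string in place of the real file, and uses the core JSON::PP module rather than JSON. The helper name strip_utf8_bom is hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw( decode_json );

# Hypothetical helper: remove a leading UTF-8 BOM from raw bytes, if present.
sub strip_utf8_bom {
   my ($bytes) = @_;
   $bytes =~ s/^\xEF\xBB\xBF//;
   return $bytes;
}

# Stand-in for the undecoded contents of the file: BOM + JSON.
my $raw = "\xEF\xBB\xBF" . '{"data":[{"lng":"19.03252602"}]}';

# With the BOM still in place, the decoder rejects the input.
my $parsed_with_bom = eval { decode_json($raw); 1 } ? 1 : 0;

# decode_json expects UTF-8 encoded bytes, so no separate decoding step is needed.
my $data = decode_json(strip_utf8_bom($raw));

print "parsed with BOM: ", ($parsed_with_bom ? "yes" : "no"), "\n";
print $data->{data}[0]{lng}, "\n";
```

Input that has no BOM passes through the helper unchanged, so it can be applied unconditionally.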

Upvotes: 2
