manub

Reputation: 53

Issue matching accented characters with Perl

This code:

perl -pe 's/^(\D\w+ \w+)( word )/\1;word/gi'

doesn't work when the input has words with accented or other special characters such as á or Ș.

Clarifications:

I use this code to count the files per artist.

find /PATH/ -type f -exec basename "{}" + 2>/dev/null |

perl -pe 's/ - .*//g' | LC_ALL=C  sort -f | uniq -c -i|

gsed -e 's/$/;/'|

awk '{numero=$1;$1=""}{print $0,numero}'|

perl -pe 's/^(\D\w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( & )/\1;&/g' |

perl -pe 's/^(\D\w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Con )/\1;Con/gi'|

perl -pe 's/^(\D\w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Și )/\1;Și/gi'| > /PATH/File.txt

I have these files:

Betty Curtis & Orchestra - Song Title
Betty Curtis Con Johnny Dorelli - Song Title
Betty Curtis - Song Title
Margareta Pâslaru - Song Title
Margareta Pâslaru & Grup - Song Title
Margareta Pâslaru Și Sincron - Song Title
Matilde Sánchez - Song Title
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán - Song Title

The desired output would be:

Betty Curtis; 3
Margareta Pâslaru; 3
Matilde Sánchez; 2

The output I get instead is:

Betty Curtis; 3
Margareta Pâslaru; 1
Margareta Pâslaru & Grup; 1
Margareta Pâslaru Și Sincron; 1
Matilde Sánchez; 1
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán; 1

Indeed, the code is very complicated (the entire script is nineteen lines long...). The rule is to truncate the name if there are conjunctions or parentheses, except if the name is composed of a single word. If there are no conjunctions or parentheses, the name is saved in full.

e.g. “Gervis Quebodeaux Rayne Serenaders” remains “Gervis Quebodeaux Rayne Serenaders;”

I'd like to compact the "perl -pe" section: repeating (\D\w+ \w+), (\D\w+ \w+ \w+), etc. is tedious, but I do not know how to do it.

I had to find a balance between summarizing enough to make the count and keeping as much information as possible.

At the moment I have 30 cases (rules): in addition to “&” I have “ With ”, “ Con ”, “ e ”, “ Y ”, “ Et ”, “ Und ”, etc., in many languages of the world.

The script works fine, except for names that contain accented or other special letters.

The script works like this:

For example, I have many files of Duke Ellington, with many different historical headers.

Duke Ellington: 2 files
Duke Ellington & Cotton Club O.: 3
Duke Ellington & His Famous O.: 7
Duke Ellington & His Famous O.;(Ft. Ben Webster): 4
Duke Ellington & His Famous O.;(Ft. Johnny Hodges): 3
Duke Ellington & His O.: 129 
Duke Ellington & His O. (ft. Ben Webster): 14
Duke Ellington & His O. (Ft. Johnny Hodges): 8
Duke Ellington & His O. (pn.): 2
Duke Ellington &His O. (v. Al Hibble): 1
Duke Ellington &His O. (v. Al Hibbler): 1
Duke Ellington &His O. (v. Herb Jeffries): 9
Duke Ellington &His O. (v. Ozzie Bailey): 1
Duke Ellington &His O. (v. Ozzie Bailey, Ray Nance Vln.): 1
Duke Ellington &His O.;(v. Ray Nance?): 1
Duke Ellington &His O.;(v.M): 1
Duke Ellington (Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (v. Dick Robertson): 1
Duke Ellington w Count Basie: 3
Duke Ellington w Gerald Wilson: 13
Duke Ellington’s Spacemen: 1
Duke Ellington’s Washingtonians: 1

The script first turns this into the following file:

Duke Ellington; 2
Duke Ellington;&Cotton Club O.; 3
Duke Ellington;&His Famous O.; 7
Duke Ellington;&His Famous O.;(Ft. Ben Webster); 4
Duke Ellington;&His Famous O.;(Ft. Johnny Hodges); 3
Duke Ellington;&His O.; 129
Duke Ellington;&His O.;(ft. Ben Webster); 14
Duke Ellington;&His O.;(Ft. Johnny Hodges); 8
Duke Ellington;&His O.;(pn.); 2
Duke Ellington;&His O.;(v. Al Hibble); 1
Duke Ellington;&His O.;(v. Al Hibbler); 1
Duke Ellington;&His O.;(v. Herb Jeffries); 9
Duke Ellington;&His O.;(v. Ozzie Bailey); 1
Duke Ellington;&His O.;(v. Ozzie Bailey, Ray Nance Vln.); 1
Duke Ellington;&His O.;(v. Ray Nance?); 1
Duke Ellington;&His O.;(v.M); 1
Duke Ellington;(Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(v. Dick Robertson); 1
Duke Ellington;w Count Basie; 3
Duke Ellington;w Gerald Wilson; 13
Duke Ellington; Spacemen; 1
Duke Ellington; Washingtonians; 1

The final output is then:

Duke Ellington: 208

Complete code: https://www.sendspace.com/file/dlep9q

Upvotes: 4

Views: 185

Answers (1)

zdim

Reputation: 66873

The shown one-liner doesn't enable any unicode support. You'd want, at least, to set up input/output streams for it, and in a script I'd recommend

use open qw(:std :encoding(UTF-8));

In a one-liner there are switches; see what combination you need in perlrun, under -C. For example

echo "á, Ș." | perl -CASD -wnE'@m = /\w+/g; say for @m'

prints

á
Ș

so the accented characters are understood.

Additionally, you may need \X (instead of \w) to match an extended grapheme cluster.


This post may be relevant, with a comforting first part but scary (and informative) rest.

Literature: perlunitut, perlunifaq, perluniintro (with its Unicode I/O for example), and perlunicode. Have perluniprops handy. There is also a cookbook of sorts, perlunicook (see Standard preamble for starters), and there's Encode.

Note that the regex per se is unicode aware.


The question got edited, with additions of code, example input and its processing, and a link to a complete program. Some clarification on how names are decided is added, for example:

The rule is to truncate the name if there are conjunctions, or parentheses, except if the name is composed of a single word. If there are no conjunctions, or parentheses, the name is saved in full

which means that the truncated name needs to be at least two words long, or the string shouldn't be truncated (as clarified in comments). This almost completely bypasses the very difficult problem of parsing names in natural languages, since the "conjunctions" are meant to be provided.

A demo, using a few conjunctions from that list (from the program linked in the question):

use warnings;
use strict;
use feature 'say';

use utf8;                            # for utf8 characters in this script
use open qw(:std :encoding(UTF-8));  # for standard streams

sub extract_name {
    my ($line) = @_;
    # Rule for extracting the name:
    #   Truncate at $cutoff phrase if there are at least two words before it
    #   (incomplete list of alternations for a demo, from linked program)
    my $cutoff = qr{\s+(?:-|&|And|Con|Și)(?:\s+|\z)};  # with spaces
    my $parens = qr{\s+\(};                            # no space after

    # If there is a cut-off phrase on the line, extract what's before it
    # If that is at least two words long, return it;
    #   otherwise, return the whole line 
    if ( my ($name) = $line =~ /(.*?)(?:$cutoff|$parens)/ ) {
        return $name if split(' ', $name) >= 2;
    }
    return $line;
}

my $file = shift // die "Usage: $0 file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my %name_count;
while (my $line = <$fh>) { 
    chomp $line;
    ++$name_count{ extract_name($line) };
}

say "$_; $name_count{$_}" for sort keys %name_count;

The regex pattern for a "conjunction" (cutoff phrase) is built with the qr operator, which makes it easier to work with. It is simply an alternation (|) of the given conjunctions, here a few picked from the linked program. I separate those that don't need a trailing space into another pattern, here only for parentheses.

It is a good idea to sort reports as they are printed, so I do this even though sort with cmp may produce incorrect results with Unicode; please see this post for how to correctly sort with utf8.

I test this with the input shown in the question, to which I add lines

Johnny & The Hurricanes
An Awesome Band (Unknown)

so as to be able to test the finer points of the criteria for the name. It prints

An Awesome Band; 1
Betty Curtis; 3
Johnny & The Hurricanes; 1
Margareta Pâslaru; 3
Matilde Sánchez; 2

I strongly advise against a "one"-liner for a job of this complexity (I could barely get the above sub to parse and work correctly when packed into a command-line).

If this program needs to work with lines piped into it let me know and I can add that.

Upvotes: 6
