Reputation: 53
This code:
perl -pe 's/^(\D\w+ \w+)( word )/\1;word/gi'
doesn't work when the input has words with accented or particular characters like: á, Ș.
Clarifications:
I use this code to count the files per artist.
find /PATH/ -type f -exec basename "{}" + 2>/dev/null |
perl -pe 's/ - .*//g' | LC_ALL=C sort -f | uniq -c -i|
gsed -e 's/$/;/'|
awk '{numero=$1;$1=""}{print $0,numero}'|
perl -pe 's/^(\D\w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Și )/\1;Și/gi' > /PATH/File.txt
I have these files:
Betty Curtis & Orchestra - Song Title
Betty Curtis Con Johnny Dorelli - Song Title
Betty Curtis - Song Title
Margareta Pâslaru - Song Title
Margareta Pâslaru & Grup - Song Title
Margareta Pâslaru Și Sincron - Song Title
Matilde Sánchez - Song Title
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán - Song Title
The desired output would be:
Betty Curtis; 3
Margareta Pâslaru; 3
Matilde Sánchez; 2
The output I get instead is:
Betty Curtis; 3
Margareta Pâslaru; 1
Margareta Pâslaru & Grup; 1
Margareta Pâslaru Și Sincron; 1
Matilde Sánchez; 1
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán; 1
Indeed, the code is very complicated (the entire script is nineteen lines long...). The rule is to truncate the name if there are conjunctions or parentheses, except if the name is composed of a single word. If there are no conjunctions or parentheses, the name is kept in full,
e.g. “Gervis Quebodeaux Rayne Serenaders” remains “Gervis Quebodeaux Rayne Serenaders”.
I'd like to compact the "perl -pe" section: repeating (\D\w+ \w+), (\D\w+ \w+ \w+), etc. is tedious. But I do not know how to do it.
I had to find a balance between summarizing enough to make the count useful and keeping as much information as possible.
At the moment I have 30 cases (rules): in addition to “ & ” there are “ With ”, “ Con ”, “ e ”, “ Y ”, “ Et ”, “ Und ”, etc., in many languages of the world.
The script works fine, except for names that contain accented or other special letters.
The script works like this:
For example, I have many files of Duke Ellington, with many different historical headers.
Duke Ellington: 2 files
Duke Ellington & Cotton Club O.: 3
Duke Ellington & His Famous O.: 7
Duke Ellington & His Famous O.;(Ft. Ben Webster): 4
Duke Ellington & His Famous O.;(Ft. Johnny Hodges): 3
Duke Ellington & His O.: 129
Duke Ellington & His O. (ft. Ben Webster): 14
Duke Ellington & His O. (Ft. Johnny Hodges): 8
Duke Ellington & His O. (pn.): 2
Duke Ellington &His O. (v. Al Hibble): 1
Duke Ellington &His O. (v. Al Hibbler): 1
Duke Ellington &His O. (v. Herb Jeffries): 9
Duke Ellington &His O. (v. Ozzie Bailey): 1
Duke Ellington &His O. (v. Ozzie Bailey, Ray Nance Vln.): 1
Duke Ellington &His O.;(v. Ray Nance?): 1
Duke Ellington &His O.;(v.M): 1
Duke Ellington (Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (v. Dick Robertson): 1
Duke Ellington w Count Basie: 3
Duke Ellington w Gerald Wilson: 13
Duke Ellington’s Spacemen: 1
Duke Ellington’s Washingtonians: 1
The script turns this into the following file:
Duke Ellington; 2
Duke Ellington;&Cotton Club O.; 3
Duke Ellington;&His Famous O.; 7
Duke Ellington;&His Famous O.;(Ft. Ben Webster); 4
Duke Ellington;&His Famous O.;(Ft. Johnny Hodges); 3
Duke Ellington;&His O.; 129
Duke Ellington;&His O.;(ft. Ben Webster); 14
Duke Ellington;&His O.;(Ft. Johnny Hodges); 8
Duke Ellington;&His O.;(pn.); 2
Duke Ellington;&His O.;(v. Al Hibble); 1
Duke Ellington;&His O.;(v. Al Hibbler); 1
Duke Ellington;&His O.;(v. Herb Jeffries); 9
Duke Ellington;&His O.;(v. Ozzie Bailey); 1
Duke Ellington;&His O.;(v. Ozzie Bailey, Ray Nance Vln.); 1
Duke Ellington;&His O.;(v. Ray Nance?); 1
Duke Ellington;&His O.;(v.M); 1
Duke Ellington;(Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(v. Dick Robertson); 1
Duke Ellington;w Count Basie; 3
Duke Ellington;w Gerald Wilson; 13
Duke Ellington; Spacemen; 1
Duke Ellington; Washingtonians; 1
And this is the final output:
Duke Ellington: 208
Complete code: https://www.sendspace.com/file/dlep9q
Upvotes: 4
Views: 185
Reputation: 66873
The shown one-liner doesn't enable any unicode support.† You'd want, at least, to set up input/output streams for it, and in a script I'd recommend
use open qw(:std :encoding(UTF-8));
In a one-liner there are switches for this; see which combination you need in perlrun, under -C. For example
echo "á, Ș." | perl -CASD -wnE'@m = /\w+/g; say for @m'
prints
á
Ș
so the accented characters are understood.
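To see why decoding matters for \w, here is a minimal sketch that counts \w+ runs in the same name twice: once as raw UTF-8 octets (what a script reads without a decoding layer) and once as decoded characters. The helper word_count and the sample name are only illustrative; Encode is used here just to simulate undecoded input.

```perl
use strict;
use warnings;
use utf8;                              # the literal below contains 'â'
use Encode qw(encode decode);

# Count the \w+ runs in a string.
sub word_count {
    my ($text) = @_;
    my @words = $text =~ /\w+/g;
    return scalar @words;
}

my $name  = 'Pâslaru';
my $bytes = encode('UTF-8', $name);    # raw octets, as read without a decoding layer
my $chars = decode('UTF-8', $bytes);   # decoded characters

printf "undecoded: %d word(s)\n", word_count($bytes);
printf "decoded:   %d word(s)\n", word_count($chars);
```

On the undecoded octets the two bytes of 'â' are not word characters, so the name splits into two "words"; once decoded it is matched as one.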
Additionally, you may need \X (instead of \w) to match an extended grapheme cluster.
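As a small illustration of what \X buys you, the sketch below compares the code point count with the grapheme count for a name whose 'á' is stored in decomposed form (base letter plus combining accent). The helper grapheme_count is hypothetical, and whether your file names arrive decomposed depends on the file system, so treat this as an assumption to check.

```perl
use strict;
use warnings;

# Count extended grapheme clusters (user-perceived characters).
sub grapheme_count {
    my ($text) = @_;
    my $count = () = $text =~ /\X/g;
    return $count;
}

# 'á' written as 'a' followed by U+0301 COMBINING ACUTE ACCENT
my $decomposed = "Sa\x{301}nchez";

printf "code points: %d, graphemes: %d\n",
    length($decomposed), grapheme_count($decomposed);
```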
† This post may be relevant, with a comforting first part but scary (and informative) rest.
Literature: perlunitut, perlunifaq, perluniintro (with its Unicode I/O for example), and perlunicode. Have perluniprops handy. There is also a cookbook of sorts, perlunicook (see Standard preamble for starters), and there's Encode.
Note that the regex per se is unicode aware.
The question was edited, adding code, example input and its processing, and a link to a complete program. Some clarification on how names are decided was added, for example:
The rule is to truncate the name if there are conjunctions or parentheses, except if the name is composed of a single word. If there are no conjunctions or parentheses, the name is kept in full
which means that the truncated name needs to be at least two words long, or the string shouldn't be truncated (as clarified in comments). This almost completely bypasses the very difficult problem of parsing names in natural languages, since the "conjunctions" are meant to be provided.
Using a few from that list (from the program linked in the question), for a demo:
use warnings;
use strict;
use feature 'say';
use utf8;                             # for utf8 characters in this script
use open qw(:std :encoding(UTF-8));   # for standard streams

sub extract_name {
    my ($line) = @_;

    # Rule for extracting the name:
    # Truncate at $cutoff phrase if there are at least two words before it
    # (incomplete list of alternations for a demo, from linked program)
    my $cutoff = qr{\s+(?:-|&|And|Con|Și)(?:\s+|\z)};  # with spaces
    my $parens = qr{\s+\(};                            # no space after

    # If there is a cut-off phrase on the line, extract what's before it
    # If that is at least two words long, return it;
    # otherwise, return the whole line
    if ( my ($name) = $line =~ /(.*?)(?:$cutoff|$parens)/ ) {
        return $name if split(' ', $name) >= 2;
    }
    return $line;
}

my $file = shift // die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";

my %name_count;
while (my $line = <$fh>) {
    chomp $line;
    ++$name_count{ extract_name($line) };
}

say "$_; $name_count{$_}" for sort keys %name_count;
The regex pattern for a "conjunction" (cutoff phrase) is formed using the qr operator for easier work. It is simply an alternation (|) of given conjunctions, here a few picked from the linked program. I separate those that don't need a trailing space into another pattern, here only for the opening parenthesis.
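With thirty or so conjunctions, the alternation can be generated from a list rather than typed by hand. The following is a sketch only: build_cutoff and @conjunctions are hypothetical names, and the list holds just a few entries taken from the question.

```perl
use strict;
use warnings;
use utf8;                              # for 'Și' and 'â' in this script
use open qw(:std :encoding(UTF-8));    # for standard streams

# Build the cutoff pattern from a list of conjunctions.
# quotemeta guards against entries that contain regex metacharacters.
sub build_cutoff {
    my @words = @_;
    my $alternation = join '|', map { quotemeta } @words;
    return qr{\s+(?:$alternation)(?:\s+|\z)};
}

# A few conjunctions from the question; extend as needed
my @conjunctions = ('-', '&', 'And', 'With', 'Con', 'Et', 'Und', 'Și');
my $cutoff = build_cutoff(@conjunctions);

my ($name) = 'Margareta Pâslaru Și Sincron' =~ /^(.*?)$cutoff/;
print "$name\n";
```

Adding a rule then means adding one string to the list, instead of another set of perl -pe stages.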
It is a good idea to sort reports as they are printed, so I do this even though sort with cmp may produce incorrect results with Unicode; please see this post for how to correctly sort with utf8.
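One common way to get a language-aware order is the core module Unicode::Collate; this is my assumption about a suitable approach, not necessarily what the linked post recommends. In this sketch the helper sort_names and the sample names are hypothetical.

```perl
use strict;
use warnings;
use utf8;                              # for 'Ș' in this script
use open qw(:std :encoding(UTF-8));    # for standard streams
use Unicode::Collate;

# Sort names with the default Unicode Collation Algorithm ordering.
sub sort_names {
    my @names = @_;
    return Unicode::Collate->new->sort(@names);
}

my @names = ('Tudor', 'Șerban', 'Sandu');

print "cmp:     @{[ sort @names ]}\n";        # plain code point order
print "collate: @{[ sort_names(@names) ]}\n"; # collation order
```

With plain cmp the name starting with 'Ș' lands after 'Tudor', because its code point is higher than any ASCII letter; the collator places it among the names starting with S.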
I tested this with the input shown in the question, to which I added the lines
Johnny & The Hurricanes
An Awesome Band (Unknown)
so as to be able to test the finer points of the criteria for the name. It prints
An Awesome Band; 1
Betty Curtis; 3
Johnny & The Hurricanes; 1
Margareta Pâslaru; 3
Matilde Sánchez; 2
I strongly advise against a "one"-liner for a job of this complexity (I could barely get the above sub to parse and work correctly when packed into a command-line).
If this program needs to work with lines piped into it let me know and I can add that.
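For reference, a piped variant could look like the sketch below, which reads names from standard input so it can sit at the end of the find pipeline. It repeats the same extraction rule; count_names is a hypothetical helper, and reading only when STDIN is not a terminal is a design choice of this sketch.

```perl
use strict;
use warnings;
use utf8;                              # for utf8 characters in this script
use open qw(:std :encoding(UTF-8));    # decode STDIN, encode STDOUT

my $cutoff = qr{\s+(?:-|&|And|Con|Și)(?:\s+|\z)};   # demo list only
my $parens = qr{\s+\(};

# Same rule as above: cut at the first conjunction or parenthesis,
# but only if at least two words precede it.
sub extract_name {
    my ($line) = @_;
    if ( my ($name) = $line =~ /(.*?)(?:$cutoff|$parens)/ ) {
        return $name if split(' ', $name) >= 2;
    }
    return $line;
}

# Count names read from any filehandle; returns a hash reference.
sub count_names {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;      # skip empty lines
        ++$count{ extract_name($line) };
    }
    return \%count;
}

# Read only when input is piped or redirected, not from a terminal
if (!-t STDIN) {
    my $counts = count_names(\*STDIN);
    print "$_; $counts->{$_}\n" for sort keys %$counts;
}
```

It would be used as `find /PATH/ -type f -exec basename "{}" + | perl count_names.pl > /PATH/File.txt`, where the script name is made up for the example.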
Upvotes: 6