Issue matching accented characters with Perl

Question

This code:

perl -pe 's/^(\D\w+ \w+)( word )/\1;word/gi'

does not work when the input contains words with accented or other non-ASCII characters, such as á or Ș.

Clarifications:

I use this code to count the files per artist.

find /PATH/ -type f -exec basename "{}" + 2>/dev/null |

perl -pe 's/ - .*//g' | LC_ALL=C  sort -f | uniq -c -i|

gsed -e 's/$/;/'|

awk '{numero=$1;$1=""}{print $0,numero}'|

perl -pe 's/^(\D\w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( & )/\1;&/g' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( & )/\1;&/g' |

perl -pe 's/^(\D\w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Con )/\1;Con/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Con )/\1;Con/gi' |

perl -pe 's/^(\D\w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+)( Și )/\1;Și/gi' |
perl -pe 's/^(\D\w+ \w+ \w+ \w+ \w+)( Și )/\1;Și/gi' > /PATH/File.txt

I have these files:

Betty Curtis & Orchestra - Song Title
Betty Curtis Con Johnny Dorelli - Song Title
Betty Curtis - Song Title
Margareta Pâslaru - Song Title
Margareta Pâslaru & Grup - Song Title
Margareta Pâslaru Și Sincron - Song Title
Matilde Sánchez - Song Title
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán - Song Title

The desired output would be:

Betty Curtis; 3
Margareta Pâslaru; 3
Matilde Sánchez; 2

The actual output is:

Betty Curtis; 3
Margareta Pâslaru; 1
Margareta Pâslaru & Grup; 1
Margareta Pâslaru Și Sincron; 1
Matilde Sánchez; 1
Matilde Sánchez Con El Mariachi Vargas De Tecalitlán; 1

Admittedly, the code is very complicated (the entire script is nineteen lines long...). The rule is to truncate the name at a conjunction or parenthesis, unless the name consists of a single word. If there are no conjunctions or parentheses, the name is kept in full.

e.g. “Gervis Quebodeaux Rayne Serenaders” remains “Gervis Quebodeaux Rayne Serenaders;”

I'd like to compact the "perl -pe" section: repeating (\D\w+ \w+), (\D\w+ \w+ \w+), etc. is tedious, but I do not know how to do it.

I had to strike a balance between summarizing enough to make the count and keeping as much information as possible.

At the moment I have 30 cases (rules): in addition to “&” I have “ With ”, “ Con ”, “ e ”, “ Y ”, “ Et ”, “ Und ”… etc., in many languages of the world.

The script works fine, except for names that contain accented or other special letters.

The script works like this:

For example, I have many Duke Ellington files, credited under many different historical names.

Duke Ellington: 2 files
Duke Ellington & Cotton Club O.: 3
Duke Ellington & His Famous O.: 7
Duke Ellington & His Famous O.;(Ft. Ben Webster): 4
Duke Ellington & His Famous O.;(Ft. Johnny Hodges): 3
Duke Ellington & His O.: 129 
Duke Ellington & His O. (ft. Ben Webster): 14
Duke Ellington & His O. (Ft. Johnny Hodges): 8
Duke Ellington & His O. (pn.): 2
Duke Ellington &His O. (v. Al Hibble): 1
Duke Ellington &His O. (v. Al Hibbler): 1
Duke Ellington &His O. (v. Herb Jeffries): 9
Duke Ellington &His O. (v. Ozzie Bailey): 1
Duke Ellington &His O. (v. Ozzie Bailey, Ray Nance Vln.): 1
Duke Ellington &His O.;(v. Ray Nance?): 1
Duke Ellington &His O.;(v.M): 1
Duke Ellington (Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)): 1
Duke Ellington (v. Dick Robertson): 1
Duke Ellington w Count Basie: 3
Duke Ellington w Gerald Wilson: 13
Duke Ellington’s Spacemen: 1
Duke Ellington’s Washingtonians: 1

The script turns that list into this file:

Duke Ellington; 2
Duke Ellington;&Cotton Club O.; 3
Duke Ellington;&His Famous O.; 7
Duke Ellington;&His Famous O.;(Ft. Ben Webster); 4
Duke Ellington;&His Famous O.;(Ft. Johnny Hodges); 3
Duke Ellington;&His O.; 129
Duke Ellington;&His O.;(ft. Ben Webster); 14
Duke Ellington;&His O.;(Ft. Johnny Hodges); 8
Duke Ellington;&His O.;(pn.); 2
Duke Ellington;&His O.;(v. Al Hibble); 1
Duke Ellington;&His O.;(v. Al Hibbler); 1
Duke Ellington;&His O.;(v. Herb Jeffries); 9
Duke Ellington;&His O.;(v. Ozzie Bailey); 1
Duke Ellington;&His O.;(v. Ozzie Bailey, Ray Nance Vln.); 1
Duke Ellington;&His O.;(v. Ray Nance?); 1
Duke Ellington;&His O.;(v.M); 1
Duke Ellington;(Ft. Rhythm Boys (2°c Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(Ft. Rhythm Boys (Bing Crosby, Al Rinker, & Harry Barris)); 1
Duke Ellington;(v. Dick Robertson); 1
Duke Ellington;w Count Basie; 3
Duke Ellington;w Gerald Wilson; 13
Duke Ellington; Spacemen; 1
Duke Ellington; Washingtonians; 1

This is the final output:

Duke Ellington: 208

Complete code: https://www.sendspace.com/file/dlep9q