Banzai Otis

Reputation: 63

Grep while excluding special letters like umlauts

I'm running Mint Xfce and attempting to grep from terminal using the following:

grep -E -o '^[A-Za-z]{1,}\s[A-Za-z]{1,}\s[0-9]{1,}' sourcefile.txt | sort -f > newfile.txt

The source file is a text file where each line looks like

<string><space><string><tab><number><tab><number><tab>...

where the strings have letters, numbers, punctuation, and special characters and the numbers are integers.

My goal is to extract the two strings and first number for just the lines where the strings contain only English letters (a-z, upper or lower case).

The above command leaves out strings with punctuation and numbers, but lines where the strings contain special letters such as Ü (U with an umlaut) are somehow getting through and being written to newfile.txt. I feel like I'm missing something obvious, but a ton of Googling only turns up discussions on how to grep for special letters. I've tested the regex at https://regex101.com/ and umlauts don't get matched there, which makes me think the problem isn't with my regex.

Thanks for any help you can provide!

Upvotes: 2

Views: 514

Answers (2)

RARE Kpop Manifesto

Reputation: 2805

You don't need to fiddle with any of the locale settings:

echo '<string1><space><string2><tab><number1><tab><number2><tab>...' |
gawk 'NF = 3*!/[^\0-\177]/' OFS='\n' FS='<(space|tab)>' # the demo delims

mawk 'NF = 3*!/[^\0-\177]/' OFS='\n'                    # the actual delims
<string1>
<string2>
<number1>

Because the test is used as a multiplier, NF is set to 3 if the line is pure ASCII and to 0 if any multi-byte Unicode character is detected. Setting NF to 0 scrubs the entire line, so nothing is printed for it.

If you're pedantic about guarding against random binary bytes that are not valid Unicode while using gawk in Unicode mode, then try:

gawk 'NF = 3*/^[\0-\177]+$/'

But if you insist on using nawk instead of the other two, then add a pair of parentheses:

nawk 'NF = 3*(!/[^\0-\177]/)' OFS='\n'

This circumvents the fatal error arising from its buggy parser, even though the code itself is POSIX-compliant.
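
Note that the one-liners above filter only on non-ASCII bytes, so as far as I can tell a line whose strings contain digits or punctuation would still pass. Here is a minimal sketch of a stricter whitelist variant. It is not the code from this answer: it assumes awk's default field splitting on spaces and tabs, uses made-up sample data, and does rely on LC_ALL=C so that the bracket ranges are byte-based.

```shell
# Hypothetical sample input: one clean line, one with an umlaut, one with punctuation.
sample=$(printf 'Uber Otis\t12\t7\nÜber Otis\t34\t9\nO.Neil Sam\t5\t1')

# Keep a line only if both strings are pure ASCII letters and the third field is an integer.
# LC_ALL=C makes [A-Za-z] a plain byte range, so the bytes of Ü cannot match it.
strict=$(printf '%s\n' "$sample" |
  LC_ALL=C awk '$1 ~ /^[A-Za-z]+$/ && $2 ~ /^[A-Za-z]+$/ && $3 ~ /^[0-9]+$/ { print $1, $2, $3 }')

printf '%s\n' "$strict"
```

Only the first sample line should survive, printed as its two strings and first number separated by single spaces.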

Upvotes: 0

Grzegorz Górkiewicz

Reputation: 4586

You have to temporarily change the locale. Try:

LC_ALL="C" grep -E -o '^[A-Za-z]{1,}\s[A-Za-z]{1,}\s[0-9]{1,}' sourcefile.txt | sort -f > newfile.txt

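It worked for me on Ubuntu. Because LC_ALL is set as a prefix to the grep command, it applies only to that one invocation, so your shell's locale is unchanged and there is nothing to switch back afterwards.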
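
To see the effect in isolation, here is a small sketch with made-up sample lines. It uses [[:space:]] instead of \s purely for portability, which is my own assumption and not something the original command needs.

```shell
# Hypothetical sample input: the second line starts with a non-ASCII letter.
sample=$(printf 'Uber Otis\t12\t7\nÜber Otis\t34\t9')

# In the C locale [A-Za-z] matches only the 52 ASCII letters, so the Ü line is dropped.
ascii_only=$(printf '%s\n' "$sample" |
  LC_ALL=C grep -E -o '^[A-Za-z]+[[:space:]][A-Za-z]+[[:space:]][0-9]+')

printf '%s\n' "$ascii_only"
```

Note that in the full pipeline, sort -f still runs in your normal locale, since the LC_ALL prefix covers only grep.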

Upvotes: 2
