Reputation: 6647
I have many words [MM] in a file.
I ran this command:
cat file.txt | tr " " "\n"| sort | uniq > uniq.out
I found that the output contains many Chinese words, as well as some alphanumeric words and words with special characters.
I want to get only the words made up purely of English letters ([A-Z] and [a-z]).
grep -E "[A-Za-z]" uniq.out | grep -Ev "[0-9]" | less
The above command also matches alpha-numeric words.
Any suggestions?
Thanks!
Upvotes: 0
Views: 1185
Reputation: 80405
Why run four commands when just one alone does the job?
English is written in the Latin script. Therefore this pulls out all the unique Latin-scripted words:
$ perl -CSD -nle '$seen{$1}++ || print $1 while /\b(\p{Latin}+)\b/g' input_file.utf8
But you’ll miss all the words with apostrophes or hyphens in them. Sure you don’t want those, too?
To actually know whether they’re valid words in English requires access to a good dictionary, plus rules for inflexions. Otherwise you’ll get false positives like “xyzzy”, and suchlike.
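If the words with apostrophes or hyphens do matter, one possible way to keep them with plain grep is sketched below. It assumes one word per line, ASCII letters only, and a straight ASCII apostrophe; the file names words.txt and words.out and the sample words are hypothetical. It does not check the words against any dictionary.

```shell
# Hypothetical sample input, one word per line.
printf "don't\nwell-known\nabc123\n-dash\nplain\n你好\n" > words.txt

# Keep lines that are runs of ASCII letters, optionally joined by a single
# apostrophe or hyphen between runs (so "-dash" and "abc123" are rejected).
# LC_ALL=C makes the A-Z and a-z ranges cover ASCII letters only.
LC_ALL=C grep -E "^[A-Za-z]+(['-][A-Za-z]+)*\$" words.txt > words.out
cat words.out
```

With this sample, "don't", "well-known", and "plain" are kept, and the rest are dropped.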
Upvotes: 0
Reputation: 14089
Use
^[A-Za-z]+$
(Your regex only required a line to contain at least one letter in A-Z or a-z to count as a match; anchoring with ^ and $ requires the whole line to consist of letters.)
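A minimal sketch of how the anchored pattern could slot into the original pipeline. The sample contents of file.txt and the output name english.out are assumptions for illustration. Setting LC_ALL=C is also an assumption on my part, so that the bracket ranges mean ASCII letters only regardless of the system locale.

```shell
# Hypothetical sample input; the real file.txt would hold the asker's words.
printf 'hello 你好 abc123 World foo-bar hello\n' > file.txt

# Split on spaces, de-duplicate, then keep only lines made entirely of
# ASCII letters. The ^ and $ anchors force the whole line to match.
tr ' ' '\n' < file.txt | LC_ALL=C sort -u | LC_ALL=C grep -E '^[A-Za-z]+$' > english.out
cat english.out
```

Here only "World" and "hello" survive; "abc123", "foo-bar", and the Chinese word are filtered out.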
Upvotes: 1