codeObserver

Reputation: 6647

Get only english words from a file

I have many words [MM] in a file.

I ran this command:

cat file.txt | tr " " "\n"| sort | uniq  > uniq.out

I found that the output contains many Chinese words, as well as some alphanumeric words and words with special characters.

I want to get only the words that consist purely of English letters ([A-Za-z]).

grep -E "[A-Za-z]" uniq.out | grep -Ev "[0-9]" | less

The above command also matches alpha-numeric words.

Any suggestions?

Thanks!

Upvotes: 0

Views: 1185

Answers (2)

tchrist

Reputation: 80405

Why run four commands when just one alone does the job?

English is written in the Latin script. Therefore this pulls out all the unique Latin-scripted words:

$ perl -CSD -nle '$seen{$1}++ || print $1 while /\b(\p{Latin}+)\b/g' input_file.utf8

But you’ll miss all the words with apostrophes or hyphens in them. Sure you don’t want those, too?
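If you do want those, here is a rough sketch of one way to keep words with internal apostrophes or hyphens, using grep rather than Perl. It assumes ASCII letters only (not the full Latin script that \p{Latin} covers) and the plain ' character rather than the typographic ’. The file name joiners.txt and the function name words_with_joiners are made up for illustration.

```shell
# Hypothetical sample input.
printf '%s\n' "don't well-known -dash trailing- abc x--y" > joiners.txt

# Keep tokens made of ASCII letters, optionally joined by single
# internal ' or - characters (no leading, trailing, or doubled joiners).
# LC_ALL=C keeps [A-Za-z] limited to the 52 ASCII letters.
words_with_joiners() {
    tr ' ' '\n' < "$1" |
        LC_ALL=C grep -E "^[A-Za-z]+(['-][A-Za-z]+)*\$" |
        LC_ALL=C sort -u
}

words_with_joiners joiners.txt
```

Tokens such as -dash, trailing- and x--y are rejected because every joiner has to sit between two runs of letters.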

To actually know whether they’re valid words in English requires access to a good dictionary, plus rules for inflexions. Otherwise you’ll get false positives such as “xyzzy”.
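As a minimal sketch of the dictionary idea, assuming you have a word list with one word per line: the files dict.txt and candidates.txt and the function in_dictionary below are hypothetical stand-ins. This does an exact, case-sensitive lookup only and does nothing about inflexions.

```shell
# Hypothetical tiny dictionary and candidate list, one word per line.
printf '%s\n' hello world cat > dict.txt
printf '%s\n' hello xyzzy world > candidates.txt

# Print the candidates (file $1) that appear verbatim in the dictionary (file $2).
# -F: treat patterns as fixed strings, -x: match whole lines only,
# -f: read the patterns from a file.
in_dictionary() {
    grep -Fxf "$2" "$1"
}

in_dictionary candidates.txt dict.txt
```

Here “xyzzy” is dropped because it is not in the word list, while “hello” and “world” pass through.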

Upvotes: 0

buckley

Reputation: 14089

Use

^[A-Za-z]+$

(Your regex only required the line to contain at least one A-Z or a-z character for it to count as a match.)
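For example, the anchored pattern could be dropped into the original pipeline roughly like this. It is only a sketch: sample.txt stands in for file.txt, english_words is a made-up function name, and LC_ALL=C is my addition so that [A-Za-z] means exactly the ASCII letters regardless of locale.

```shell
# Hypothetical sample input standing in for file.txt.
printf '%s\n' 'hello world abc123 foo-bar 你好 Hello hello' > sample.txt

# Split on spaces, keep only tokens made entirely of ASCII letters,
# then sort and de-duplicate (sort -u replaces sort | uniq).
english_words() {
    tr ' ' '\n' < "$1" |
        LC_ALL=C grep -E '^[A-Za-z]+$' |
        LC_ALL=C sort -u
}

english_words sample.txt
```

With ^ and $ the whole line has to consist of letters, so abc123, foo-bar and 你好 are all filtered out in a single grep.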

Upvotes: 1
