user26732

Reputation: 327

Awk/Sed Solution for English/Chinese Text?

I have a text file with hundreds of lines. Each line is either in English or in Chinese characters, but not both (there are a few exceptions, perhaps fewer than 10, so these are discoverable and manageable). A single line may contain multiple sentences. What I would like to end up with is two files: one in English, the other in Chinese.

The lines tend to alternate languages, but not always. Sometimes there might be two lines in English, followed by one line in Chinese.

Is there a way to use Sed or Awk to divide the languages into two different text files?

Upvotes: 0

Views: 973

Answers (1)

Kent

Reputation: 195179

This one-liner might help:

awk '/[^\x00-\x7f]/{print >"cn.txt";next}{print > "en.txt"}' file

It generates two files, cn.txt and en.txt. It checks whether the line contains at least one non-ASCII character; if it does, the line is treated as a Chinese line and written to cn.txt, otherwise it is written to en.txt.

Little test:

kent$  cat f
this is line1 in english 
你好
this is line2 in english 
你好你好
this is line3 in english 
this is line4 in english 
你好你好你好

kent$  awk '/[^\x00-\x7f]/{print >"cn.txt";next}{print > "en.txt"}' f

kent$  head *.txt
==> cn.txt <==
你好
你好你好
你好你好你好

==> en.txt <==
this is line1 in english 
this is line2 in english 
this is line3 in english 
this is line4 in english
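
The question mentions a handful of lines that mix both languages. With the one-liner above those lines end up in cn.txt, since they contain a non-ASCII character. A possible variation, only a sketch, sends them to a third file for manual review. The names sample.txt and mixed.txt are made up for illustration, and "mixed" is assumed to mean "contains a non-ASCII byte and also an ASCII letter". It uses LC_ALL=C with an octal range instead of \x escapes so that awk compares raw bytes, which should behave the same on awks that do not understand \x.

```shell
# Hypothetical sample input; file names are illustrative, not from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'this is line1 in english' '你好' 'see 你好 for details' > sample.txt

# With LC_ALL=C awk works on bytes, so [^\001-\177] matches any byte >= 0x80,
# which every UTF-8 encoded Chinese character contains.
LC_ALL=C awk '
  /[^\001-\177]/ && /[A-Za-z]/ { print > "mixed.txt"; next }  # both scripts: review by hand
  /[^\001-\177]/               { print > "cn.txt";    next }  # non-ASCII, no ASCII letters
                               { print > "en.txt" }           # pure ASCII
' sample.txt
```

Chinese lines that legitimately contain a Latin abbreviation or product name would also land in mixed.txt, so the second pattern may need adjusting for the actual data.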

Upvotes: 2
