Reputation: 24373
I have two files, wordlist.txt and text.txt.
The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:
你
你们
我
The second file, text.txt, contains long passages, e.g.:
你们要去哪里?
卡拉OK好不好?
I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. The output file from the above should show this:
你
你们
"我" is not found in this list because it is never found in text.txt
.
I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.
I know a simple way in Bash to check each line in wordlist.txt and see if it is in text.txt using grep:
a=1
while read -r line
do
    c=$(grep -c "$line" text.txt)
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=$(expr $a + 1)
done < wordlist.txt
Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:
As the files contain CJK characters, they can be thought of as using a giant alphabet of about 8,000 letters, so nearly every word shares characters with others. E.g.:
我
我们
Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might perhaps check "我" first, and upon finding that it is not present, would skip every subsequent word in wordlist.txt that also contains "我". If there are about 8,000 unique characters found in wordlist.txt, then the script should not need to check so many lines.
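For illustration, here is a rough sketch of that character-prefilter idea (Python is used purely for illustration; the file names are the ones above, and a full substring check still runs for the words that survive the prefilter):

# Sketch of the character-prefilter idea (illustrative only; assumes Python 3 and UTF-8 files).
# A word containing any character that never occurs in text.txt can be rejected immediately.
with open('text.txt', encoding='utf-8') as f:
    text = f.read()

text_chars = set(text)  # every distinct character in the passages

with open('wordlist.txt', encoding='utf-8') as f, \
     open('wordsfound.txt', 'w', encoding='utf-8') as out:
    for line in f:
        word = line.strip()
        if not word:
            continue
        # cheap prefilter: skip words containing a character never seen in the text
        if not set(word) <= text_chars:
            continue
        # full substring check only for the survivors
        if word in text:
            out.write(word + '\n')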
What is the fastest way to create a list containing only those words from the first file that are also found somewhere within the second?
Upvotes: 19
Views: 1344
Reputation: 58510
First TXR Lisp solution (http://www.nongnu.org/txr):
(defvar tg-hash (hash)) ;; tg == "trigraph"

(unless (= (len *args*) 2)
  (put-line `arguments required: <wordfile> <textfile>`)
  (exit nil))

(defvar wordfile [*args* 0])
(defvar textfile [*args* 1])

(mapcar (lambda (line)
          (dotimes (i (len line))
            (push line [tg-hash [line i..(succ i)]])
            (push line [tg-hash [line i..(ssucc i)]])
            (push line [tg-hash [line i..(sssucc i)]])))
        (file-get-lines textfile))

(mapcar (lambda (word)
          (if (< (len word) 4)
            (if [tg-hash word]
              (put-line word))
            (if (find word [tg-hash [word 0..3]]
                      (op search-str @2 @1))
              (put-line word))))
        (file-get-lines wordfile))
The strategy here is to reduce the text corpus to a hash table which is indexed on the individual characters, digraphs and trigraphs occurring in the lines, associating these fragments with the lines. Then, when we process the word list, this reduces the search effort.
Firstly, if the word is short (three characters or less, which is probably common for Chinese words), we can try to get an instant match in the hash table. If there is no match, the word is not in the corpus.
If the word is longer than three characters, we can try to get a match for its first three characters. That gives us a list of lines which contain a match for the trigraph. We can search those lines exhaustively to see which of them contain the word. I suspect that this will greatly reduce the number of lines that have to be searched.
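For readers who do not know TXR, roughly the same trigraph-index strategy can be sketched in Python (this is only an illustration of the idea, not a translation of the code above; names like build_index are made up for the sketch):

# Illustrative sketch of the trigraph-index strategy (not the TXR code above).
from collections import defaultdict

def build_index(text_lines):
    """Map every 1-, 2- and 3-character fragment to the lines that contain it."""
    index = defaultdict(list)
    for line in text_lines:
        for i in range(len(line)):
            for n in (1, 2, 3):
                index[line[i:i+n]].append(line)
    return index

def find_words(words, index):
    for word in words:
        if len(word) < 4:
            if word in index:  # instant hit for short words
                yield word
        else:
            # candidate lines are those sharing the word's first three characters
            if any(word in line for line in index.get(word[:3], ())):
                yield word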
I would need your data, or something representative thereof, to be able to see what the behavior is like.
Sample run:
$ txr words.tl words.txt text.txt
water
fire
earth
the
$ cat words.txt
water
fire
earth
the
it
$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth
(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)
The use of lazy lists means that we do not store the entire list of 300,000 words, for instance. Although we are using the Lisp mapcar function, the list is being generated on the fly, and because we don't keep a reference to the head of the list, it is eligible for garbage collection.
Unfortunately we do have to keep the text corpus in memory because the hash table associates lines.
If that's a problem, the solution could be reversed. Scan all the words, and then process the text corpus lazily, tagging those words which occur. Then eliminate the rest. I will post such a solution also.
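For illustration only (this is not the follow-up solution mentioned above), the reversed approach might look roughly like this in Python: load the word list first, then stream the text line by line and tag words as they are seen, so the full text never has to be held in memory.

# Rough sketch of the reversed idea: words in memory, text streamed line by line.
def tag_words(wordlist_path, text_path, found_path):
    with open(wordlist_path, encoding='utf-8') as f:
        words = {w.strip() for w in f if w.strip()}
    max_len = max(map(len, words), default=0)
    found = set()
    with open(text_path, encoding='utf-8') as f:
        for line in f:  # one line of the corpus at a time
            for i in range(len(line)):
                for n in range(1, max_len + 1):
                    fragment = line[i:i+n]
                    if fragment in words:
                        found.add(fragment)
    with open(found_path, 'w', encoding='utf-8') as out:
        for word in sorted(found):
            out.write(word + '\n')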
Upvotes: 4
Reputation: 11677
I grabbed the text of War and Peace from the Gutenberg project and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change that with:
perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt
On my computer, it takes just over a second to run.
use strict;
use warnings;
use utf8::all;
use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';

GetOptions(
    "wordlist=s" => \$wordlist,
    "text=s"     => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}
And here's my timing:
[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt

real    0m1.081s
user    0m1.076s
sys     0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt
Upvotes: 12
Reputation: 5069
Use parallel processing to speed things up.
1) sort & uniq on wordlist.txt, then split it to several files (X) Do some testing, X is equal with your computer cores.
split -d -l wordlist.txt
2) use xargs -p X -n 1 script.sh x00 > output-x00.txt to process the files in paralel
find ./splitted_files_dir -type f -name "x*" -print| xargs -p 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE
3) cat output* > output.txt concatenate output files
This will speed up the processing enough, and you can keep using tools that you already understand, which eases the maintenance "cost". The script is almost identical to the one you used in the first place.
script.sh
FILE=$1
OUTPUTFILE="output-${FILE}.txt"
TEXTFILE="text.txt"
a=1
while read -r line
do
    c=$(grep -c "$line" "${TEXTFILE}")
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> "${OUTPUTFILE}"
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=$(expr $a + 1)
done < "${FILE}"
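For comparison, the same divide-the-word-list idea can be sketched with Python's multiprocessing module instead of split and xargs (an illustration only; the file names follow the question, and each worker simply does a substring check against the whole text):

# Illustrative sketch: split the word list into chunks and check each chunk in parallel.
from multiprocessing import Pool

def check_chunk(args):
    chunk, text = args
    return [word for word in chunk if word in text]

if __name__ == '__main__':
    with open('text.txt', encoding='utf-8') as f:
        text = f.read()
    with open('wordlist.txt', encoding='utf-8') as f:
        words = sorted({w.strip() for w in f if w.strip()})  # sort & uniq

    n_workers = 4  # roughly the number of cores
    chunk_size = (len(words) + n_workers - 1) // n_workers or 1
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

    with Pool(n_workers) as pool:
        results = pool.map(check_chunk, [(chunk, text) for chunk in chunks])

    with open('output.txt', 'w', encoding='utf-8') as out:
        for chunk_result in results:
            for word in chunk_result:
                out.write(word + '\n')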
Upvotes: 3
Reputation: 7630
This solution is in Perl, maintains your original semantics, and uses the optimization you suggested.
#!/usr/bin/perl

# Sorted, de-duplicated word list
@list = split("\n", `sort < ./wordlist.txt | uniq`);
$size = scalar(@list);
for ($i = 0; $i < $size; ++$i) { $list[$i] = quotemeta($list[$i]); }

# For each word, count how many of the following (sorted) words start with it,
# so they can be skipped when the word itself is absent from a line.
for ($i = 0; $i < $size; ++$i) {
    my $j = $i + 1;
    while ($j < $size and $list[$j] =~ /^$list[$i]/) {
        ++$j;
    }
    $skip[$i] = ($j - $i - 1);
}

open IN, '<', './text.txt' or die "Cannot open text.txt: $!";
@text = (<IN>);
close IN;

foreach $c (@text) {
    for ($i = 0; $i < $size; ++$i) {
        if ($c =~ /$list[$i]/) {
            $found{$list[$i]} = 1;
        }
        else {
            # word not in this line: skip the words that have it as a prefix
            $i += $skip[$i];
        }
    }
}

open OUT, '>', 'wordsfound.txt' or die "Cannot open wordsfound.txt: $!";
while ( my ($key, $value) = each(%found) ) {
    print OUT "$key\n";
}
close OUT;
exit;
Upvotes: 3
Reputation: 27845
I am quite sure this is not the fastest solution, but at least it is a working one (I hope).
This solution needs Ruby 1.9; the text files are expected to be UTF-8.
#encoding: utf-8

#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

new_wordlist = []
$wordlist.each{|word|
  new_wordlist << word if $txt.include?(word)
}

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  f << new_wordlist.join("\n")
}
Can you provide a bigger example to run some benchmarks on the different methods? (Perhaps some test files to download?)
Below is a benchmark with four methods.
#encoding: utf-8
require 'benchmark'

N = 10_000  #Number of test loops

#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

def solution_count
  new_wordlist = []
  $wordlist.each{|word|
    new_wordlist << word if $txt.count(word) > 0
  }
  new_wordlist.sort
end

#Faster than count, it can stop after the first hit
def solution_include
  new_wordlist = []
  $wordlist.each{|word|
    new_wordlist << word if $txt.include?(word)
  }
  new_wordlist.sort
end

def solution_combine()
  #get biggest word size
  max = 0
  $wordlist.each{|word| max = word.size if word.size > max }
  #Build list of all letter combinations from the text
  words_in_txt = []
  0.upto($txt.size){|i|
    1.upto(max){|l|
      words_in_txt << $txt[i,l]
    }
  }
  (words_in_txt & $wordlist).sort
end

#Idea behind:
#- remove string if found.
#- the next comparison is faster, the search text is shorter.
#
#This will not work with overlapping words.
#Example:
#  abcdef contains def.
#  if we check bcd first, the 'd' of def will be deleted, def is not detected.
def solution_gsub
  new_wordlist = []
  txt = $txt.dup  #avoid manipulating the data source for the other methods
  #We must start with the big words.
  #If we start with the small ones, we destroy the long words
  $wordlist.sort_by{|x| x.size }.reverse.each{|word|
    new_wordlist << word if txt.gsub!(word,'')
  }
  #Now we must add words which were already part of longer words
  new_wordlist.dup.each{|neww|
    $wordlist.each{|word|
      new_wordlist << word if word != neww and neww.include?(word)
    }
  }
  new_wordlist.sort
end

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  #~ f << solution_include.join("\n")
  f << solution_combine.join("\n")
}

#Check the different results
if solution_count != solution_include
  puts "Difference solution_count <> solution_include"
end
if solution_gsub != solution_include
  puts "Difference solution_gsub <> solution_include"
end
if solution_combine != solution_include
  puts "Difference solution_combine <> solution_include"
end

#Benchmark the solutions
Benchmark.bmbm(10) {|b|
  b.report('count')   { N.times { solution_count } }
  b.report('include') { N.times { solution_include } }
  b.report('gsub')    { N.times { solution_gsub } }  #wrong results
  b.report('combine') { N.times { solution_combine } }
} #Benchmark
I think the solution_gsub variant is not correct. See the comment in the method definition. If CJK allows this solution, please give me feedback.
That variant is the slowest in my test, but perhaps it will do better with bigger examples. And perhaps it can be tuned a bit.
The combine variant is also very slow, but it would be interesting to see what happens with a bigger example.
Upvotes: 4
Reputation: 411
I would probably use Perl:
use strict;

my @aWordList = ();

open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");
while(my $sWord = <WORDLIST>)
{
    chomp($sWord);
    push(@aWordList, $sWord);
}
close(WORDLIST);

open(TEXT, "< text.txt") || die("Can't open text.txt");
while(my $sText = <TEXT>)
{
    foreach my $sWord (@aWordList)
    {
        if($sText =~ /$sWord/)
        {
            print("$sWord\n");
        }
    }
}
close(TEXT);
This won't be too slow, but if you could let us know the size of the files you're dealing with, I could have a go at writing something much more clever with hash tables.
Upvotes: 4
Reputation: 2615
Try this:
cat wordlist.txt | while read line
do
    if [ "$(grep -wc "$line" text.txt)" -gt 0 ]
    then
        echo "$line"
    fi
done
Whatever you do, if you use grep you must use -w to match a whole word. Otherwise, if you have foo in wordlist.txt and foobar in text.txt, you'll get a wrong match.
If the files are VERY big, and this loop takes too much time to run, you can convert text.txt to a list of words (easy with AWK), and use comm to find the words that are in both lists.
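For illustration, the same convert-and-intersect idea sketched in Python instead of AWK and comm (note that splitting on whitespace and punctuation will not segment CJK text into words, so this really only fits texts that already have word boundaries):

# Illustrative sketch: turn the text into a set of words, then intersect with the word list.
import re

with open('text.txt', encoding='utf-8') as f:
    text_words = set(re.findall(r'\w+', f.read()))

with open('wordlist.txt', encoding='utf-8') as f:
    wordlist = [w.strip() for w in f if w.strip()]

with open('wordsfound.txt', 'w', encoding='utf-8') as out:
    for word in wordlist:
        if word in text_words:  # whole-word match, like comm on two sorted word lists
            out.write(word + '\n')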
Upvotes: 3
Reputation: 39158
Use grep with fixed-strings (-F) semantics; this will be the fastest. Similarly, if you want to write it in Perl, use the index function instead of regex matching.
sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt
I'm surprised that there are already four answers, but no one has posted this yet. People just don't know their toolbox anymore.
Upvotes: 4
Reputation: 58371
This might work for you:
tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' |sort -u | grep -f - wordlist.txt
Basically, create a new word list from text.txt and grep it against the wordlist.txt file.
N.B. You may want to use the software you used to build the original wordlist.txt. In which case all you need is:
yoursoftware < text.txt > newwordlist.txt
grep -f newwordlist.txt wordlist.txt
Upvotes: 5
Reputation: 2159
Just use comm: http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html
comm -12 <(sort wordlist.txt) <(sort text.txt)
Upvotes: 5
Reputation: 302
Simplest way with a bash script:
First preprocess text.txt with "tr" and "sort" to format it as one word per line and remove duplicated lines.
Then do this:
cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done;
That's the list of words you want...
Upvotes: 3
Reputation: 6800
new file newlist.txt
for each word in wordlist.txt:
    check if word is in text.txt (I would use grep, if you're willing to use bash)
    if yes:
        append it to newlist.txt (probably echo word >> newlist.txt)
    if no:
        next word
Upvotes: 3