Village

Reputation: 24373

What is the fastest way to delete lines in a file which have no match in a second file?

I have two files, wordlist.txt and text.txt.

The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:

你
你们
我

The second file, text.txt, contains long passages, e.g.:

你们要去哪里?
卡拉OK好不好?

I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. The output file from the above should show this:

你
你们

"我" is not found in this list because it is never found in text.txt.

I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.

I know a simple way in BASH to check each line in wordlist.txt and see if it is in text.txt using grep:

a=1
while read line
do
    c=`grep -c "$line" text.txt`
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> wordsfound.txt
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=`expr $a + 1`
done < wordlist.txt

Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:

As the files contain CJK characters, they can be thought of as using a giant alphabet of about 8,000 letters, so nearly every word shares characters with other words. E.g.:

我
我们

Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might perhaps check "我" first and, upon finding that it is not present, avoid checking every subsequent word in wordlist.txt that also contains "我". If there are about 8,000 unique characters found in wordlist.txt, then the script should not need to check so many lines.
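
For example, here is a rough (untested) Perl sketch of that kind of character pre-filter, just to show the shape of what I mean; the file names are the ones above, and everything else is only illustrative:

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # assume both files are UTF-8

# Slurp the whole corpus once and record every character that occurs in it.
open my $text_fh, '<', 'text.txt' or die "Cannot open text.txt: $!";
my $text = do { local $/; <$text_fh> };
my %char_in_text = map { $_ => 1 } split //, $text;

open my $word_fh, '<', 'wordlist.txt'   or die "Cannot open wordlist.txt: $!";
open my $out_fh,  '>', 'wordsfound.txt' or die "Cannot open wordsfound.txt: $!";
while ( my $word = <$word_fh> ) {
    chomp $word;
    next unless length $word;
    # Cheap rejection: if any character of the word never occurs in the
    # text, the word itself cannot occur either.
    next if grep { !$char_in_text{$_} } split //, $word;
    # The survivors still need a real substring check.
    print {$out_fh} "$word\n" if index( $text, $word ) >= 0;
}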

What is the fastest way to create the list containing only those words from the first file that are also found somewhere within the second?

Upvotes: 19

Views: 1344

Answers (12)

Kaz

Reputation: 58510

First TXR Lisp solution ( http://www.nongnu.org/txr ):

(defvar tg-hash (hash)) ;; tg == "trigraph"

(unless (= (len *args*) 2)
  (put-line `arguments required: <wordfile> <textfile>`)
  (exit nil))

(defvar wordfile [*args* 0])

(defvar textfile [*args* 1])

(mapcar (lambda (line)
          (dotimes (i (len line))
            (push line [tg-hash [line i..(succ i)]])
            (push line [tg-hash [line i..(ssucc i)]])
            (push line [tg-hash [line i..(sssucc i)]])))
        (file-get-lines textfile))

(mapcar (lambda (word)
          (if (< (len word) 4)
            (if [tg-hash word]
              (put-line word))
            (if (find word [tg-hash [word 0..3]]
                      (op search-str @2 @1))
              (put-line word))))
        (file-get-lines wordfile))

The strategy here is to reduce the text corpus to a hash table which is indexed on the individual characters, digraphs and trigraphs occurring in its lines, associating these fragments with the lines that contain them. Then, when we process the word list, this reduces the search effort.

Firstly, if the word is short (three characters or less, which is probably common among Chinese words), we can try to get an instant match in the hash table. If there is no match, the word is not in the corpus.

If the word is longer than three characters, we can try to get a match for its first three characters. That gives us a list of lines which contain a match for the trigraph. We can search those lines exhaustively to see which of them contain the word. I suspect that this will greatly reduce the number of lines that have to be searched.
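
For readers who don't use TXR, a rough sketch of the same fragment-index idea in Perl (untested, and only meant to show the shape of the data structure, not to replace the solution above):

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

# fragment (1, 2 or 3 characters) => lines of the corpus containing it
my %index;

open my $text_fh, '<', 'text.txt' or die $!;
while ( my $line = <$text_fh> ) {
    chomp $line;
    for my $i ( 0 .. length($line) - 1 ) {
        for my $len ( 1 .. 3 ) {
            push @{ $index{ substr( $line, $i, $len ) } }, $line;
        }
    }
}

open my $word_fh, '<', 'wordlist.txt' or die $!;
while ( my $word = <$word_fh> ) {
    chomp $word;
    if ( length($word) <= 3 ) {
        # short words can be answered straight from the hash
        print "$word\n" if $index{$word};
    }
    else {
        # longer words: only the lines sharing the leading trigraph
        # need to be searched exhaustively
        my $candidates = $index{ substr( $word, 0, 3 ) } || [];
        print "$word\n" if grep { index( $_, $word ) >= 0 } @$candidates;
    }
}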

I would need your data, or something representative thereof, to be able to see what the behavior is like.

Sample run:

$ txr words.tl words.txt text.txt
water
fire
earth
the

$ cat words.txt
water
fire
earth
the
it

$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth

(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)

The use of lazy lists means that we do not store the entire list of 300,000 words, for instance. Although we are using the Lisp mapcar function, the list is being generated on the fly and because we don't keep the reference to the head of the list, it is eligible for garbage collection.

Unfortunately we do have to keep the text corpus in memory because the hash table associates lines.

If that's a problem, the solution could be reversed. Scan all the words, and then process the text corpus lazily, tagging those words which occur. Then eliminate the rest. I will post such a solution also.
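
A rough sketch of that reversed shape (in Perl rather than TXR, and without the fragment index, purely to illustrate the memory trade-off):

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

# Hold only the word list in memory; stream the corpus line by line.
open my $word_fh, '<', 'wordlist.txt' or die $!;
chomp( my @words = <$word_fh> );
my %seen;

open my $text_fh, '<', 'text.txt' or die $!;
while ( my $line = <$text_fh> ) {
    for my $word (@words) {
        next if $seen{$word};                      # already tagged
        $seen{$word} = 1 if index( $line, $word ) >= 0;
    }
}

# Eliminate the words that were never tagged.
print "$_\n" for grep { $seen{$_} } @words;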

Upvotes: 4

Ovid

Reputation: 11677

I grabbed the text of War and Peace from the Gutenberg project and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change the defaults with:

perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt

On my computer, it takes just over a second to run.

use strict;
use warnings;
use utf8::all;

use Getopt::Long;

my $wordlist = '/usr/share/dict/words';
my $text     = 'war_and_peace.txt';

GetOptions(
    "worlist=s" => \$wordlist,
    "text=s"    => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}

And here's my timing:

[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt

real    0m1.081s
user    0m1.076s
sys     0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt

Upvotes: 12

user1126070

Reputation: 5069

Use parallel processing to speed up the processing.

1) sort & uniq on wordlist.txt, then split it to several files (X) Do some testing, X is equal with your computer cores.

 split -d -l <lines-per-chunk> wordlist.txt

2) Use xargs -P X -n 1 script.sh x00 > output-x00.txt to process the files in parallel (note that -P, not -p, is the flag that sets the number of parallel processes).

 find ./splitted_files_dir -type f -name "x*" -print | xargs -P 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE

3) Run cat output* > output.txt to concatenate the output files.

This will speed up the processing enough, and you will be able to use tools that you understand. This will ease the maintenance "cost".

The script is almost identical to the one you used in the first place; each split chunk of the word list is checked against text.txt.

script.sh
FILE=$1
OUTPUTFILE="output-${FILE}.txt"
TEXT="text.txt"
a=1
while read line
do
    c=`grep -c "$line" ${TEXT}`
    if [ "$c" -ge 1 ]
    then
        echo "$line" >> ${OUTPUTFILE}
        echo "Found" $a
    else
        echo "Not found" $a
    fi
    a=`expr $a + 1`
done < ${FILE}

Upvotes: 3

pizza

Reputation: 7630

This solution is in Perl, maintains your original semantics and uses the optimization you suggested.

#!/usr/bin/perl
# Sort and de-duplicate the word list.
@list=split("\n",`sort < ./wordlist.txt | uniq`);
$size=scalar(@list);
for ($i=0;$i<$size;++$i) { $list[$i]=quotemeta($list[$i]); }
# For each word, count how many of the following (sorted) words start with
# it, so they can all be skipped when the word itself is not found in a line.
for ($i=0;$i<$size;++$i) {
    my $j = $i+1;
    while ($j < $size and $list[$j]=~/^$list[$i]/) {
            ++$j;
    }
    $skip[$i]=($j-$i-1);
}
open IN,"<./text.txt" or die;
@text = (<IN>);
close IN;
foreach $c (@text) {
    for ($i=0;$i<$size;++$i) {
            next if $found{$list[$i]};   # this word is already known to occur
            if ($c=~/$list[$i]/) {
                    $found{$list[$i]}=1;
            }
            else {
                    # the prefix is absent from this line, so every word
                    # that extends it can be skipped for this line too
                    $i+=$skip[$i];
            }
    }
}
open OUT,">wordsfound.txt" or die;
while ( my ($key, $value) = each(%found) ) {
        print OUT "$key\n";
}
close OUT;
exit;

Upvotes: 3

knut

Reputation: 27845

I'm quite sure this is not the fastest solution, but at least it is a working one (I hope).

This solution needs Ruby 1.9; the text files are expected to be UTF-8.

#encoding: utf-8
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

new_wordlist = []
$wordlist.each{|word|
  new_wordlist << word if $txt.include?(word)
}

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  f << new_wordlist.join("\n")
}

Can you provide a bigger example so the different methods can be benchmarked? (Perhaps some test files to download?)

Below is a benchmark of four methods.

#encoding: utf-8
require 'benchmark'
N = 10_000 #Number of Test loops

#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

def solution_count
    new_wordlist = []
    $wordlist.each{|word|
      new_wordlist << word if $txt.count(word) > 0
    }
    new_wordlist.sort
end

#Faster than count, it can stop after the first hit
def solution_include
    new_wordlist = []
    $wordlist.each{|word|
      new_wordlist << word if $txt.include?(word)
    }
    new_wordlist.sort
end
def solution_combine()
    #get biggest word size
    max = 0
    $wordlist.each{|word| max = word.size if word.size > max }
    #Build list of all letter combination from text
    words_in_txt = []
    0.upto($txt.size){|i|
      1.upto(max){|l|
        words_in_txt << $txt[i,l]
      }
    }
    (words_in_txt & $wordlist).sort
end
#Idea behind:
#- remove string if found.
#- the next comparison is faster, the search text is shorter.
#
#This will not work with overlapping words.
#Example:
#  abcdef contains def.
#  if we check bcd first, the 'd' of def will be deleted, def is not detected.
def solution_gsub
    new_wordlist = []
    txt = $txt.dup  #avoid manipulating the data source for the other methods
    #We must start with the big words.
    #If we start with the small ones, we destroy long words
    $wordlist.sort_by{|x| x.size }.reverse.each{|word|
      new_wordlist << word if txt.gsub!(word,'')
    }
    #Now we must add words which were already part of longer words
    new_wordlist.dup.each{|neww|
      $wordlist.each{|word|          
        new_wordlist << word if word != neww and neww.include?(word)
      }
    }
    new_wordlist.sort
end

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  #~ f << solution_include.join("\n")
  f << solution_combine.join("\n")
}

#Check the different results
if solution_count != solution_include
  puts "Difference solution_count <> solution_include"
end
if solution_gsub != solution_include
  puts "Difference solution_gsub <> solution_include"
end
if solution_combine != solution_include
  puts "Difference solution_combine <> solution_include"
end

#Benchmark the solution
Benchmark.bmbm(10) {|b|

  b.report('count') { N.times { solution_count } }
  b.report('include') { N.times { solution_include } }
  b.report('gsub') { N.times { solution_gsub } } #wrong results
  b.report('combine') { N.times { solution_combine } }

} #Benchmark

I think the solution_gsub variant is not correct. See the comment in the method definition. If CJK allows this approach, please give me feedback. That variant is the slowest in my test, but perhaps it will improve with bigger examples. And perhaps it can be tuned a bit.

The combine variant is also very slow, but it would be interesting to see what happens with a bigger example.

Upvotes: 4

swampf0etus

Reputation: 411

I would probably use Perl:

use strict;

my @aWordList = ();

open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");

while(my $sWord = <WORDLIST>)
{
   chomp($sWord);
   push(@aWordList, $sWord);
}

close(WORDLIST);

open(TEXT, "< text.txt") || die("Can't open text.txt");

while(my $sText = <TEXT>)
{
   foreach my $sWord (@aWordList)
   {
      if($sText =~ /$sWord/)
      {
          print("$sWord\n");
      }
   }
}


close(TEXT);

This won't be too slow, but if you could let us know the size of the files you're dealing with, I could have a go at writing something much more clever with hash tables.

Upvotes: 4

Eran Ben-Natan

Reputation: 2615

Try this:

cat wordlist.txt | while read line
do
    if [ `grep -wc "$line" text.txt` -gt 0 ]
    then
        echo "$line"
    fi
done

Whatever you do, if you use grep you must use -w to match a whole word. Otherwise, if you have foo in wordlist.txt and foobar in text.txt, you'll get a wrong match.

If the files are VERY big, and this loop takes too much time to run, you can convert text.txt to a list of words (easy with AWK), and use comm to find the words that are in both lists (a rough sketch of that idea is below).
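
The same idea, sketched in Perl rather than AWK and comm (build the set of whole words occurring in text.txt, then keep the wordlist entries present in that set; this is essentially what the hash-based Perl answer above does, and it only covers whole-word matches):

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

# Build the set of words that occur anywhere in the text.
open my $text_fh, '<', 'text.txt' or die $!;
my %in_text;
while ( my $line = <$text_fh> ) {
    $in_text{$_} = 1 for grep { length } split /[[:punct:][:space:]]+/, $line;
}

# Keep only the word-list entries present in that set.
open my $word_fh, '<', 'wordlist.txt' or die $!;
while ( my $word = <$word_fh> ) {
    chomp $word;
    print "$word\n" if $in_text{$word};
}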

Upvotes: 3

daxim

Reputation: 39158

Use grep with fixed-strings (-F) semantics; this will be fastest. Similarly, if you want to write it in Perl, use the index function instead of a regex (a short sketch follows the commands below).

sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt
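
A minimal sketch of that index-based Perl variant (file names as in the question; an untested illustration, not a drop-in replacement for the grep pipeline above):

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';

# Slurp the corpus once; index() does a plain substring search
# rather than invoking the regex engine.
open my $text_fh, '<', 'text.txt' or die $!;
my $text = do { local $/; <$text_fh> };

open my $word_fh, '<', 'wordlist.txt' or die $!;
while ( my $word = <$word_fh> ) {
    chomp $word;
    print "$word\n" if index( $text, $word ) >= 0;
}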

I'm surprised that there are already four answers, but no one has posted this yet. People just don't know their toolbox anymore.

Upvotes: 4

potong

Reputation: 58371

This might work for you:

 tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' | sort -u | grep -f - wordlist.txt

Basically, this creates a new word list from text.txt and greps it against the wordlist.txt file.

N.B. You may want to use the software you used to build the original wordlist.txt, in which case all you need is:

yoursoftware < text.txt > newwordlist.txt
grep -f newwordlist.txt wordlist.txt 

Upvotes: 5

moodywoody

Reputation: 2159

Just use comm:

http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html

comm -12 wordlist.txt text.txt

(Both files need to be sorted first; -12 keeps only the lines common to both.)

Upvotes: 5

Spike

Reputation: 302

Simplest way with bash script:

  1. Preprocess text.txt first with "tr" and "sort" to format it to one word per line and remove duplicated lines.

  2. Do this:

cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done;

That's the list of words you want...

Upvotes: 3

FrankieTheKneeMan

Reputation: 6800

new file newlist.txt
for each word in wordlist.txt:
    check if word is in text.txt (I would use grep, if you're willing to use bash)
    if yes:
        append it to newlist.txt (probably echo word >> newlist.txt)
    if no:
        next word

Upvotes: 3
