Reputation: 277
x.txt,simple text file(around 1 MB) y.txt dictionary file(around 1Lakh words).
Need to find whether any of the word/s in y.txt is present in x.txt.
Need an algorithm which consumes less time for execution and language preferred for the same.
PS: Please suggest any algorithm apart from BRUTE FORCE METHOD.
I need pattern matching rather than exact string matching.
For instance :
x.txt : "The Old Buzzards were disestablished on 27 April"
Y.txt : "establish"
Output should be : Found establish in X.txt : Line 1
Thank you.
Upvotes: 1
Views: 188
Reputation: 1855
It is not clear to me whether you need this to get a job done or it is home work. If you need it to get a job done then:
#!/usr/bin/bash
Y=`cat y.txt | tr '\n' '|'`
echo "${Y%|}"
grep -E "${Y%|}" x.txt
if [ "$?" -eq 0 ]
then
echo "found"
else
echo "no luck"
fi
is hard to beat as you slurp in all the patterns from a file, construct a regular expression (the echo shows the regex) and then hand it to grep
which constructs a finite state automata for you. That is going to fly as it compares every character in the text at most once. If it is homework then I suggest you consult Cormen et al 'Introduction to Algorithms', or the first few chapters of the Dragon Book which will also explain what I just said.
Forgot to add: y.txt should contain your pattern one per line, but as a nice side effect your patterns do not have to be single words.
Upvotes: 2
Reputation: 891
This is a search algorithm I had in mind for a while. Basically the algorithm is in two steps.
In the first step all the words from y.txt are inserted in a tree. Every path in the tree from the root to a leaf is a word. The leaf is empty.
For example, the tree for the words dog and day is the following.
<root>--<d>-<a>-<y>-<>
\-<o>-<g>-<>
The second part of the algorithm is a search down the tree. When you reach an empty leaf then you have found a word.
The implementation in Groovy, if more comments are needed just ask
//create a tree to store the words in a compact and fast to search way
//each path of the tree from root to an empty leaf is a word
def tree = [:]
new File('y.txt').eachLine{ word->
def t=tree
word.each{ c ->
if(!t[c]){
t[c]=[:]
}
t=t[c]
}
t[0]=0//word terminator (the leaf)
}
println tree//for debug purpose
//search for the words in x.txt
new File('x.txt').eachLine{ str, line->
for(int i=0; i<str.length(); i++){
if(tree[str[i]]){
def t=tree[str[i]]
def res=str[i]
def found=false
for(int j=i+1; j<str.length(); j++){
if(t[str[j]]==null){
if(found){
println "Found $res at line $line, col $i"
res=str[j]
found=false
}
break
}else if(t[str[j]][0]==0){
found=true
res+=str[j]
t=t[str[j]]
continue
}else{
t=t[str[j]]
res+=str[j]
}
found=false
}
if(found) println "Found $res at line $line, col $i"//I know, an ugly repetition, it's for words at the end of a line. I will fix this later
}
}
}
this is my y.txt
dog
day
apple
daydream
and x.txt
This is a beautiful day and I'm walking with my dog while eating an apple.
Today it's sunny.
It's a daydream
The output is the following:
$ groovy search.groovy
[d:[o:[g:[0:0]], a:[y:[0:0, d:[r:[e:[a:[m:[0:0]]]]]]]], a:[p:[p:[l:[e:[0:0]]]]]]
Found day at line 1, col 20
Found dog at line 1, col 48
Found apple at line 1, col 68
Found day at line 2, col 2
Found daydream at line 3, col 7
This algorithm should be fast because the depth of the tree doesn't depend on the number of words in y.txt. The depth is equal to the length of the longest word in y.txt.
Upvotes: 0
Reputation: 34282
Suppose, you have any Set implementation in your standard library, here is some pseudo-code:
dictionary = empty set
def populate_dict():
for word in dict_file:
add(dictionary, word)
def validate_text(text_file):
for word in text_file: ### O(|text_file|)
if word in dictionary: ### O(log |dictonary|)
report(word)
populate_dict()
every_now_and_then(populate_dict)
That would give you O(t * log d)
instead of the brute-force O(t * d)
where t
and d
are the lengths of the input text file and dictionary respectively. I don't think that anything faster is possible since you can't read the file faster that O(t)
and can't search faster than O(log d)
.
Upvotes: 0