Reputation: 19
I have two .txt files and I need to compare them and count the words they have in common. All I want is the total number of distinct words that appear in both files. How can I do this? This is the code that I have tried, but I only need the total count, e.g. "I have found 125 occurrences" (excluding repetitions).
Upvotes: 0
Views: 70
Reputation: 14537
You need to use the intersection of the two sets:
words1 = { 'a', 'b', 'c' }
words2 = { 'b', 'c', 'd' }
common_words = words1.intersection(words2) # { 'b', 'c' }
print(len(common_words)) # output: 2
The answer by @Giorgi Imerlishvili shows how to obtain the sets from your files.
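Putting the two steps together, here is a minimal sketch of the whole task. It assumes that a "word" is any run of letters, digits or underscores and that the comparison should ignore case. The function names and the file names in the usage note are placeholders, not taken from the question.

```python
import re


def words_in(text):
    """Return the set of distinct lowercased words in `text`."""
    # Assumption: a word is a run of letters, digits or underscores (\w+).
    return set(re.findall(r"\w+", text.lower()))


def count_common_words(text_a, text_b):
    """Count the distinct words that occur in both texts."""
    return len(words_in(text_a) & words_in(text_b))


def count_common_words_in_files(path_a, path_b):
    """Same as count_common_words, but reads the two texts from files."""
    with open(path_a, encoding="utf-8") as file_a, open(path_b, encoding="utf-8") as file_b:
        return count_common_words(file_a.read(), file_b.read())
```

With two files named, say, first.txt and second.txt, calling count_common_words_in_files("first.txt", "second.txt") returns the single number you are after. Because sets drop duplicates, each common word is counted once no matter how often it repeats.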
Upvotes: 0
Reputation: 429
Is Python a requirement for this? If you are on a UNIX machine, you can do this very efficiently with a bit of bash:
comm -12 <(sort verbs.txt) <(sort text1.txt) | uniq | wc -l
The -12 option suppresses the lines that are unique to either file, so only the lines common to both remain. The two sort commands each run in their own subshell (process substitution), because comm expects sorted input. The result is piped into uniq (include this only if you want each common word counted once), and finally wc -l counts the remaining lines. Note that comm compares whole lines, so this assumes each file holds one word per line.
Upvotes: 0
Reputation: 1957
For example, suppose you have the following two files.
verbs.txt
it
hello
world
text.txt
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when
an unknown printer took a galley of type and scrambled it to make
a type specimen book. It has survived not only five centuries, but
also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset
sheets containing Lorem Ipsum passages, and more recently with
desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
You can use the following script to count how many times each word from verbs.txt occurs in text.txt:
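import re

# Match runs of non-whitespace characters that begin and end on a word boundary
pattern = r'\b\S+\b'
res = {}

# Load the search words into a set, lowercased; split() also skips blank lines
with open("verbs.txt") as vb:
    search_words = {word.lower() for word in vb.read().split()}

with open("text.txt") as text:
    data = text.read()

# Lowercase every word of the text so the comparison is case-insensitive
words = [word.lower() for word in re.findall(pattern, data)]

# Count how many times each search word occurs in the text
for word in words:
    if word in search_words:
        res[word] = res.get(word, 0) + 1

print(res)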
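The script above prints a per-word tally, while the question asks for a single total. As a rough sketch of how to get both figures, the variant below uses collections.Counter with the same pattern. The function name count_matches and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def count_matches(text, search_words):
    """Count how often each word in `search_words` occurs in `text`, ignoring case."""
    wanted = {word.lower() for word in search_words}
    # Same pattern as the script above: non-whitespace runs bounded by word boundaries.
    found = re.findall(r"\b\S+\b", text.lower())
    return Counter(word for word in found if word in wanted)


counts = count_matches("It is here. Hello, it works!", ["it", "hello", "world"])
print(len(counts))           # distinct search words found in the text -> 2
print(sum(counts.values()))  # total occurrences of those words -> 3
```

len(counts) is the figure "excluding repetitions"; sum(counts.values()) counts every occurrence.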
Upvotes: 1