Reputation: 19

Compare 2 txt and get total count excluding repetitions

I have two .txt files and I need to compare them and count the words they have in common. All I want is a single total of how many distinct words appear in both files. How can I do this? Can you help me? This is the code that I have tried, but I only need the total count, e.g. "I have found 125 occurrences" (excluding repetitions).

Upvotes: 0

Answers (3)

Yuri Khristich

Reputation: 14537

You need to use the intersection of the two sets:

words1 = { 'a', 'b', 'c' }
words2 = { 'b', 'c', 'd' }

common_words = words1.intersection(words2) # { 'b', 'c' }

print(len(common_words)) # output: 2

How to obtain the sets from your files is shown in the answer by @Giorgi Imerlishvili.
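Putting the two steps together, a minimal sketch might look like the following. The function names (unique_words, count_common, count_common_in_files) and the tokenizing regex are assumptions made for illustration, not part of either answer; adjust the regex to whatever you consider a "word".

```python
import re
from pathlib import Path

# Assumed tokenizer: runs of word characters, optionally joined by apostrophes
# (so "industry's" stays a single word). This is an illustrative choice.
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def unique_words(text):
    """Return the set of distinct lowercase words in a string."""
    return {w.lower() for w in WORD_RE.findall(text)}


def count_common(text1, text2):
    """Count distinct words that occur in both strings."""
    return len(unique_words(text1) & unique_words(text2))


def count_common_in_files(path1, path2):
    """Same as count_common, but reads the two texts from files."""
    return count_common(
        Path(path1).read_text(encoding="utf-8"),
        Path(path2).read_text(encoding="utf-8"),
    )


# Small in-memory demonstration; "it" and "world" are shared.
demo_total = count_common("It hello world", "it was a World")
print(f"I have found {demo_total} occurrences")
```

Because sets discard duplicates, repeated words are counted once no matter how often they appear in either file. For real files, something like count_common_in_files("verbs.txt", "text.txt") should give the total directly.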

Upvotes: 0

Saverio Guzzo

Reputation: 429

Is Python a requirement for this? If you are on a UNIX machine, you can do this very efficiently with a bit of bash:

comm -12 <(sort verbs.txt) <(sort text1.txt) | uniq | wc -l

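The -12 option suppresses the lines that are unique to either file, so only the lines common to both are printed. The two sort commands each run in their own subshell via process substitution (comm expects sorted input). The result is piped into uniq (include this if you want only unique elements), and finally wc -l counts the remaining lines. Note that comm compares whole lines, so this assumes each file holds one word per line.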

Upvotes: 0

George Imerlishvili

Reputation: 1957

For example, suppose verbs.txt contains:

it
hello
world

and text.txt contains:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when 
an unknown printer took a galley of type and scrambled it to make 
a type specimen book. It has survived not only five centuries, but
also the leap into electronic typesetting, remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset 
sheets containing Lorem Ipsum passages, and more recently with 
desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

You can use the following script to count how many times each word from verbs.txt occurs in text.txt:

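import re

# Match runs of non-whitespace that start and end on a word boundary.
pattern = r'\b\S+\b'

res = {}

# One search word per line; skip blank lines and ignore case.
with open("verbs.txt") as vb:
    search_words = {line.strip().lower() for line in vb if line.strip()}

with open("text.txt") as text:
    words = [word.lower() for word in re.findall(pattern, text.read())]

# Tally every occurrence of a search word found in the text.
for word in words:
    if word in search_words:
        res[word] = res.get(word, 0) + 1

print(res)  # dictionary mapping each matched word to its count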
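The question asks for one total rather than a per-word dictionary. As a hedged sketch, the same tally can be reduced to a single number in two ways: the count of distinct matched words (repetitions excluded) or the count of all matches. The helper name count_matches, the \w+ tokenizer, and the sample sentence are assumptions for illustration only.

```python
import re
from collections import Counter


def count_matches(search_words, text):
    """Count how often each search word occurs in text (case-insensitive)."""
    wanted = {w.strip().lower() for w in search_words if w.strip()}
    tokens = (t.lower() for t in re.findall(r"\w+", text))
    return Counter(t for t in tokens if t in wanted)


# Made-up sample input, standing in for the contents of the two files.
sample_text = "It has survived. It was popularised; a printer scrambled it. Hello!"
counts = count_matches(["it", "hello", "world"], sample_text)

distinct = len(counts)         # matched words, each counted once
total = sum(counts.values())   # every match, repetitions included

print(f"I have found {distinct} occurrences")
```

Here len(counts) is the figure to report when repetitions should be excluded; sum(counts.values()) is the alternative if every repeated match should count.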

Upvotes: 1
