How can I find and count multiple intersections between a list and a text?

Question

I'm currently working on a Python program to count anglicisms in a German text. I'd like to know how many times anglicisms occur in the whole text. For that I have made a list of all anglicisms in the German language, which looks like this:

abchecken
abchillen
abdancen
abdimmen
abfall-container
abflug-terminal

And the list goes on... Then I checked for intersections between this list and the text to be analyzed, but that only gives me the set of words that occur in both, for example: Anglicisms : 4:{'abdancen', 'abchecken', 'terminal'}

I'd really like the program to output how many times those words occur (preferably ordered by frequency), for example:

Anglicisms: abdancen(5), abchecken(2), terminal(1)

This is the code I have so far:

 #counters to zero
 lines, blanklines, sentences, words = 0, 0, 0, 0

 print ('-' * 50)

 while True:
     try:
       #def text file
       filename = input("Please enter filename: ")
       textf = open(filename, 'r')
       break
     except IOError:
       print( 'Cannot open file "%s" ' % filename )

 #reads one line at a time
 for line in textf:
   print( line, )  # test
   lines += 1

   if line.startswith('\n'):
     blanklines += 1
   else:
     #sentence ends with . or ! or ?
    #count these characters
     sentences += line.count('.') + line.count('!') + line.count('?')

     #create a list of words
     #use None to split at any whitespace regardless of length
     tempwords = line.split(None)
     print(tempwords)

     #total words
     words += len(tempwords)

 #anglicisms
     words1 = set(open(filename).read().split())
     words2 = set(open("anglicisms.txt").read().split())

     duplicates  = words1.intersection(words2)


 textf.close()
 print( '-' * 50)
 print( "Lines       : ", lines)
 print( "Blank lines : ", blanklines)
 print( "Sentences   : ", sentences)
 print( "Words       : ", words)
 print( "Anglicisms  :  %d:%s"%(len(duplicates),duplicates))

A second problem I have is that it doesn't count anglicisms that appear inside other words. For example, if "big" is in the list of anglicisms and "bigfoot" is in the text, this occurrence is ignored. How can I fix that?

Kind regards from Switzerland!

Mr. E · Accepted Answer

I would do something like this:

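from collections import Counter

# Load the anglicism list into a set for fast membership tests
with open("anglicisms.txt") as f:
    anglicisms = set(f.read().split())

matches = []
for line in textf:  # textf is the open text file from the question
    matches.extend(word for word in line.split() if word in anglicisms)

# Maps each anglicism found in the text to its number of occurrences
anglicismsInText = Counter(matches)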
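
Since the question asks for output ordered by frequency, here is a minimal sketch of how a Counter can be printed that way with most_common(). The sample set and text are made up for illustration and stand in for anglicisms.txt and the input file.

```python
from collections import Counter

# Hypothetical sample data standing in for anglicisms.txt and the text file
anglicisms = {"abchecken", "abdancen", "terminal"}
text = "abdancen abchecken abdancen terminal abdancen abchecken heute"

counts = Counter(word for word in text.split() if word in anglicisms)

# most_common() returns (word, count) pairs, highest count first
summary = ", ".join(f"{word}({n})" for word, n in counts.most_common())
print("Anglicisms:", summary)
```

Note that this compares words exactly as they appear after splitting on whitespace, so capitalized words or words with attached punctuation would need normalizing first.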

The second question is harder. Taking your example, "big" is an anglicism and "bigfoot" should match, but what about "Abigail" or "overbig"? Should a word match whenever an anglicism appears anywhere in it? Only at the beginning? Only at the end? Once you have decided that, you can build a regex that matches accordingly.
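
To make the three options concrete, here is a rough sketch of what such regexes could look like. The sample list, the pattern names, and the choice to ignore case are all assumptions for illustration; pick whichever policy fits your definition of a match.

```python
import re

anglicisms = ["big", "chillen"]  # hypothetical sample list

# Escape each entry so characters like "-" or "." are matched literally
alternation = "|".join(re.escape(a) for a in anglicisms)

# Anglicism anywhere in the word, only at its start, or only at its end
anywhere = re.compile(rf"\w*(?:{alternation})\w*", re.IGNORECASE)
at_start = re.compile(rf"\b(?:{alternation})\w*", re.IGNORECASE)
at_end = re.compile(rf"\w*(?:{alternation})\b", re.IGNORECASE)

text = "bigfoot Abigail overbig"
print(anywhere.findall(text))  # all three words
print(at_start.findall(text))  # only "bigfoot"
print(at_end.findall(text))    # only "overbig"
```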

Edit: To match words that begin with an anglicism, do:

def derivatesFromAnglicism(word):
    # True if the word starts with any anglicism from the list
    return any(word.startswith(a) for a in anglicisms)

matches.extend([word for word in line.split() if derivatesFromAnglicism(word)])
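
One thing to be aware of: the snippet above collects the matching words themselves, so "bigfoot" and "bigger" would be counted separately. If you would rather count them under the anglicism they derive from, something like the following sketch could work. The sample data is made up, and lowercasing the text is an assumption on my part (German capitalizes nouns, so you may or may not want it).

```python
from collections import Counter

anglicisms = ["big", "abchecken"]  # hypothetical sample list
text = "bigfoot big abchecken Abigail bigger"

counts = Counter()
for word in text.lower().split():
    for a in anglicisms:
        if word.startswith(a):
            counts[a] += 1
            break  # count each word once, under the first anglicism that matches

print(counts.most_common())
```

With a long list this does one prefix test per anglicism per word, so for large texts a compiled regex or a prefix tree may be worth considering.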
