Reputation: 477
I'm currently working on a Python program that counts anglicisms in a German text. I'd like to know how many times each anglicism occurs in the whole text. For that I have made a list of all anglicisms in the German language, which looks like this:
abchecken
abchillen
abdancen
abdimmen
abfall-container
abflug-terminal
And the list goes on...
I then checked for intersections between this list and the text to be analyzed, but that only gives me the set of words that occur in both, for example: Anglicisms : 4:{'abdancen', 'abchecken', 'terminal'}
I'd really like the program to output how many times those words occur (preferably ordered by frequency), for example:
Anglicisms: abdancen(5), abchecken(2), terminal(1)
This is the code I have so far:
#counters to zero
lines, blanklines, sentences, words = 0, 0, 0, 0
print ('-' * 50)
while True:
    try:
        #def text file
        filename = input("Please enter filename: ")
        textf = open(filename, 'r')
        break
    except IOError:
        print( 'Cannot open file "%s" ' % filename )
#reads one line at a time
for line in textf:
    print( line, ) # test
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    else:
        #sentence ends with . or ! or ?
        #count these characters
        sentences += line.count('.') + line.count('!') + line.count('?')
        #create a list of words
        #use None to split at any whitespace regardless of length
        tempwords = line.split(None)
        print(tempwords)
        #total words
        words += len(tempwords)
#anglicisms
words1 = set(open(filename).read().split())
words2 = set(open("anglicisms.txt").read().split())
duplicates = words1.intersection(words2)
textf.close()
print( '-' * 50)
print( "Lines : ", lines)
print( "Blank lines : ", blanklines)
print( "Sentences : ", sentences)
print( "Words : ", words)
print( "Anglicisms : %d:%s"%(len(duplicates),duplicates))
A second problem is that anglicisms contained in longer words are not counted. For example, if "big" is in the list of anglicisms and "bigfoot" is in the text, this occurrence is ignored. How can I fix that?
Kind regards from Switzerland!
Upvotes: 3
Views: 307
Reputation: 2130
I would do something like this:
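from collections import Counter
# a set makes the "word in anglicisms" test fast
anglicisms = set(open("anglicisms.txt").read().split())
matches = []
for line in textf:
    # keep only the words of this line that are in the anglicism list
    matches.extend([word for word in line.split() if word in anglicisms])
# maps each matched anglicism to the number of times it occurred
anglicismsInText = Counter(matches)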
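To get the output format asked for in the question, `Counter.most_common()` returns the pairs already sorted by descending count. Here is a minimal sketch, assuming the anglicisms are already loaded into a set of lowercase words and the text is available as a string. The helper name `count_anglicisms`, the sample text, and the lowercasing and punctuation stripping are my own additions for illustration, not part of the code above:

```python
from collections import Counter

def count_anglicisms(text, anglicisms):
    """Count how often each whole-word anglicism occurs in text."""
    # Assumption: lowercase and strip surrounding punctuation so that
    # "Terminal." still matches the list entry "terminal"
    tokens = (token.strip('.,;:!?"()').lower() for token in text.split())
    return Counter(token for token in tokens if token in anglicisms)

# Hypothetical sample data; in the real program these come from the files
anglicisms = {"abchecken", "abdancen", "terminal"}
text = "Wir abdancen heute. Abdancen macht Spass! Erst abchecken, dann zum Terminal."

counts = count_anglicisms(text, anglicisms)
# most_common() yields (word, count) pairs sorted by descending count
summary = ", ".join("%s(%d)" % (word, n) for word, n in counts.most_common())
print("Anglicisms:", summary)
```

Whether to fold case and strip punctuation depends on how the anglicism list is written, so treat those two steps as optional.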
The second question is a bit harder. Taking your example, "big" is an anglicism and "bigfoot" should match, but what about "Abigail" or "overbig"? Should a word match whenever an anglicism appears anywhere in it? Only at the beginning? Only at the end? Once you know that, you can build a regex that matches it.
Edit: To match strings that begin with an anglicism do:
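def derivatesFromAnglicism(word):
    # True if the word begins with any anglicism from the list
    return any(word.startswith(a) for a in anglicisms)
# then, inside the "for line in textf" loop:
matches.extend([word for word in line.split() if derivatesFromAnglicism(word)])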
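As a possible shortcut, `str.startswith` also accepts a tuple of prefixes, so the explicit loop over the list can be dropped. A small sketch with made-up sample data (the word list and the lowercasing are assumptions for illustration):

```python
from collections import Counter

# Hypothetical sample list; must be a tuple for str.startswith
anglicisms = ("big", "terminal")
words = "bigfoot und Abigail am terminal bigband".split()

# A word is kept if it begins with any of the prefixes in the tuple
derived = [word for word in words if word.lower().startswith(anglicisms)]
print(Counter(derived))
```

Note that "Abigail" is not kept here, because the anglicism has to be at the start of the word.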
Upvotes: 1
Reputation: 64
This solves your first question:
anglicisms = ["a", "b", "c"]
words = ["b", "b", "b", "a", "a", "b", "c", "a", "b", "c", "c", "c", "c"]
# In Python 3, map() returns an iterator, so build a list before sorting
results = list(map(lambda angli: (angli, words.count(angli)), anglicisms))
results.sort(key=lambda p: -p[1])
results looks like this:
[('b', 5), ('c', 5), ('a', 3)]
For your second question, I think the right way is to use regular expressions.
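One way this could look is a single alternation pattern built from the list, counted with `re.finditer`. This is only a sketch under assumptions: the list is non-empty, matching is case-insensitive, and an anglicism counts wherever it appears inside a word. The function name `count_embedded` and the sample text are made up:

```python
import re
from collections import Counter

def count_embedded(text, anglicisms):
    """Count occurrences of each anglicism anywhere inside a word."""
    # Longest first, so a longer entry wins over a shorter one it contains
    ordered = sorted(anglicisms, key=len, reverse=True)
    # re.escape keeps characters such as "-" or "." in entries literal
    pattern = re.compile("|".join(re.escape(a) for a in ordered), re.IGNORECASE)
    return Counter(m.group(0).lower() for m in pattern.finditer(text))

text = "Bigfoot ist big, aber Abigail nicht."
counts = count_embedded(text, ["big", "terminal"])
print(counts.most_common())
```

This also counts the "big" inside "Abigail". If only words that start with an anglicism should count, wrapping the alternation as `r"\b(?:...)"` restricts matches to the beginning of a word.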
Upvotes: 0