Reputation: 285
I have a file where the content looks as follows:
eng word1
eng word2
eng word3
ita word1
ita word2
fra word1
...
I want to count the number of occurrences of each word in every language. For this purpose I want to read the file into a dict. This is my attempt:
data = open('file', 'r', encoding='utf8')
for line in data:
    lang = line[:3]
    ipa_string = line[3:]
    lang_and_string_dict[lang] = []
    lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)
This gives me a dict with the right keys, but with only the last word for each language, for example for English:
{'eng':[word1]}
Upvotes: 3
Views: 466
Reputation: 26039
Another approach is to use collections.Counter. The following counts the number of words listed under each language:
from collections import Counter
words = []
with open('file') as f:
    for line in f:
        words.append(line.split()[0])
print(Counter(words))
# Counter({'eng': 3, 'ita': 2, 'fra': 1})
To get the count of each word under each language (this assumes the lines are grouped by language, as in the example):
from collections import Counter
words = []
with open('file.txt') as f:
    lines = f.readlines()
prev = lines[0].split()[0]
for line in lines:
    splitted = line.split()
    if splitted[0] != prev:
        # The language changed: report the finished group and start a new one.
        print('{} -> {}'.format(prev, Counter(words)))
        prev = splitted[0]
        words = []
    words.append(splitted[1])
# Report the last group.
print('{} -> {}'.format(prev, Counter(words)))
# eng -> Counter({'word1': 1, 'word2': 1, 'word3': 1})
# ita -> Counter({'word1': 1, 'word2': 1})
# fra -> Counter({'word1': 1})
Upvotes: 1
Reputation: 164613
Similar solution to @shahaf's, but using defaultdict(int) instead of Counter.
I also use csv.DictReader to make the logic clearer.
from collections import defaultdict
import csv
from io import StringIO
mystr = StringIO("""eng word1
eng word2
eng word3
eng word1
ita word1
ita word2
ita word2
fra word1""")
d = defaultdict(lambda: defaultdict(int))
# replace mystr with open('file.csv', 'r')
with mystr as fin:
    reader = csv.DictReader(fin, delimiter=' ', fieldnames=['language', 'word'])
    for line in reader:
        d[line['language']][line['word']] += 1
print(d)
defaultdict({'eng': defaultdict(int, {'word1': 2, 'word2': 1, 'word3': 1}),
'ita': defaultdict(int, {'word1': 1, 'word2': 2}),
'fra': defaultdict(int, {'word1': 1})})
Upvotes: 0
Reputation: 4973
A simple approach is to use a dict where the keys are languages and the values are Counters of word occurrences:
from collections import Counter, defaultdict
lang_and_string_dict = defaultdict(Counter)
with open('file', 'r', encoding='utf8') as f:
    for line in f:
        lang, word = line.split()
        lang_and_string_dict[lang].update([word])
print(lang_and_string_dict)
Output:
defaultdict(<class 'collections.Counter'>, {'eng': Counter({'word1': 1, 'word2': 1, 'word3': 1}), 'ita': Counter({'word1': 1, 'word2': 1}), 'fra': Counter({'word1': 1})})
Keep in mind that the line lang, word = line.split() raises a ValueError if a line does not split into exactly two fields, so it can fail or behave unexpectedly if the lines in the file aren't in the exact "lang word" format. Adding a check or exception handling is suggested.
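One possible way to add such a check is sketched below. The helper name `count_words`, the in-memory `sample` list, and the choice to skip malformed lines while recording their line numbers are all assumptions made for illustration, not part of the answer above.

```python
from collections import Counter, defaultdict


def count_words(lines):
    """Count words per language, skipping lines that are not 'lang word'.

    Hypothetical helper: returns the counts plus the numbers of skipped lines.
    """
    counts = defaultdict(Counter)
    skipped = []
    for line_number, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) != 2:
            # Blank lines, lines with a missing word, or extra fields end up here.
            skipped.append(line_number)
            continue
        lang, word = parts
        counts[lang][word] += 1
    return counts, skipped


# Made-up sample input; an open file object could be passed instead of a list.
sample = ["eng word1\n", "eng word1\n", "\n", "ita word2 extra\n", "fra word1\n"]
counts, skipped = count_words(sample)
print(dict(counts))
print(skipped)
```

Whether to skip, log, or raise on a malformed line depends on how much you trust the input file; skipping is just the simplest option to show.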
Upvotes: 3
Reputation: 476493
Well, each time through the loop you assign an empty list as the value:
data = open('file', 'r', encoding='utf8')
for line in data:
    lang = line[:3]
    ipa_string = line[3:]
    lang_and_string_dict[lang] = []
    lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)
As a result, the old list containing the previous occurrence is lost. You should only create a list if no such element exists already, like:
data = open('file', 'r', encoding='utf8')
for line in data:
    lang = line[:3]
    ipa_string = line[3:]
    if lang not in lang_and_string_dict:
        lang_and_string_dict[lang] = []
    lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)
Since this pattern is rather common, you can use a defaultdict as well:
from collections import defaultdict
lang_and_string_dict = defaultdict(list)
with open('file', 'r', encoding='utf8') as data:
    for line in data:
        lang = line[:3]
        ipa_string = line[3:]
        lang_and_string_dict[lang].append(ipa_string)
print(lang_and_string_dict)
A defaultdict is a subclass of dict that uses a factory (here list) in case a key is missing. So each time a key is looked up that is not in the dictionary, a new list is constructed and stored under that key.
You can later convert such a defaultdict to a dict with dict(lang_and_string_dict).
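The factory behaviour and the conversion back to a plain dict can be seen in a small sketch like the following. The names `groups` and `plain` and the sample keys are made up for illustration.

```python
from collections import defaultdict

groups = defaultdict(list)

# Looking up a missing key calls the factory (list) and stores the result.
groups["eng"].append("word1")
groups["eng"].append("word2")
empty = groups["ita"]  # creates an empty list under 'ita' as a side effect

# dict(...) makes a plain dict with the same keys and values (a shallow copy).
plain = dict(groups)
print(plain)

# A plain dict raises KeyError for a missing key instead of creating an entry.
try:
    plain["fra"]
except KeyError:
    print("'fra' is not in the plain dict")
```

Note that merely reading a missing key from the defaultdict adds it, which is why converting to a plain dict can be useful once the data has been collected.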
Furthermore, if you open(..) files, it is better to do this in a with block, since the file is then properly closed even if an exception is raised.
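To illustrate the point about with blocks, here is a minimal sketch that raises an exception inside the block and then checks that the file was closed anyway. The temporary file and the simulated RuntimeError are assumptions added only to keep the example self-contained.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on 'file'.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("eng word1\n")
    path = tmp.name

try:
    with open(path, "r", encoding="utf8") as data:
        handle = data  # keep a reference so we can inspect it afterwards
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

# The with block closed the file even though an exception was raised.
print(handle.closed)  # True

os.remove(path)  # clean up the temporary file
```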
Upvotes: 2