Reputation: 3175
I have some questions about defaultdict and Counter. I have a text file with one sentence per line. I want to split each sentence in two at the first space and store the parts in a dictionary, with the first substring as the key and the second substring as the value. The reason for doing this is so that I can get the total number of sentences that share the same key.
Text file format:
id1 This is an example
id3 Hello World
id1 This is also an example
id4 Hello Hello World
.
.
This is what I have tried but it doesn't work. I have looked at Counter but it's a bit tricky in my situation.
try:
    openFileObject = open('test.txt', "r")
    try:
        with openFileObject as infile:
            for line in infile:
                #Break up line into two strings at first space
                tempLine = line.split(' ' , 1)
                classDict = defaultdict(tempLine)
                for tempLine[0], tempLine[1] in tempLine:
                    classDict[tempLine[0]].append(tempLine[1])
        #Get the total number of keys
        len(classDict)
        #Get value for key id1 (should return 2)
    finally:
        print 'Done.'
        openFileObject.close()
except IOError:
    pass
Is there a way to do this without splitting up the sentences and storing them as tuples in a huge list before using Counter or defaultdict? Thanks!
EDIT: Thanks to all who answered. I finally found out where I went wrong in this. I edited the program with all the suggestions given by everyone.
from collections import defaultdict

openFileObject = open(filename, "r")
tempList = []
with openFileObject as infile:
    for line in infile:
        tempLine = line.split(' ' , 1)
        tempList.append(tempLine)
classDict = defaultdict(list) #My error was here: I used tempLine instead of list
for key, value in tempList:
    classDict[key].append(value)
print len(classDict)
print len(classDict['key'])
Upvotes: 3
Views: 325
Reputation: 304335
Using collections.Counter to "get a total number of sentences that share the same key":
from collections import Counter
with openFileObject as infile:
    print Counter(x.split()[0] for x in infile)
will print
Counter({'id1': 2, 'id4': 1, 'id3': 1})
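A self-contained sketch of the same idea, for trying it without a file on disk. The sample lines are taken from the question; `sample_lines` and `counts` are names made up for this illustration, and the code assumes Python 3, where `print` is a function.

```python
from collections import Counter

# Sample data from the question; a real run would iterate over the open file instead.
sample_lines = [
    "id1 This is an example\n",
    "id3 Hello World\n",
    "id1 This is also an example\n",
    "id4 Hello Hello World\n",
]

# Count the first whitespace-separated token of every non-blank line.
counts = Counter(line.split(None, 1)[0] for line in sample_lines if line.strip())

print(len(counts))    # number of distinct keys
print(counts["id1"])  # number of sentences that start with id1
print(counts["id9"])  # a key that never appeared counts as 0 and is not added
```

Nothing but the keys and their counts is kept in memory, so the sentences never have to be collected into a list first.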
If you want to store a list of all the lines, your main mistake is here:
classDict = defaultdict(tempLine)
The argument to defaultdict has to be a callable that produces the default value, so for this pattern you should be using
classDict = defaultdict(list)
But there's no point storing all those lines in a list if you're just intending to take the length.
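To make the difference concrete, here is a minimal sketch, assuming Python 3. It first passes a list instance the way the question's code did, then passes the `list` type itself. The names `error` and `groups` and the inline sample lines are made up for illustration.

```python
from collections import defaultdict

# Passing a list instance (like tempLine) fails: the first argument
# must be callable or None.
try:
    defaultdict(["id1", "This is an example"])
    error = None
except TypeError as exc:
    error = exc

# Passing the list type works: each new key starts with its own empty list.
groups = defaultdict(list)
for line in ["id1 This is an example", "id3 Hello World", "id1 This is also an example"]:
    key, sentence = line.split(" ", 1)
    groups[key].append(sentence)

print(type(error).__name__)  # the exception class raised by the first call
print(len(groups["id1"]))    # number of sentences stored under id1
```

This version is only worth the memory if the sentences themselves are needed later.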
Upvotes: 2
Reputation: 8925
Full example of defaultdict (and improved way of displaying classDict)
from collections import defaultdict
classDict = defaultdict(int)
with open('text.txt') as f:
    for line in f:
        first_word = line.split()[0]
        classDict[first_word] += 1
print(len(classDict))
# items() works on both Python 2 and Python 3
for key, value in classDict.items():
    print('{}: {}'.format(key, value))
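One thing to keep in mind with a defaultdict: indexing it with a key it has not seen stores the default under that key, so a lookup such as `classDict['key']` can change what `len(classDict)` reports afterwards. A small sketch of that behaviour, assuming Python 3 and using made-up names and sample keys:

```python
from collections import defaultdict

class_dict = defaultdict(int)
for key in ["id1", "id3", "id1", "id4"]:
    class_dict[key] += 1

before = len(class_dict)

# These lookups do not create a new entry.
has_key = "id9" in class_dict
safe_count = class_dict.get("id9", 0)
after_safe = len(class_dict)

# Indexing a missing key calls the factory and stores the result.
indexed_count = class_dict["id9"]
after_index = len(class_dict)

print(before, after_safe, after_index)
```

Using `in` or `.get()` for read-only checks keeps the key count accurate.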
Upvotes: 1
Reputation: 369224
dict.get(key, 0) returns the current accumulated count for key, or 0 if the key is not in the dict yet.
classDict = {}
with open('text.txt') as infile:
    for line in infile:
        key = line.split(' ' , 1)[0]
        classDict[key] = classDict.get(key, 0) + 1
print(len(classDict))
for key in classDict:
    print('{}: {}'.format(key, classDict[key]))
http://docs.python.org/3/library/stdtypes.html#dict.get
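The same loop can be wrapped in a small function so it works on any iterable of lines, not just an open file. This is only a sketch, assuming Python 3: `count_keys` is a hypothetical helper name, and it splits on any whitespace and skips blank lines, which the loop above does not do.

```python
def count_keys(lines):
    """Count how many lines start with each key (hypothetical helper)."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        # split(None, 1) splits on any run of whitespace, at most once
        key = line.split(None, 1)[0]
        counts[key] = counts.get(key, 0) + 1
    return counts


counts = count_keys([
    "id1 This is an example\n",
    "\n",
    "id3 Hello World\n",
    "id1 This is also an example\n",
])

print(len(counts))           # number of distinct keys
print(counts.get("id1", 0))  # sentences that start with id1
print(counts.get("id9", 0))  # unseen key: 0, and the dict is left unchanged
```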
Upvotes: 1