Reputation: 1618
I would like to create a Python program that finds the unique words in each line of a text file.
The text file "details" has the following lines:
My name is crazyguy
i am studying in a college and i travel by car
my brother brings me food for eating and we will go for shopping after food.
It must return output as follows:
4
10 # (since "i" is repeated)
13 # (since "food" and "for" are repeated)
If the code works, will it work the same way on bigger text files when mining data?
Upvotes: 1
Views: 1231
Reputation: 36346
There's a whole world of solutions that are worse than TigerhawkT3's/Vignesh Kalai's solution. For comparison:
>>> timeit.timeit("len(set(string.split()))", "string=\""+string+"\"")
9.243406057357788
is their implementation. I actually had high hopes for this one:
>>> timeit.timeit("len(set(map(hash,string.split())))", "import numpy\nstring=\""+string+"\"")
14.462514877319336
because here, the set is only built over the hashes. (And because the hashes are numbers, they don't need to be hashed themselves, or so I hoped. Type handling in set probably still kills me; otherwise, in theory, the number of hashes calculated would be the same as in the best solution, but there might have been less awkward PyObject juggling underneath. I was wrong.)
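To make the hash-based idea concrete with the question's second sample line (with the caveat that, strictly speaking, a hash collision could make this variant undercount):
>>> words = "i am studying in a college and i travel by car".split()
>>> len(set(words))
10
>>> len(set(map(hash, words)))  # normally equal; a collision would undercount
10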
So I tried dealing with the hashes in numpy; first with the raw strings, for comparison:
>>> timeit.timeit("len(numpy.unique(string.split()))", "import numpy\nstring=\""+string+"\"")
33.38827204704285
>>> timeit.timeit("len(numpy.unique(map(hash,string.split())))", "import numpy\nstring=\""+string+"\"")
37.22595286369324
>>> timeit.timeit("len(numpy.unique(numpy.array(map(hash,string.split()))))", "import numpy\nstring=\""+string+"\"")
36.20353698730469
Last resort: A Counter might simply circumvent the reduction step. But then again, Python strings are just PyObjects and you really don't gain by having a dict instead of a set:
>>> timeit.timeit("max(Counter(string.split()).values())==1", "from collections import Counter\nstring=\""+string+"\"")
46.88196802139282
>>> timeit.timeit("len(Counter(string.split()))", "from collections import Counter\nstring=\""+string+"\"")
44.15947103500366
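Worth noting: the two Counter snippets answer different questions. The max(...)==1 variant tests whether every word is unique, while len(Counter(...)) is the distinct-word count the question actually asks for:
>>> from collections import Counter
>>> line = "i am studying in a college and i travel by car"
>>> max(Counter(line.split()).values()) == 1  # False, because "i" repeats
False
>>> len(Counter(line.split()))  # 10 distinct words
10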
By the way: Half of the time of the best solution goes into splitting:
>>> timeit.timeit("string.split()", "import numpy\nstring=\""+string+"\"")
4.552565097808838
and, counter-intuitively, that time even increases if you specify that you only want to split on single spaces (rather than on any run of whitespace, the default):
>>> timeit.timeit("string.split(' ')", "import numpy\nstring=\""+string+"\"")
4.713452100753784
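For reference, the timings above don't show how string was built. A minimal harness for reproducing this kind of comparison, assuming a synthetic corpus of repeated words (the vocabulary size and corpus length here are arbitrary stand-ins):
import random
import timeit

# Hypothetical corpus; the original test string isn't shown in the answer.
vocabulary = ["word%d" % i for i in range(1000)]
string = " ".join(random.choice(vocabulary) for _ in range(100000))

# Time the fastest variant from above, 100 runs.
print(timeit.timeit("len(set(string.split()))",
                    "from __main__ import string", number=100))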
Upvotes: 3
Reputation: 8335
You could iterate over the lines, split each line into a list of words, convert that list to a set to drop duplicates, and take the set's length as the unique-word count:
with open("filename","r") as inp:
for line in inp:
print len(set(line.split()))
Upvotes: 5
Reputation: 49318
with open('details.txt', 'r') as f:
    for line in f:
        print(len(set(line.split())))
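One caveat for both of these answers: plain split() leaves punctuation attached, so "food." and "food" in the third sample line count as two different words and the printed count is 14, not the expected 13. A minimal sketch that strips surrounding punctuation first:
import string

with open('details.txt', 'r') as f:
    for line in f:
        # strip leading/trailing punctuation so "food." and "food" compare equal
        words = (word.strip(string.punctuation) for word in line.split())
        print(len(set(words)))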
Upvotes: 5