Bharath

Reputation: 1618

Finding the number of unique words in each line of a text file

I would like to write a Python program that finds the number of unique words in each line of a text file.

The text file "details" has the following lines:

My name is crazyguy
i am studying in a college and i travel by car
my brother brings me food for eating and we will go for shopping after food.

It must return the output as:

4
10 #(since i is repeated)
13 #(since food and for are repeated)

If the code works, will it work the same way on bigger text files when mining data?

Upvotes: 1

Views: 1231

Answers (3)

Marcus Müller

Reputation: 36346

There's a whole world of solutions that are worse than TigerhawkT3's/Vignesh Kalai's solution. For comparison:

>>> timeit.timeit("len(set(string.split()))", "string=\""+string+"\"")
9.243406057357788
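(The string variable used in these benchmarks isn't shown anywhere in the transcript; a purely hypothetical setup for reproducing the comparison might look like the following, so absolute numbers will differ:)

>>> import timeit
>>> string = "some words repeated some words " * 200000  # made-up corpus; the original test string is not shown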

is their implementation. I actually had high hopes for this one:

>>> timeit.timeit("len(set(map(hash,string.split())))", "import numpy\nstring=\""+string+"\"")
14.462514877319336

because here, the set is only built over the hashes. (And because the hashes are numbers, I hoped they wouldn't need to be hashed again themselves; in theory, the number of hashes calculated would be the same as in the best solution, but with less awkward PyObject juggling underneath. Type handling in set probably still kills me. I was wrong.)
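The hope had some footing; in CPython, small integers are their own hash, so a set built over precomputed hashes has almost no hashing work left to do:

>>> hash(12345)  # CPython hashes small ints to themselves
12345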

So I tried dealing with the hashes in numpy; first with the raw strings, for comparison:

>>> timeit.timeit("len(numpy.unique(string.split()))", "import numpy\nstring=\""+string+"\"")
33.38827204704285
>>> timeit.timeit("len(numpy.unique(map(hash,string.split())))", "import numpy\nstring=\""+string+"\"")
37.22595286369324
>>> timeit.timeit("len(numpy.unique(numpy.array(map(hash,string.split()))))", "import numpy\nstring=\""+string+"\"")
36.20353698730469
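(A version caveat, since these look like Python 2 timings: in Python 3, map() returns an iterator, which numpy.unique() would wrap as a single object instead of consuming, so an explicit list() would be needed there:)

>>> numpy.unique(list(map(hash, string.split())))  # Python 3 needs the explicit list()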

Last resort: a Counter might simply circumvent the reduction step. But then again, Python strings are just PyObjects, and you don't really gain anything by building a dict instead of a set:

>>> timeit.timeit("max(Counter(string.split()).values())==1", "from collections import Counter\nstring=\""+string+"\"")
46.88196802139282
>>> timeit.timeit("len(Counter(string.split()))", "from collections import Counter\nstring=\""+string+"\"")
44.15947103500366
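A quick sanity check of that claim on illustrative words (not the benchmark corpus): the Counter's keys are exactly the set's elements, so the dict only adds bookkeeping:

>>> from collections import Counter
>>> words = "food for thought for free".split()
>>> len(Counter(words)) == len(set(words))
True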

By the way: Half of the time of the best solution goes into splitting:

>>> timeit.timeit("string.split()", "import numpy\nstring=\""+string+"\"")
4.552565097808838

and, counter-intuitively, that time even increases if you specify that you only want to split on spaces (rather than on all whitespace):

>>> timeit.timeit("string.split(' ')", "import numpy\nstring=\""+string+"\"")
4.713452100753784
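Part of the explanation may be behavioral: with no argument, split() collapses runs of whitespace, while split(' ') must emit an empty string for every consecutive pair of spaces, which would also distort the unique-word count:

>>> "a  b".split()
['a', 'b']
>>> "a  b".split(' ')
['a', '', 'b']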

Upvotes: 3

The6thSense

Reputation: 8335

You could traverse the file line by line, split each line into a list of words, turn that list into a set to drop the duplicates, and take the set's length to get the unique-word count:

with open("filename","r") as inp:
     for line in inp:
         print len(set(line.split()))

Upvotes: 5

TigerhawkT3

Reputation: 49318

with open('details.txt', 'r') as f:
    for line in f:
        print(len(set(line.split())))
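One caveat on the sample data: str.split() keeps punctuation attached, so "food." and "food" in the last line count as two different words, and the code above would print 14 rather than the expected 13. A minimal sketch that strips surrounding punctuation first, assuming the asker wants "food." and "food" to match:

import string

with open('details.txt', 'r') as f:
    for line in f:
        # strip punctuation from the ends of each token so "food." == "food"
        words = (w.strip(string.punctuation) for w in line.split())
        print(len(set(words)))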

Upvotes: 5
