Reputation: 3158
I'm trying to tokenize the words in a text file using Python 3.5, but I'm running into a couple of errors. Here is the code:
import re

f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')
a = 0
c = 0
d = []
for line in f:
    b = re.split('[^a-z]', line.lower())
    a += len(filter(None, b))
    c = c + 1
    d = d + b
print(a)
print(c)
My questions:
1. The construction a += len(filter(None, b)) works fine in Python 2.7, but in 3.5 it raises an error:
TypeError: object of type 'filter' has no len()
How can this be solved in Python 3.5?
2. When I do the tokenization, my code also counts empty strings as word tokens. How can I remove them?
Thanks!
Upvotes: 0
Views: 348
Reputation: 78564
You need an explicit conversion to list in Python 3.5 to get the length of your sequence, because filter returns an iterator object rather than a list as in Python 2.7:
a += len(list(filter(None, b)))
#        ^^^^
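As an aside, if you only need the count, a generator expression avoids materializing the list at all (a minor alternative sketch, not required for the fix):

>>> b = ['sdksljd', 'sdjsh', '', '', 'hjs']
>>> sum(1 for i in b if i)  # counts non-empty tokens without building a list
3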
The empty strings were returned by your re.split, e.g.:
>>> line = 'sdksljd sdjsh 1213hjs sjdks'
>>> b = re.split('[^a-z]', line.lower())
>>> b
['sdksljd', 'sdjsh', '', '', '', '', 'hjs', 'sjdks']
Each run of adjacent non-letter characters (e.g. the digits in 1213) produces empty strings between them. You can remove them with an if filter in a list comprehension on the result of your re.split, like so:
b = [i for i in re.split('[^a-z]', line.lower()) if i]
The if i part of the list comprehension evaluates to False for an empty string, because bool('') is False. So the empty strings are cleared.
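Continuing the interactive session from above, the comprehension drops the empty entries:

>>> [i for i in re.split('[^a-z]', line.lower()) if i]
['sdksljd', 'sdjsh', 'hjs', 'sjdks']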
The same result can also be achieved with filter (which you already used when computing a):
b = list(filter(None, re.split('[^a-z]', line.lower()))) # use the list comprehension if you don't like brackets
Finally, a can be computed after either approach simply as:
a += len(b)
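Putting it all together, a minimal corrected sketch of your original loop (reusing your file path and variable names, and using a with block so the file is closed automatically) might look like:

import re

a = 0   # total number of word tokens
c = 0   # number of lines
d = []  # all tokens collected so far

with open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r') as f:
    for line in f:
        b = [i for i in re.split('[^a-z]', line.lower()) if i]
        a += len(b)
        c += 1
        d += b

print(a)
print(c)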
Upvotes: 1