Reputation: 883
I have two very large lists(that's why I used ...
), a list of lists:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],...,['how to match and return the frequency?']]
and a list of strings:
y = ['hi', 'nice', 'ok',..., 'frequency']
I would like to return in a new list the times (count) that any word in y
occurred in all the lists of x
. For example, for the above lists, this should be the correct output:
[(1,2),(2,0),(3,1),...,(n,count)]
As follows, [(1,count),...,(n,count)]
. Where n
is the number of the list and count
the number of times that any word from y
appeared in x
. Any idea of how to approach this?.
Upvotes: 0
Views: 161
Reputation: 1156
INPUT:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],
['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
CODE:
import re
s1 = set(y)
index = 0
result = []
for itr in x:
itr = re.sub('[!.?]', '',itr[0].lower()).split(' ')
# remove special chars and convert to lower case
s2 = set(itr)
intersection = s1 & s2
#find intersection of common strings
num = len(intersection)
result.append((index,num))
index = index+1
OUTPUT:
result = [(0, 2), (1, 0), (2, 1), (3, 1)]
Upvotes: 2
Reputation: 2667
Maybe you could concatenate the strings in x
to make the computation easy:
w = ' '.join(i[0] for i in x)
Now w
is a long string like this:
>>> w
"I like stackoverflow. Hi ok! this is a great community Ok, I didn't like this!. how to match and return the frequency?"
With this conversion, you can simply do this:
>>> l = []
>>> for i in range(len(y)):
l.append((i+1, w.count(str(y[i]))))
which gives you:
>>> l
[(1, 2), (2, 0), (3, 1), (4, 0), (5, 1)]
Upvotes: 1
Reputation: 174706
You could do like this also.
>>> x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
>>> y = ['hi', 'nice', 'ok', 'frequency']
>>> l = []
>>> for i,j in enumerate(x):
c = 0
for x in y:
if re.search(r'(?i)\b'+x+r'\b', j[0]):
c += 1
l.append((i+1,c))
>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
(?i)
will do a case-insensitive match. \b
called word boundaries which matches between a word character and a non-word character.
Upvotes: 1
Reputation: 881695
First, you should preprocess x
into a list of sets of lowercased words -- that will speed up the following lookups enormously. E.g:
ppx = []
for subx in x:
ppx.append(set(w.lower() for w in re.finditer(r'\w+', subx))
(yes, you could collapse this into a list comprehension, but I'm aiming for some legibility).
Next, you loop over y
, checking how many of the sets in ppx
contain each item of y
-- that would be
[sum(1 for s in ppx if w in s) for w in y]
That doesn't give you those redundant first items you crave, but enumerate
to the rescue...:
list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))
should give exactly what you require.
Upvotes: 3
Reputation: 1
You can make a dictionary where key is each item in the "Y" List. Loop through the values of the keys and look up for them in the dictionary. Keep updating the value as soon as you encounter the word into your X nested list.
Upvotes: 0
Reputation: 4912
Here is a more readable solution. Check my comments in the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
assert len(x)==len(y), "you have to make sure length of x equals y's"
num = []
for i in xrange(len(y)):
# lower all the strings in x for comparison
# find all matched patterns in x and count it, and store result in variable num
num.append(len(re.findall(y[i], x[i][0].lower())))
res = []
# use enumerate to give output in format you want
for k, v in enumerate(num):
res.append((k,v))
# here is what you want
print res
OUTPUT:
[(0, 1), (1, 0), (2, 1), (3, 1)]
Upvotes: 2