Reputation: 23
I'm trying to write a function that checks whether two words appear within a certain distance of one another. My code is as follows:
file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'],
             ['python', 'code', 'cool', 'uses', 'java'],
             ['man', 'help', 'test', 'weird', 'love']]  # words I want to measure 'distance' between
dat = [{ind: val for val, ind in enumerate(el)} for el in file_cont]
def myfunc(w1, w2, dist, dat):
    arr = []
    for x in dat:
        i1 = x.get(w1)
        i2 = x.get(w2)
        if (i1 is not None) and (i2 is not None) and (i2 - i1 <= dist):
            arr.append(list(x.keys())[i1:i2+1])
    return arr
It works in this instance:
myfunc("man", "love", 4, dat) returns [['man', 'once', 'upon', 'time', 'love'], ['man', 'help', 'test', 'weird', 'love']], which is what I want.
The problem I have is that when I use a much bigger dataset (the elements of file_cont become thousands of words long), it outputs odd results.
For example, I know the words 'jon' and 'snow' appear together at least once in one of the elements of file_cont.
When I do myfunc('jon','snow',6,dat) I get:
[[], [], ['castle', 'ward'], [], [], []]
This is completely out of context: it doesn't mention 'jon' or 'snow'.
What is the problem here and how would I go about fixing it?
Upvotes: 0
Views: 49
Reputation: 2098
The problem comes from the fact that your text may contain multiple occurrences of the same word, which you typically observe with larger excerpts.
Here's a minimal working example showing how the function may fail:
new_file = [['man', 'once', 'man', 'time', 'love', 'once']]
data = [{ind: val for val, ind in enumerate(el)} for el in new_file]
def myfunc(w1, w2, dist, dat):
    arr = []
    for x in dat:
        i1 = x.get(w1)
        i2 = x.get(w2)
        if (i1 is not None) and (i2 is not None) and (i2 - i1 <= dist):
            arr.append(list(x.keys())[i1:i2+1])
    return arr
myfunc("man", "love", 4, data)
# > [['time', 'love']]
Notice that here, your dictionary will look like this:
# > [{'man': 2, 'once': 5, 'time': 3, 'love': 4}]
This is because, when creating the dictionary, each new occurrence of a word overwrites the value stored under its key with the newly observed (higher) index, while the key itself keeps its original position in the dictionary. Thus, the function myfunc fails: the stored indices no longer correspond to the positions of the keys in list(x.keys()), so the slice picks up the wrong words.
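To make the information loss concrete, here is a minimal sketch that compares the one-index-per-word dictionary with a mapping that keeps every position. The helper name build_positions is hypothetical, introduced only for this illustration:

```python
from collections import defaultdict

def build_positions(words):
    """Map each word to the list of every index where it occurs (hypothetical helper)."""
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)
    return dict(positions)

excerpt = ['man', 'once', 'man', 'time', 'love', 'once']

# The dict comprehension from the question keeps only the LAST index of each word.
last_index = {word: i for i, word in enumerate(excerpt)}
print(last_index)
# {'man': 2, 'once': 5, 'time': 3, 'love': 4}

# Keeping a list per word preserves every occurrence.
all_indices = build_positions(excerpt)
print(all_indices)
# {'man': [0, 2], 'once': [1, 5], 'time': [3], 'love': [4]}
```

The first 'man' (index 0) and the first 'once' (index 1) are simply gone from last_index, which is why slicing by those stored values returns unrelated words.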
A way to achieve what you want could be, for instance, to collect every index of both words and compare them pairwise:
data = ['man', 'once', 'upon', 'man', 'time', 'love', 'princess', 'man']
w1 = 'man'
w2 = 'love'
dist = 3
def new_func(w1, w2, dist, data):
    # Collect every position of each word, not just the last one.
    w1_indices = [i for i, x in enumerate(data) if x == w1]
    w2_indices = [i for i, x in enumerate(data) if x == w2]
    matches = []
    for i in w1_indices:
        for j in w2_indices:
            # <= keeps the "at most dist apart" meaning of your original check.
            if abs(i - j) <= dist:
                matches.append(data[min(i, j):max(i, j) + 1])
    return matches
print(new_func(w1, w2, dist, data))
# > [['man', 'time', 'love'], ['love', 'princess', 'man']]
With a list of lists like in your case, you can call it once per sublist. Here dist has to be 4, since 'man' and 'love' are four positions apart in your example:
file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'],
             ['python', 'code', 'cool', 'uses', 'java'],
             ['man', 'help', 'test', 'weird', 'love']]
results = [new_func(w1, w2, 4, x) for x in file_cont]
print(results)
# > [[['man', 'once', 'upon', 'time', 'love']], [], [['man', 'help', 'test', 'weird', 'love']]]
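Since your real sublists contain thousands of words, comparing every pair of indices can get slow when both words are frequent. As a rough sketch of one possible variation (not a drop-in from the code above), the standard bisect module can restrict each lookup to the w2 positions that fall inside the allowed window. The function name spans_within is hypothetical, and "within dist" is assumed here to mean at most dist positions apart:

```python
from bisect import bisect_left, bisect_right

def spans_within(words, w1, w2, dist):
    """Return every slice of `words` bounded by w1 and w2 at most `dist` positions apart."""
    w1_indices = [i for i, w in enumerate(words) if w == w1]
    w2_indices = [i for i, w in enumerate(words) if w == w2]  # sorted by construction
    spans = []
    for i in w1_indices:
        # Only look at w2 positions inside the window [i - dist, i + dist].
        lo = bisect_left(w2_indices, i - dist)
        hi = bisect_right(w2_indices, i + dist)
        for j in w2_indices[lo:hi]:
            spans.append(words[min(i, j):max(i, j) + 1])
    return spans

file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'],
             ['python', 'code', 'cool', 'uses', 'java'],
             ['man', 'help', 'test', 'weird', 'love']]

results = [spans_within(doc, 'man', 'love', 4) for doc in file_cont]
print(results)
# [[['man', 'once', 'upon', 'time', 'love']], [], [['man', 'help', 'test', 'weird', 'love']]]
```

Each sublist gets its own list of matches, so an empty list means the two words never occur close enough together in that sublist.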
Upvotes: 1