David R

Reputation: 23

Function works for small samples but not larger ones (Python)

I'm trying to write a function that checks whether two words appear within a certain distance of one another. My code is as follows:



file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'],
             ['python', 'code', 'cool', 'uses', 'java'],
             ['man', 'help', 'test', 'weird', 'love']]  # words I want to measure 'distance' between

dat = [{ind: val for val, ind in enumerate(el)} for el in file_cont]

def myfunc(w1, w2, dist, dat):
    arr = []
    for x in dat:
        i1 = x.get(w1)
        i2 = x.get(w2)
        if (i1 is not None) and (i2 is not None) and (i2 - i1 <= dist ):    
            arr.append(list(x.keys())[i1:i2+1])
    return arr

It works in this instance:

myfunc("man", "love", 4, dat) returns [['man', 'once', 'upon', 'time', 'love'], ['man', 'help', 'test', 'weird', 'love']], which is what I want.

The problem is that when I use a much bigger dataset (each element of file_cont becomes thousands of words), it outputs odd results.

For example, I know the words 'jon' and 'snow' appear together at least once in one of the elements of file_cont.

When I do myfunc('jon','snow',6,dat) I get:

[[], [], ['castle', 'ward'], [], [], []]

This is completely out of context: it doesn't mention 'jon' or 'snow'.

What is the problem here and how would I go about fixing it?

Upvotes: 0

Views: 49

Answers (1)

bglbrt

Reputation: 2098

The problem is that your text may contain multiple occurrences of the same word, which becomes much more likely with larger excerpts.

Here's a minimal working example showing how the function may fail:

new_file = [['man', 'once', 'man', 'time', 'love', 'once']]
data = [{ind: val for val, ind in enumerate(el)} for el in new_file]

def myfunc(w1, w2, dist, dat):
    arr = []
    for x in dat:
        i1 = x.get(w1)
        i2 = x.get(w2)
        if (i1 is not None) and (i2 is not None) and (i2 - i1 <= dist ):    
            arr.append(list(x.keys())[i1:i2+1])
    return arr

myfunc("man", "love", 4, data)
# > [['time', 'love']]

Notice that here, your dictionary will look like this:

# > [{'man': 2, 'once': 5, 'time': 3, 'love': 4}]

This is because, when the dictionary is built, each new occurrence of a word overwrites the value stored under that word's key with the newly observed (higher) index, while the key itself keeps its original position. As a result, myfunc fails: the positions in list(x.keys()) no longer correspond to the indices of the words in the excerpt, so the slice [i1:i2+1] picks out the wrong words.
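
To make the mismatch visible, the short sketch below rebuilds the dictionary from the example above and compares each word's position in the key list with the index stored for it. The names `excerpt`, `lookup` and `mismatches` are illustrative and do not appear in the original code.

```python
excerpt = ['man', 'once', 'man', 'time', 'love', 'once']

# Same construction as in the question: word -> index of its last occurrence.
lookup = {word: idx for idx, word in enumerate(excerpt)}

# A dict holds one entry per distinct word, so there are fewer keys than words.
keys = list(lookup.keys())
print(len(excerpt), len(keys))
# > 6 4

# (word, position in the key list, index stored in the dict) where the two differ.
mismatches = [(word, pos, lookup[word])
              for pos, word in enumerate(keys)
              if pos != lookup[word]]
print(mismatches)
# > [('man', 0, 2), ('once', 1, 5), ('time', 2, 3), ('love', 3, 4)]
```

Every word is out of place here, which is why slicing the key list with the stored indices returns unrelated words.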


One way to achieve what you want could be:

data = ['man', 'once', 'upon', 'man', 'time', 'love', 'princess', 'man']
w1 = 'man'
w2 = 'love'
dist = 3

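def new_func(w1, w2, dist, data):

    # every position at which each word occurs, so repeated words are not lost
    w1_indices = [i for i, x in enumerate(data) if x == w1]
    w2_indices = [i for i, x in enumerate(data) if x == w2]

    # compare each occurrence of w1 with each occurrence of w2
    for i in w1_indices:
        for j in w2_indices:
            # strict "<": words exactly dist positions apart are not matched
            if abs(i-j) < dist:
                print(data[min(i, j):max(i, j)+1])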
                
new_func(w1, w2, dist, data)
# > ['man', 'time', 'love']
# > ['love', 'princess', 'man']

With a list of lists like in your case, you can do:

file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'],
             ['python', 'code', 'cool', 'uses', 'java'],
             ['man', 'help', 'test', 'weird', 'love']]

# new_func prints its matches and returns None, so call it in a plain loop.
# dist is 5 here because new_func uses a strict "<" and 'man' and 'love' are 4 positions apart.
for x in file_cont:
    new_func(w1, w2, 5, x)
# > ['man', 'once', 'upon', 'time', 'love']
# > ['man', 'help', 'test', 'weird', 'love']
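
If you would rather collect the matches than print them, a variant that returns a list could look like the sketch below. The name `find_spans` is hypothetical, and it uses `<=` for the distance test to match the condition in your original function (new_func above uses a strict `<`).

```python
def find_spans(w1, w2, dist, words):
    """Return every slice of `words` that runs between an occurrence of w1
    and an occurrence of w2 lying at most `dist` positions apart."""
    w1_indices = [i for i, w in enumerate(words) if w == w1]
    w2_indices = [i for i, w in enumerate(words) if w == w2]

    spans = []
    for i in w1_indices:
        for j in w2_indices:
            if abs(i - j) <= dist:
                spans.append(words[min(i, j):max(i, j) + 1])
    return spans


file_cont = [['man', 'once', 'upon', 'time', 'love', 'princess'],
             ['python', 'code', 'cool', 'uses', 'java'],
             ['man', 'help', 'test', 'weird', 'love']]

# Flatten the per-excerpt results into a single list of matching spans.
results = [span
           for excerpt in file_cont
           for span in find_spans('man', 'love', 4, excerpt)]
print(results)
# > [['man', 'once', 'upon', 'time', 'love'], ['man', 'help', 'test', 'weird', 'love']]
```

Because it compares every occurrence of one word with every occurrence of the other, this can get slow if both words are very frequent in a long excerpt, but it should be fine for excerpts of a few thousand words.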

Upvotes: 1
