boltthrower

Reputation: 1250

Process a list in a Dataframe column

I created a DataFrame, neighbours, from sim_measure_i, which is also a DataFrame.

neighbours= sim_measure_i.apply(lambda s: s.nlargest(k).index.tolist(), axis =1)

neighbours looks like this:

1500                       [0, 1, 2, 3, 4]
1501                       [0, 1, 2, 3, 4]
1502                       [0, 1, 2, 3, 4]
1503     [7230, 12951, 13783, 8000, 18077]
1504                     [1, 3, 6, 27, 47]

The second column here holds lists. I want to iterate over this DataFrame and read each element of each list (say 7230), then look up the score for that element in another DataFrame I have, which contains (id, score) pairs.

I would then like to add a column to this DataFrame so that it looks like this:

test_case_id               nbr_list             scores             
1500                       [0, 1, 2, 3, 4]        [+1, -1, -1, +1, -1]
1501                       [0, 1, 2, 3, 4]        [+1, +1, +1, -1, -1]
1502                       [0, 1, 2, 3, 4]        [+1, +1, +1, -1, -1]
1503     [7230, 12951, 13783, 8000, 18077]        [+1, +1, +1, -1, -1]
1504                     [1, 3, 6, 27, 47]        [+1, +1, +1, -1, -1]

Edit: I've written a method get_scores()

def get_scores(list_of_neighbours):
    score_matrix = []
    for x, val in enumerate(list_of_neighbours):
        score_matrix.append(df.iloc[val].score)
    return score_matrix

When I apply it to each nbr_list with a lambda, I get this error:

TypeError: ("cannot do positional indexing on <class 'pandas.indexes.numeric.Int64Index'> with these indexers [0] of <type 'str'>", u'occurred at index 1500')

The code causing this issue:

def nearest_neighbours(similarity_matrix, k):
    neighbours = pd.DataFrame(similarity_matrix.apply(lambda s: s.nlargest(k).index.tolist(), axis =1))
    neighbours = neighbours.rename(columns={0 : 'nbr_list'})

    nbr_scores = neighbours.apply(lambda l: get_scores(l.nbr_list), axis=1)

    print neighbours
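
The error message suggests that the values collected in nbr_list are strings (for example '0') while iloc accepts only integer positions. A minimal sketch of one way around this, assuming the similarity matrix's column labels are strings that represent integer ids and that the score frame is indexed by id; the names sim and scores and all sample values here are hypothetical:

```python
import pandas as pd

# Hypothetical similarity matrix whose column labels are strings, not ints.
sim = pd.DataFrame(
    [[0.1, 0.9, 0.5], [0.8, 0.2, 0.3]],
    index=[1500, 1501],
    columns=['0', '1', '2'],
)
# Hypothetical score frame indexed by neighbour id.
scores = pd.DataFrame({'score': [1, -1, 1]}, index=[0, 1, 2])

k = 2

# Cast the column labels to int so the collected neighbour ids are integers.
sim.columns = sim.columns.astype(int)
nbr_list = sim.apply(lambda s: s.nlargest(k).index.tolist(), axis=1)

# .loc looks rows up by label (the id) rather than by position.
nbr_scores = nbr_list.apply(lambda ids: scores.loc[ids, 'score'].tolist())
```

Using .loc instead of .iloc matters when the ids do not coincide with row positions in the score frame.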

Upvotes: 1

Views: 1109

Answers (3)

Ami Tavory

Reputation: 76297

Say you start with neighbors looking like this.

In [87]: neighbors = pd.DataFrame({'neighbors_list': [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]})

In [88]: neighbors
Out[88]: 
    neighbors_list
0  [0, 1, 2, 3, 4]
1  [0, 1, 2, 3, 4]

You didn't specify exactly what the other DataFrame (containing the id-score pairs) looks like, so here is an approximation.

In [89]: id_score = pd.DataFrame({'id': [0, 1, 2, 3, 4], 'score': [1, -1, -1, 1, -1]})

In [90]: id_score
Out[90]: 
   id  score
0   0      1
1   1     -1
2   2     -1
3   3      1
4   4     -1

You can convert this to a dictionary:

In [91]: d = id_score.set_index('id')['score'].to_dict()

And then apply:

In [92]: neighbors.neighbors_list.apply(lambda l: [d[e] for e in l])
Out[92]: 
0    [1, -1, -1, 1, -1]
1    [1, -1, -1, 1, -1]
Name: neighbors_list, dtype: object
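
To keep the result next to the lists, the same lookup can be assigned to a new column. The following is a sketch rather than a definitive version: the column name scores and the sample ids are assumptions, and d.get is swapped in for d[e] so that an id missing from id_score yields None instead of raising KeyError.

```python
import pandas as pd

# Hypothetical sample data; id 99 is deliberately absent from id_score.
neighbors = pd.DataFrame({'neighbors_list': [[0, 1, 2, 3, 4], [1, 3, 99]]})
id_score = pd.DataFrame({'id': [0, 1, 2, 3, 4], 'score': [1, -1, -1, 1, -1]})

# Build the id -> score lookup table.
d = id_score.set_index('id')['score'].to_dict()

# Map every id in every list to its score; unknown ids become None.
neighbors['scores'] = neighbors['neighbors_list'].apply(
    lambda l: [d.get(e) for e in l]
)
```

If a missing id should be treated as an error instead, keeping d[e] makes the failure visible immediately.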

Upvotes: 1

Harshavardhan Ramanna

Reputation: 738

You can try a nested loop:

for i in range(neighbours.shape[0]):  # iterate over each row
    for j in range(len(neighbours['neighbours_lists'].iloc[i])):  # iterate over each element of the list
        a = neighbours['neighbours_lists'].iloc[i][j]  # element at index j of the list in row i

where i is the outer loop variable which iterates over each row and j is the inner loop variable which iterates over the length of the list inside each cell.
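
The loop above only reads each element. A minimal sketch of how it could also collect the scores, assuming a lookup dict named score_by_id (hypothetical, as are the sample values) and iterating over the lists directly instead of over index ranges:

```python
import pandas as pd

# Hypothetical sample data.
neighbours = pd.DataFrame({'neighbours_lists': [[0, 1, 2], [2, 0]]})
score_by_id = {0: 1, 1: -1, 2: -1}  # hypothetical id -> score lookup

all_scores = []
for nbr_list in neighbours['neighbours_lists']:  # one list per row
    row_scores = []
    for nbr_id in nbr_list:  # one id per list element
        row_scores.append(score_by_id[nbr_id])
    all_scores.append(row_scores)

# Wrap in a Series aligned to the frame's index before assigning.
neighbours['scores'] = pd.Series(all_scores, index=neighbours.index)
```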

Upvotes: 1

Nehal J Wani

Reputation: 16619

Original Data Frame:

In [68]: df
Out[68]: 
   test_case_id                   neighbours_lists
0          1500                    [0, 1, 2, 3, 4]
1          1501                    [0, 1, 2, 3, 4]
2          1502                    [0, 1, 2, 3, 4]
3          1503  [7230, 12951, 13783, 8000, 18077]
4          1504                  [1, 3, 6, 27, 47]

A custom function that takes an id and a list and does some computation to evaluate the scores:

In [69]: def g(_id, nbs):
    ...:     return ['-1' if (_id + 1) % (nb + 1) else '+1' for nb in nbs]
    ...: 

Apply custom function to all rows of original data frame:

In [70]: scores = df.apply(lambda x: g(x.test_case_id, x.neighbours_lists), axis=1)

Convert the scores series to a data frame and concat it with the original data frame:

In [71]: df = pd.concat([df, scores.to_frame(name='scores')], axis=1)

In [72]: df
Out[72]: 
   test_case_id                   neighbours_lists                scores
0          1500                    [0, 1, 2, 3, 4]  [+1, -1, -1, -1, -1]
1          1501                    [0, 1, 2, 3, 4]  [+1, +1, -1, -1, -1]
2          1502                    [0, 1, 2, 3, 4]  [+1, -1, +1, -1, -1]
3          1503  [7230, 12951, 13783, 8000, 18077]  [-1, -1, -1, -1, -1]
4          1504                  [1, 3, 6, 27, 47]  [-1, -1, +1, -1, -1]
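
As an alternative to pd.concat, the scores can be assigned directly as a new column. This is only a sketch: it reuses g from above on a two-row subset of the sample frame and replaces the row-wise apply with a list comprehension over the two columns.

```python
import pandas as pd

# Two rows of the sample frame above.
df = pd.DataFrame({
    'test_case_id': [1500, 1501],
    'neighbours_lists': [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]],
})

def g(_id, nbs):
    # '+1' when (nb + 1) divides (_id + 1) evenly, otherwise '-1'.
    return ['-1' if (_id + 1) % (nb + 1) else '+1' for nb in nbs]

# Compute one score list per row and attach it as a column in one step.
df['scores'] = pd.Series(
    [g(i, nbs) for i, nbs in zip(df['test_case_id'], df['neighbours_lists'])],
    index=df.index,
)
```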

Upvotes: 1
