Bagus Trihatmaja

Reputation: 865

How to do an array operation on a cell in Pandas

Basically, my dataframe looks like this:

id   |   refers 
----------------
1    |   [2,3]
2    |   [1,3]
3    |   []

I want to add another column that shows how many times each id is referred to by other ids. For example:

id   |   refers  |  referred_count
----------------------------------
1    |   [2,3]   |   1
2    |   [1,3]   |   1
3    |   []      |   2

My current code looks like this:

citations_dict = {}
for index, row in data_ref.iterrows():
    if len(row['reference_list']) > 0:
        for reference in row['reference_list']:
            if reference not in citations_dict:
                citations_dict[reference] = {}
                d = data_ref.loc[data_ref['id'] == reference]
                citations_dict[reference]['venue'] = d['venue']
                citations_dict[reference]['reference'] = d['reference']
                citations_dict[reference]['citation'] = 1
            else:
                citations_dict[reference]['citation'] += 1

The problem is that this code takes a very long time to run. Is there a faster way to do it, perhaps using pandas?

Upvotes: 3

Views: 444

Answers (3)

Chris Adams

Reputation: 18647

First create a helper Series using numpy.hstack and Series.value_counts.

This will be the values of your column 'referred_count' with id as the index.

Then you can use set_index to make id the index of df, so the helper series aligns on id when it is assigned, and finally call reset_index to get the DataFrame back to its original shape.

import numpy as np
import pandas as pd
s = pd.Series(np.hstack(df['refers'])).value_counts()
df.set_index('id').assign(referred_count=s).reset_index()

[out]

   id  refers  referred_count
0   1  [2, 3]               1
1   2  [1, 3]               1
2   3      []               2
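
One caveat with the approach above: an id that nobody refers to has no entry in the helper series, so it ends up as NaN after the assignment. A minimal sketch of a variation, assuming pandas 0.25 or later (where Series.explode is available) and swapping numpy.hstack for explode; the extra id 4 is made up here purely to show the zero case:

```python
import pandas as pd

# Hypothetical data: id 4 is added for illustration and is never referred to.
df = pd.DataFrame({'id': [1, 2, 3, 4], 'refers': [[2, 3], [1, 3], [], []]})

# One row per referenced id; empty lists turn into NaN, which is dropped.
flat = df['refers'].explode().dropna().astype(int)

# Count how often each id is referred to.
counts = flat.value_counts()

# Map the counts back onto id; ids missing from counts get 0 instead of NaN.
df['referred_count'] = df['id'].map(counts).fillna(0).astype(int)
print(df)
```

Using map on the id column keeps the original row order and index, so no set_index/reset_index round trip is needed.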

Upvotes: 1

Raunaq Jain

Reputation: 917

Data

df = pd.DataFrame({'id': [1,2,3], 'refers': [[1,2,3], [1,3], []]})

Expected output:

    id  refers     referred_count
0   1   [1, 2, 3]   1
1   2   [1, 3]      1
2   3   []          2

Create a dictionary that counts how often each id occurs in refers:

refer_count = df.refers.apply(pd.Series).stack()\
                .reset_index(drop=True)\
                .astype(int)\
                .value_counts()\
                .to_dict()

For each id, subtract the number of times it refers to itself from its refer_count:

df['referred_count'] = df.apply(lambda x: refer_count[x['id']] - x['refers'].count(x['id']), axis = 1)

Output:

    id  refers    referred_count
0   1   [1, 2, 3]  1
1   2   [1, 3]     1
2   3   []         2
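
Note that the dictionary lookup above raises a KeyError for an id that never appears in any refers list. A rough sketch of the same self-reference handling that filters the self-references out before counting, assuming pandas 0.25 or later for DataFrame.explode (the names pairs and counts are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'refers': [[1, 2, 3], [1, 3], []]})

# One row per (id, referenced id) pair; rows from empty lists are dropped.
pairs = df.explode('refers').dropna(subset=['refers'])

# Remove the rows where an id refers to itself.
pairs = pairs[pairs['id'] != pairs['refers']]

# Count the remaining references and map them back; unreferenced ids get 0.
counts = pairs['refers'].astype(int).value_counts()
df['referred_count'] = df['id'].map(counts).fillna(0).astype(int)
print(df)
```

This avoids the row-wise apply, which tends to matter on larger frames.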

Upvotes: 1

Chandu

Reputation: 2129

Count how often each id appears in the refers column, then look up that count for each id when creating the new column.

import pandas as pd
from collections import Counter

df = pd.DataFrame({'id':[1,2,3],'refers':[[2,3],[1,3],[]]})
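# Keep the Counter itself: it returns 0 for ids that are never referred to,
# whereas a plain dict would raise KeyError in the lookup below.
counter = Counter(item for sublist in df['refers'] for item in sublist)
df['refer_counts'] = df['id'].apply(lambda x: counter[x])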

output

   id  refers  refer_counts
0   1  [2, 3]             1
1   2  [1, 3]             1
2   3      []             2

Think it's exactly what you needed!

Upvotes: 0
