Sheron

Reputation: 615

Python column and row interaction in dataframe

Suppose I have this dataframe:

question  user   level 
    1      a       1     
    1      b       2     
    1      a       3     
    2      a       1     
    2      b       2     
    2      a       3     
    2      b       4     
    3      c       1     
    3      b       2     
    3      c       3     
    3      a       4     
    3      b       5

The level column specifies who started the topic and who replied to whom. A user at level 1 asked the question. A user at level 2 replied to the user who asked the question. A user at level 3 replied to the user at level 2, and so on.

I would like to build a new dataframe that describes the communication between users across questions. It should contain three columns: "user source", "user destination" and "reply count". Reply count is the number of times User Destination "directly" replied to User Source. For the data above, the result should be:

    us_source us_dest reply_count
        a        b       2
        a        c       0
        b        a       0
        b        c       0
        c        a       0
        c        b       1

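I tried to build the first two columns with this code:

import numpy as np

# df is the dataframe shown above
idx_cols = ['question']
std_cols = ['user_x', 'user_y']
df1 = df.merge(df, on=idx_cols)  # pair up every two rows of the same question
df2 = df1.loc[df1.user_x != df1.user_y, idx_cols + std_cols].copy()  # drop self-pairs

df2.loc[:, std_cols] = np.sort(df2.loc[:, std_cols])  # sort each pair alphabetically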

Does anyone have suggestions for the third column? Consider a reply from B to A "direct" if and only if B replied at level k to a message of A at level k-1 in the same topic. For example, if A starts a topic (a message at level 1) and B replies to A (a message at level 2), then B directly replied to A. Only replies from level 2 to level 1 should be counted.

Upvotes: 0

Views: 2036

Answers (1)

Trolldejo

Reputation: 466

My suggestion:

I would use a dictionary with 'source-destination' strings as keys and reply counts as values.

Loop over the first dataframe: for each question, store whoever posted the first message as the destination and whoever posted the second message as the source, then increment the counter in the dictionary at key 'source-destination'. For example (I'm not familiar with pandas, so I'll let you adapt the formatting):

from itertools import permutations

reply_counts = {}  # maps 'source-destination' keys to reply counts
users = set()
destination = None  # author of the level-1 message while we wait for the level-2 reply

# `dataframe` is the question's dataframe; rows are assumed ordered by question, then level
for row in dataframe.itertuples(index=False):  # each row is (question, user, level)
    users.add(row[1])  # collect user names
    if row[2] == 1:  # initial message
        destination = row[1]  # store the author as destination
    elif row[2] == 2 and destination is not None:  # direct reply to the initial message
        source = row[1]  # store the replier as source
        key = source + "-" + destination  # build the 'source-destination' key
        reply_counts[key] = reply_counts.get(key, 0) + 1  # create or increment the entry
        destination = None  # reset destination
    else:
        destination = None  # reset destination

# add the source-destination pairs that never interacted
for pair in permutations(users, 2):
    reply_counts.setdefault("-".join(pair), 0)

Then you can convert your dictionary back into a dataframe.
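That last conversion step could look roughly like the sketch below. It is only an illustration under a few assumptions: the sample `reply_counts` values are what the loop above should produce for the question's data, the helper name `counts_to_frame` is made up, and user names are assumed not to contain a hyphen. The keys above put the replier first, while the table in the question lists the user who was replied to first, so the sketch swaps the two parts to match the question's column order.

```python
import pandas as pd

# Assumed contents of reply_counts for the question's sample data,
# with keys in the 'replier-asker' order built by the loop above.
reply_counts = {"b-a": 2, "b-c": 1, "a-b": 0, "a-c": 0, "c-a": 0, "c-b": 0}


def counts_to_frame(counts):
    """Turn {'replier-asker': n} into the three-column layout from the question."""
    rows = []
    for key, count in counts.items():
        # Assumes user names contain no '-' character.
        replier, asker = key.split("-", 1)
        # The question's us_source is the user who was replied to,
        # and us_dest is the user who replied.
        rows.append((asker, replier, count))
    frame = pd.DataFrame(rows, columns=["us_source", "us_dest", "reply_count"])
    return frame.sort_values(["us_source", "us_dest"]).reset_index(drop=True)


result = counts_to_frame(reply_counts)
print(result)
```

If user names can contain hyphens, using `(source, destination)` tuples as dictionary keys instead of joined strings would avoid the split entirely.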

Upvotes: 2
