Python interaction between columns and rows

Question

I have the following dataframe:

      topic  student level week
        1      a       1     1
        1      b       2     1
        1      a       3     1
        2      a       1     2
        2      b       2     2
        2      a       3     2
        2      b       4     2
        3      c       1     2
        3      b       2     2
        3      c       3     2
        3      a       4     2
        3      b       5     2

It contains a column level that specifies who started the topic and who replied to it. If student's level is 1, it means that he asked the question. If student's level is 2, it means that he replied to the student who asked the question. If student's level is 3, it means that he replied to the student whose level is 2 and on and on.

I would like to extract a new dataframe that should present a communication between students through the topic PER WEEK. It should contain five columns: "student source", "student destination", "week", "total topics" and "reply count".

I should get something like:

    st_source st_dest  week  total_topics  reply_count
        a        b       1        1             1
        a        b       2        2             1
        a        c       2        1             0
        b        a       1        1             0
        b        a       2        2             0
        b        c       2        1             0
        c        a       2        1             0
        c        b       2        1             1

Student Destination is a student that each student shared the topic with.

Total topics is a number of shared topics with other students. I found it using the following code:

idx_cols = ['topic', 'week']
std_cols = ['student_x', 'student_y']
d1 = df.merge(df, on=idx_cols)
d2 = d1.loc[d1.student_x != d1.student_y, idx_cols + std_cols]

d2.loc[:, std_cols] = np.sort(d2.loc[:, std_cols])

d3 = d2.drop_duplicates().groupby(
    std_cols + ['week']).size().reset_index(name='count')
d3.columns = ['st_source', 'st_dest', 'week', 'total_topics']

I have difficulties to find the last column "reply count".

Reply count is a number of times in which Student Destination "directly" replied to Student Source. If a topic is started by student A (by sending a message at level 1), B replied to A (sending a message at level 2), so B directly replied to A. Consider "direct" a reply from B to A if and only if B replied al level k to a message of A at level k-1 in the same topic. Only students' replies from level 2 to level1.

Does anyone have some suggestions?

Please let me know if I should explain it better.

Thank you!

Trolldejo · Accepted Answer

Full answer, tested, sorry for the previous version, there were many typos....

import pandas as pd
from itertools import permutations

dataframe = {"topic": [1,1,1,2,2,2,2,3,3,3,3,3],
             "student": ["a","b","a","a","b","a","b","c","b","c","a","b"],
             "level": [1,2,3,1,2,3,4,1,2,3,4,5],
             "week": [1,1,1,2,2,2,2,2,2,2,2,2]
             }
dataframe =  pd.DataFrame.from_dict(dataframe)
dataframe = dataframe.reindex_axis(("topic", "student", "level", "week",), axis = 1)


results = {}  # the dictionary where results is going to be stored
source = False  # a simple boolean to make sure message 2 follows message 1
prev_topic = dataframe.get_value(0,'topic')  # boolean to detect topic change
topic_users = set()  # set containing the curent users of the topic
prev_week = None  # variable to check if week is constant in topic.

# print(dataframe)
for row in dataframe.get_values():  # iterate over the dataframe
    # print(prev_topic)

    if prev_topic == row[0]:  # if we are on the same topic
        # print("same_topic")
        # print(row)
        if row[2] == 1:  # if it is an initial message
            # print("first message")
            source = row[1]  # we store users as source
            topic_users.add(source)  # add the user to the topic's set of users
            week = row[3]  # we store the week

        elif row[2] == 2 and source:  # if this is a second message
            # print("scd")
            destination = row[1]  # store user as destination
            topic_users.add(destination)  # add the user to the topic's set of users
            if week != row[3]:  # if the week differs, we print a message
                print("ERROR: Topic " + str(row[0]) + " extends on several weeks")
                # break  # uncomment the line to exit the for loop if error is met

            key = "-".join((source, destination, str(week)))  # construct a key based on source/destination/week
            if key not in results:  # if the key is new to dictionary
                results[key] = [0, 0]  # create the new entry as a list containing topic_counts, reply_counts

            results[key][1] += 1  # add a counter to the reply_counts
            source = False  # reset destination

        else:
            # print("trololo")
            topic_users.add(row[1])  # add the user to the topic's set of users
            if week != row[3]:  # if the week differs, we print a message
                print("ERROR: Topic " + str(row[0]) + " extends on several weeks")
                # break  # uncomment the line to exit the for loop if error is met

            source = False  # reset destination

    else:  # if we enconter a new topic (and not the first one)
        # print('new topic')
        for pair in permutations(topic_users, 2):
            key = "-".join(pair) + "-" + str(week)  # construct a key based on source/destination/week
            if key not in results:   # if the key is new to dictionary
                results[key] = [1, 0]  # create the new entry as a list containing topic_counts, reply_counts
            else:  # otherwise
                results[key][0] += 1  # add a counter to the topic_counts

        topic_users = set()
        if row[2] == 1:  # if it is an initial message
            # print("first message")
            source = row[1]  # we store users as source
            topic_users.add(source)  # add the user to the topic's set of users
            week = row[3]  # we store the week

    prev_topic = row[0]

# redo the topic count feeding for the last topic (for wich we didn't detect a change of topic)
if len(topic_users) > 0:
    for pair in permutations(topic_users, 2):
        key = "-".join(pair) + "-" + str(week)  # construct a key based on source/destination/week
        if key not in results:   # if the key is new to dictionary
            results[key] = [1, 0]  # create the new entry as a list containing topic_counts, reply_counts
        else:  # otherwise
            results[key][0] += 1  # add a counter to the topic_counts

dico = {'source': [], 'destination': [], 'week': [], 'topic': [], 'reply': []}
for k, v in results.items():
    print(k, v)
    s, d, w = k.split('-')
    dico['source'].append(s)
    dico['destination'].append(d)
    dico['week'].append(w)
    dico['topic'].append(v[0])
    dico['reply'].append(v[1])

df = pd.DataFrame.from_dict(dico)
df = df.reindex_axis(("source", "destination", "week", "topic", "reply"), axis = 1)
print(df)

Python interaction between columns and rows

Answers (2)

Related Questions