Sheron
Sheron

Reputation: 615

Python interaction between columns and rows

I have the following dataframe:

      topic  student level week
        1      a       1     1
        1      b       2     1
        1      a       3     1
        2      a       1     2
        2      b       2     2
        2      a       3     2
        2      b       4     2
        3      c       1     2
        3      b       2     2
        3      c       3     2
        3      a       4     2
        3      b       5     2

It contains a column level that specifies who started the topic and who replied to it. If student's level is 1, it means that he asked the question. If student's level is 2, it means that he replied to the student who asked the question. If student's level is 3, it means that he replied to the student whose level is 2 and on and on.

I would like to extract a new dataframe that should present a communication between students through the topic PER WEEK. It should contain five columns: "student source", "student destination", "week", "total topics" and "reply count".

I should get something like:

    st_source st_dest  week  total_topics  reply_count
        a        b       1        1             1
        a        b       2        2             1
        a        c       2        1             0
        b        a       1        1             0
        b        a       2        2             0
        b        c       2        1             0
        c        a       2        1             0
        c        b       2        1             1

Student Destination is a student that each student shared the topic with.

Total topics is a number of shared topics with other students. I found it using the following code:

idx_cols = ['topic', 'week']
std_cols = ['student_x', 'student_y']
d1 = df.merge(df, on=idx_cols)
d2 = d1.loc[d1.student_x != d1.student_y, idx_cols + std_cols]

d2.loc[:, std_cols] = np.sort(d2.loc[:, std_cols])

d3 = d2.drop_duplicates().groupby(
    std_cols + ['week']).size().reset_index(name='count')
d3.columns = ['st_source', 'st_dest', 'week', 'total_topics']

I have difficulties to find the last column "reply count".

Reply count is a number of times in which Student Destination "directly" replied to Student Source. If a topic is started by student A (by sending a message at level 1), B replied to A (sending a message at level 2), so B directly replied to A. Consider "direct" a reply from B to A if and only if B replied al level k to a message of A at level k-1 in the same topic. Only students' replies from level 2 to level1.

Does anyone have some suggestions?

Please let me know if I should explain it better.

Thank you!

Upvotes: 0

Views: 378

Answers (2)

Trolldejo
Trolldejo

Reputation: 466

Full answer, tested, sorry for the previous version, there were many typos....

import pandas as pd
from itertools import permutations

dataframe = {"topic": [1,1,1,2,2,2,2,3,3,3,3,3],
             "student": ["a","b","a","a","b","a","b","c","b","c","a","b"],
             "level": [1,2,3,1,2,3,4,1,2,3,4,5],
             "week": [1,1,1,2,2,2,2,2,2,2,2,2]
             }
dataframe =  pd.DataFrame.from_dict(dataframe)
dataframe = dataframe.reindex_axis(("topic", "student", "level", "week",), axis = 1)


results = {}  # the dictionary where results is going to be stored
source = False  # a simple boolean to make sure message 2 follows message 1
prev_topic = dataframe.get_value(0,'topic')  # boolean to detect topic change
topic_users = set()  # set containing the curent users of the topic
prev_week = None  # variable to check if week is constant in topic.

# print(dataframe)
for row in dataframe.get_values():  # iterate over the dataframe
    # print(prev_topic)

    if prev_topic == row[0]:  # if we are on the same topic
        # print("same_topic")
        # print(row)
        if row[2] == 1:  # if it is an initial message
            # print("first message")
            source = row[1]  # we store users as source
            topic_users.add(source)  # add the user to the topic's set of users
            week = row[3]  # we store the week

        elif row[2] == 2 and source:  # if this is a second message
            # print("scd")
            destination = row[1]  # store user as destination
            topic_users.add(destination)  # add the user to the topic's set of users
            if week != row[3]:  # if the week differs, we print a message
                print("ERROR: Topic " + str(row[0]) + " extends on several weeks")
                # break  # uncomment the line to exit the for loop if error is met

            key = "-".join((source, destination, str(week)))  # construct a key based on source/destination/week
            if key not in results:  # if the key is new to dictionary
                results[key] = [0, 0]  # create the new entry as a list containing topic_counts, reply_counts

            results[key][1] += 1  # add a counter to the reply_counts
            source = False  # reset destination

        else:
            # print("trololo")
            topic_users.add(row[1])  # add the user to the topic's set of users
            if week != row[3]:  # if the week differs, we print a message
                print("ERROR: Topic " + str(row[0]) + " extends on several weeks")
                # break  # uncomment the line to exit the for loop if error is met

            source = False  # reset destination

    else:  # if we enconter a new topic (and not the first one)
        # print('new topic')
        for pair in permutations(topic_users, 2):
            key = "-".join(pair) + "-" + str(week)  # construct a key based on source/destination/week
            if key not in results:   # if the key is new to dictionary
                results[key] = [1, 0]  # create the new entry as a list containing topic_counts, reply_counts
            else:  # otherwise
                results[key][0] += 1  # add a counter to the topic_counts

        topic_users = set()
        if row[2] == 1:  # if it is an initial message
            # print("first message")
            source = row[1]  # we store users as source
            topic_users.add(source)  # add the user to the topic's set of users
            week = row[3]  # we store the week

    prev_topic = row[0]

# redo the topic count feeding for the last topic (for wich we didn't detect a change of topic)
if len(topic_users) > 0:
    for pair in permutations(topic_users, 2):
        key = "-".join(pair) + "-" + str(week)  # construct a key based on source/destination/week
        if key not in results:   # if the key is new to dictionary
            results[key] = [1, 0]  # create the new entry as a list containing topic_counts, reply_counts
        else:  # otherwise
            results[key][0] += 1  # add a counter to the topic_counts

dico = {'source': [], 'destination': [], 'week': [], 'topic': [], 'reply': []}
for k, v in results.items():
    print(k, v)
    s, d, w = k.split('-')
    dico['source'].append(s)
    dico['destination'].append(d)
    dico['week'].append(w)
    dico['topic'].append(v[0])
    dico['reply'].append(v[1])

df = pd.DataFrame.from_dict(dico)
df = df.reindex_axis(("source", "destination", "week", "topic", "reply"), axis = 1)
print(df)

Upvotes: 1

Trolldejo
Trolldejo

Reputation: 466

My suggestion:

I would use a dictionary containing 'source-destination-week' as keys and (total_topics, reply_counts) as values.

Loop over the first dataframe, for each question, store who posted 1st message as the destination, store who posted 2nd message as source, store the week as week, add a counter in dictionary at key 'source-destination-week'. I note that you no longer require to display students pairs with no interaction, thus I removed it. eg:

from itertools import permutations

results = {}  # the dictionary where results is going to be stored
source = False  # a simple boolean to make sure message 2 follows message 1
prev_topic = None  # boolean to detect topic change
topic_users = set()  # set containing the curent users of the topic
prev_week = None  # variable to check if week is constant in topic.

for row in dataframe:  # iterate over the dataframe

    if prev_topic = row[0]:  # if we are on the same topic

        if row[2] == 1:  # if it is an initial message
            source = row[1]  # we store users as source
            topic_users.add(source)  # add the user to the topic's set of users
            week = row[3]  # we store the week

        elif row[2] == 2 and source:  # if this is a second message
            destination = row[1]  # store user as destination
            topic_users.add(destination)  # add the user to the topic's set of users
            if week != row[3]:  # if the week differs, we print a message
                print "ERROR: Topic " + str(row[0]) + " extends on several weeks"
                # break  # uncomment the line to exit the for loop if error is met

            key = "-".join((source, destination, week))  # construct a key based on source/destination/week
            if key not in results:  # if the key is new to dictionary
                results[key] = [0, 0]  # create the new entry as a list containing topic_counts, reply_counts

            results[key][1] += 1  # add a counter to the reply_counts
            source = False  # reset destination

        else:
            topic_user.add(row[1])  # add the user to the topic's set of users
            if week != row[3]:  # if the week differs, we print a message
                print "ERROR: Topic " + str(row[0]) + " extends on several weeks"
                # break  # uncomment the line to exit the for loop if error is met

            source = False  # reset destination

    elif prev_topic != None:  # if we enconter a new topic (and not the first one)
        for pair in permutations(topic_users, 2):
            key = "-".join(pair) + "-" + week  # construct a key based on source/destination/week
            if key not in results:   # if the key is new to dictionary
                results[key] = [1, 0]  # create the new entry as a list containing topic_counts, reply_counts
            else:  # otherwise
                results[key][0] += 1  # add a counter to the topic_counts

        topic_users = set()

    prev_topic = row[0]

# redo the topic count feeding for the last topic (for wich we didn't detect a change of topic)
if len(topic_users) > 0: 
    for pair in permutations(topic_users, 2):
        key = "-".join(pair) + "-" + week  # construct a key based on source/destination/week
        if key not in results:   # if the key is new to dictionary
            results[key] = [1, 0]  # create the new entry as a list containing topic_counts, reply_counts
        else:  # otherwise
            results[key][0] += 1  # add a counter to the topic_counts

then you can convert your dictionary back into a dataframe. eg:

dico = {'b-a': [0,1], 'b-c' : [1,1], 'a-b': [2,1]}
df = pd.DataFrame.from_dict(dico, orient='index')
df.rename(index="str", columns={0:'topic', 1:'reply'})

I hope I didn't make any typo in the code, couldn't test it yet... At your disposal for any question :)

Upvotes: 1

Related Questions