Reputation: 615
I have the following dataframe:
topic student level week
1 a 1 1
1 b 2 1
1 a 3 1
2 a 1 2
2 b 2 2
2 a 3 2
2 b 4 2
3 c 1 2
3 b 2 2
3 c 3 2
3 a 4 2
3 b 5 2
It contains a column level that specifies who started the topic and who replied to it. If student's level is 1, it means that he asked the question. If student's level is 2, it means that he replied to the student who asked the question. If student's level is 3, it means that he replied to the student whose level is 2 and on and on.
I would like to extract a new dataframe that should present a communication between students through the topic PER WEEK. It should contain five columns: "student source", "student destination", "week", "total topics" and "reply count".
I should get something like:
st_source st_dest week total_topics reply_count
a b 1 1 1
a b 2 2 1
a c 2 1 0
b a 1 1 0
b a 2 2 0
b c 2 1 0
c a 2 1 0
c b 2 1 1
Student Destination is a student that each student shared the topic with.
Total topics is a number of shared topics with other students. I found it using the following code:
idx_cols = ['topic', 'week']
std_cols = ['student_x', 'student_y']
d1 = df.merge(df, on=idx_cols)
d2 = d1.loc[d1.student_x != d1.student_y, idx_cols + std_cols]
d2.loc[:, std_cols] = np.sort(d2.loc[:, std_cols])
d3 = d2.drop_duplicates().groupby(
std_cols + ['week']).size().reset_index(name='count')
d3.columns = ['st_source', 'st_dest', 'week', 'total_topics']
I have difficulties to find the last column "reply count".
Reply count is a number of times in which Student Destination "directly" replied to Student Source. If a topic is started by student A (by sending a message at level 1), B replied to A (sending a message at level 2), so B directly replied to A. Consider "direct" a reply from B to A if and only if B replied al level k to a message of A at level k-1 in the same topic. Only students' replies from level 2 to level1.
Does anyone have some suggestions?
Please let me know if I should explain it better.
Thank you!
Upvotes: 0
Views: 378
Reputation: 466
Full answer, tested, sorry for the previous version, there were many typos....
import pandas as pd
from itertools import permutations
dataframe = {"topic": [1,1,1,2,2,2,2,3,3,3,3,3],
"student": ["a","b","a","a","b","a","b","c","b","c","a","b"],
"level": [1,2,3,1,2,3,4,1,2,3,4,5],
"week": [1,1,1,2,2,2,2,2,2,2,2,2]
}
dataframe = pd.DataFrame.from_dict(dataframe)
dataframe = dataframe.reindex_axis(("topic", "student", "level", "week",), axis = 1)
results = {} # the dictionary where results is going to be stored
source = False # a simple boolean to make sure message 2 follows message 1
prev_topic = dataframe.get_value(0,'topic') # boolean to detect topic change
topic_users = set() # set containing the curent users of the topic
prev_week = None # variable to check if week is constant in topic.
# print(dataframe)
for row in dataframe.get_values(): # iterate over the dataframe
# print(prev_topic)
if prev_topic == row[0]: # if we are on the same topic
# print("same_topic")
# print(row)
if row[2] == 1: # if it is an initial message
# print("first message")
source = row[1] # we store users as source
topic_users.add(source) # add the user to the topic's set of users
week = row[3] # we store the week
elif row[2] == 2 and source: # if this is a second message
# print("scd")
destination = row[1] # store user as destination
topic_users.add(destination) # add the user to the topic's set of users
if week != row[3]: # if the week differs, we print a message
print("ERROR: Topic " + str(row[0]) + " extends on several weeks")
# break # uncomment the line to exit the for loop if error is met
key = "-".join((source, destination, str(week))) # construct a key based on source/destination/week
if key not in results: # if the key is new to dictionary
results[key] = [0, 0] # create the new entry as a list containing topic_counts, reply_counts
results[key][1] += 1 # add a counter to the reply_counts
source = False # reset destination
else:
# print("trololo")
topic_users.add(row[1]) # add the user to the topic's set of users
if week != row[3]: # if the week differs, we print a message
print("ERROR: Topic " + str(row[0]) + " extends on several weeks")
# break # uncomment the line to exit the for loop if error is met
source = False # reset destination
else: # if we enconter a new topic (and not the first one)
# print('new topic')
for pair in permutations(topic_users, 2):
key = "-".join(pair) + "-" + str(week) # construct a key based on source/destination/week
if key not in results: # if the key is new to dictionary
results[key] = [1, 0] # create the new entry as a list containing topic_counts, reply_counts
else: # otherwise
results[key][0] += 1 # add a counter to the topic_counts
topic_users = set()
if row[2] == 1: # if it is an initial message
# print("first message")
source = row[1] # we store users as source
topic_users.add(source) # add the user to the topic's set of users
week = row[3] # we store the week
prev_topic = row[0]
# redo the topic count feeding for the last topic (for wich we didn't detect a change of topic)
if len(topic_users) > 0:
for pair in permutations(topic_users, 2):
key = "-".join(pair) + "-" + str(week) # construct a key based on source/destination/week
if key not in results: # if the key is new to dictionary
results[key] = [1, 0] # create the new entry as a list containing topic_counts, reply_counts
else: # otherwise
results[key][0] += 1 # add a counter to the topic_counts
dico = {'source': [], 'destination': [], 'week': [], 'topic': [], 'reply': []}
for k, v in results.items():
print(k, v)
s, d, w = k.split('-')
dico['source'].append(s)
dico['destination'].append(d)
dico['week'].append(w)
dico['topic'].append(v[0])
dico['reply'].append(v[1])
df = pd.DataFrame.from_dict(dico)
df = df.reindex_axis(("source", "destination", "week", "topic", "reply"), axis = 1)
print(df)
Upvotes: 1
Reputation: 466
My suggestion:
I would use a dictionary containing 'source-destination-week' as keys and (total_topics, reply_counts) as values.
Loop over the first dataframe, for each question, store who posted 1st message as the destination, store who posted 2nd message as source, store the week as week, add a counter in dictionary at key 'source-destination-week'. I note that you no longer require to display students pairs with no interaction, thus I removed it. eg:
from itertools import permutations
results = {} # the dictionary where results is going to be stored
source = False # a simple boolean to make sure message 2 follows message 1
prev_topic = None # boolean to detect topic change
topic_users = set() # set containing the curent users of the topic
prev_week = None # variable to check if week is constant in topic.
for row in dataframe: # iterate over the dataframe
if prev_topic = row[0]: # if we are on the same topic
if row[2] == 1: # if it is an initial message
source = row[1] # we store users as source
topic_users.add(source) # add the user to the topic's set of users
week = row[3] # we store the week
elif row[2] == 2 and source: # if this is a second message
destination = row[1] # store user as destination
topic_users.add(destination) # add the user to the topic's set of users
if week != row[3]: # if the week differs, we print a message
print "ERROR: Topic " + str(row[0]) + " extends on several weeks"
# break # uncomment the line to exit the for loop if error is met
key = "-".join((source, destination, week)) # construct a key based on source/destination/week
if key not in results: # if the key is new to dictionary
results[key] = [0, 0] # create the new entry as a list containing topic_counts, reply_counts
results[key][1] += 1 # add a counter to the reply_counts
source = False # reset destination
else:
topic_user.add(row[1]) # add the user to the topic's set of users
if week != row[3]: # if the week differs, we print a message
print "ERROR: Topic " + str(row[0]) + " extends on several weeks"
# break # uncomment the line to exit the for loop if error is met
source = False # reset destination
elif prev_topic != None: # if we enconter a new topic (and not the first one)
for pair in permutations(topic_users, 2):
key = "-".join(pair) + "-" + week # construct a key based on source/destination/week
if key not in results: # if the key is new to dictionary
results[key] = [1, 0] # create the new entry as a list containing topic_counts, reply_counts
else: # otherwise
results[key][0] += 1 # add a counter to the topic_counts
topic_users = set()
prev_topic = row[0]
# redo the topic count feeding for the last topic (for wich we didn't detect a change of topic)
if len(topic_users) > 0:
for pair in permutations(topic_users, 2):
key = "-".join(pair) + "-" + week # construct a key based on source/destination/week
if key not in results: # if the key is new to dictionary
results[key] = [1, 0] # create the new entry as a list containing topic_counts, reply_counts
else: # otherwise
results[key][0] += 1 # add a counter to the topic_counts
then you can convert your dictionary back into a dataframe. eg:
dico = {'b-a': [0,1], 'b-c' : [1,1], 'a-b': [2,1]}
df = pd.DataFrame.from_dict(dico, orient='index')
df.rename(index="str", columns={0:'topic', 1:'reply'})
I hope I didn't make any typo in the code, couldn't test it yet... At your disposal for any question :)
Upvotes: 1