Reputation: 381
I have a dataframe with 3 columns - A, B and projects. A and B contain the IDs given to each student in a class, and projects is the number of projects the two students did together. The data is at the grain of the (A, B) pair.
A   B   projects
S2  S3  5
S2  S4  2
S5  S2  1
S5  S4  1
Note that a student can appear in either column of the input dataset (S2 is in both column A and column B). Now, I need to find the number of projects each student did with every other student. The dataframe should look like this:
id_  StudentB  projects
S2   S3        5
S2   S5        1
S2   S4        2
S3   S2        5
S4   S2        2
S4   S5        1
S5   S4        1
Now, if I filter column id_ for a particular student ID, all related IDs should be listed in column StudentB.
My solution ('all_student_id' is a distinct list of all possible student IDs):
import pandas as pd

# 'data' is the input dataframe, 'all_student_id' is the distinct list of student IDs
final_df = pd.DataFrame(columns=['id_', 'studentB', 'projects'])
for id_ in all_student_id:
    # keep only the rows in which this student appears in either column
    data_ = data[(data['A'] == id_) | (data['B'] == id_)]
    # gather the partner IDs from both columns
    a = data_[['A', 'projects']].rename(columns={'A': 'studentB'})
    b = data_[['B', 'projects']].rename(columns={'B': 'studentB'})
    a_b_concat = pd.concat([a, b], axis=0)
    # drop the student's own ID, leaving only the partners
    formatted = a_b_concat[a_b_concat['studentB'] != id_]
    temp = pd.DataFrame({'id_': [id_] * formatted.shape[0]})
    temp = pd.concat([temp, formatted.reset_index(drop=True)], axis=1)
    final_df = pd.concat([final_df, temp])
While I am able to do this with a for loop over all the distinct student IDs, I don't believe it is a scalable solution, since the input dataset can be huge (30 million rows).
Any help in optimizing this solution will be appreciated.
Upvotes: 1
Views: 66
Reputation: 51165
The other answers are recommending groupby, but from your desired output I do not agree that is what you are looking for. It seems that you simply want the relationships from B->A to also be included as relationships from A->B. This is a trivial operation, which can be accomplished by stacking a reversed version of columns A and B:
import numpy as np
import pandas as pd

a = df.values                 # each row is [A, B, projects]
b = a[:, [1, 0, 2]].copy()    # the same rows with A and B swapped
d = pd.DataFrame(np.vstack((a, b)), columns=['id_', 'StudentB', 'projects'])
id_ StudentB projects
0 S2 S3 5
1 S2 S4 2
2 S5 S2 1
3 S5 S4 1
4 S3 S2 5
5 S4 S2 2
6 S2 S5 1
7 S4 S5 1
Now you can look up any student using only the id_ column, although I would recommend a pivot here for a better data structure:
lookp = d.pivot(index='id_', columns='StudentB', values='projects')
StudentB S2 S3 S4 S5
id_
S2 NaN 5 2 1
S3 5 NaN NaN NaN
S4 2 NaN NaN 1
S5 1 NaN 1 NaN
This gives you an easy way to look up student relationships, with a NaN representing that two students did not work together on any projects.
>>> lookp.loc['S2', 'S3']
5
>>> lookp.loc['S3', 'S5']
nan
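If a missing pair should read as zero shared projects instead of NaN, the pivot can be filled in, and a whole row gives every partner of one student at once (a small sketch reusing the lookp frame above; filled and partners are just illustrative names):
filled = lookp.fillna(0)              # treat a missing pair as 0 shared projects
partners = lookp.loc['S2'].dropna()   # every student S2 worked with, and the project counts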
Upvotes: 0
Reputation: 2510
Will this work? Let me know if this sample does what you want:
m=pd.DataFrame({'A':("S2","S2","S5","S5",'S2'),'B':("S3","S4","S2","S4",'S5'), 'projects':(5,2,1,1,6)})
This would be somewhat like:
A B projects
0 S2 S3 5
1 S2 S4 2
2 S5 S2 1
3 S5 S4 1
4 S2 S5 6
Now, what I assumed is that whenever, say, S2 and S5 come together, whether in column A or column B, the pair should be counted the same as its reverse. Taking this liberty, I did the following:
import numpy as np
# sort each (A, B) pair so the order of the two IDs no longer matters
f = np.sort(m.iloc[:, 0:2].values)
pd.concat((pd.DataFrame(f), m[['projects']]), axis=1).groupby([0, 1])['projects'].sum()
Output I got was:
0 1
S2 S3 5
S4 2
S5 7
S4 S5 1
It renames the columns to 0 and 1, which can be changed with set_axis. The main point is: is this what you want? That S2 and S5, no matter the order, give their sum as output?
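For example, the pair columns can be given names back before grouping (a small sketch along the same lines; it assumes a pandas version where set_axis returns the relabelled frame, and pairs is just an illustrative name):
pairs = pd.DataFrame(f).set_axis(['A', 'B'], axis=1)   # name the sorted pair columns again
pd.concat((pairs, m[['projects']]), axis=1).groupby(['A', 'B'])['projects'].sum()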
Upvotes: 0
Reputation: 21709
You can do:
# sort the two students within each row - create a new data frame
df1 = df[['A', 'B']].copy()
df1.values.sort(axis=1)            # in-place sort of the underlying array
df1['projects'] = df['projects']
# now groupby the ordered pairs and sum the projects
df1.groupby(['A', 'B'])['projects'].sum().reset_index()
A B projects
0 S2 S3 5
1 S2 S4 2
2 S2 S5 1
3 S4 S5 1
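If the two-directional listing from the question (both S2 -> S3 and S3 -> S2) is needed, the grouped result could then be mirrored, roughly like this (a sketch reusing df1 from above; the variable names are illustrative):
out = df1.groupby(['A', 'B'])['projects'].sum().reset_index()
# swapping the column labels and concatenating yields both directions of every pair
both = pd.concat([out, out.rename(columns={'A': 'B', 'B': 'A'})], ignore_index=True)
both = both.rename(columns={'A': 'id_', 'B': 'StudentB'})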
Upvotes: 1