user3788040

Reputation: 381

Effectively filtering in a pandas Dataframe

I have a dataframe with 3 columns: A, B, and projects. A and B contain the IDs given to each student in a class, and projects is the number of projects the pair did together. The data is at the level of columns A and B.

A    B    projects
S2   S3   5
S2   S4   2
S5   S2   1
S5   S4   1

Note that a student can appear in either column of the input dataset (S2 appears in both column A and column B). Now, I need to find the number of projects done by each student with every other student. The dataframe should look like this:

id_ | StudentB | projects
S2     S3          5
S2     S5          1
S2     S4          2
S3     S2          5
S4     S2          2
S4     S5          1
S5     S4          1

Now, if I filter column id_ for a particular student ID, all related IDs should be listed in column StudentB.

My solution (all_student_id is a list of all distinct IDs):

final_df = pd.DataFrame(columns=['id_', 'studentB', 'projects'])
for id_ in all_student_id:
    # rows where this student appears in either column
    data_ = data[(data['A'] == id_) | (data['B'] == id_)]

    a = data_[['A', 'projects']].rename(columns={'A': 'studentB'})
    b = data_[['B', 'projects']].rename(columns={'B': 'studentB'})

    # stack both columns and drop the student's own ID
    a_b_concat = pd.concat([a, b], axis=0)
    formatted = a_b_concat[a_b_concat['studentB'] != id_]

    temp = pd.DataFrame({'id_': [id_] * formatted.shape[0]})
    temp = pd.concat([temp, formatted.reset_index(drop=True)], axis=1)

    final_df = pd.concat([final_df, temp])

While I am able to do this with a for loop over all the distinct student IDs, I don't believe it is a scalable solution, since the input dataset can be huge (30 million rows).

Any help in optimizing this solution will be appreciated.

Upvotes: 1

Views: 66

Answers (3)

user3483203

Reputation: 51165

The other answers are recommending groupby, but from your desired output I do not agree that is what you are looking for. It seems that you simply want the relationships from B->A to also be included as relationships from A->B. This is a trivial operation, which can be accomplished by stacking a reversed copy of columns A and B:


import numpy as np

a = df.values
b = a[:, [1, 0, 2]].copy()  # swap columns A and B, keep projects

d = pd.DataFrame(np.vstack((a, b)), columns=['id_', 'StudentB', 'projects'])

  id_ StudentB projects
0  S2       S3        5
1  S2       S4        2
2  S5       S2        1
3  S5       S4        1
4  S3       S2        5
5  S4       S2        2
6  S2       S5        1
7  S4       S5        1

Now you can lookup any student using only the id_ column, although I would recommend a pivot here for a better data structure:

lookp = d.pivot(index='id_', columns='StudentB', values='projects')

StudentB   S2   S3   S4   S5
id_
S2        NaN    5    2    1
S3          5  NaN  NaN  NaN
S4          2  NaN  NaN    1
S5          1  NaN    1  NaN

This gives you an easy way to lookup student relationships, with a NaN representing that two students did not work together on any projects.

>>> lookp.loc['S2', 'S3']
5
>>> lookp.loc['S3', 'S5']
nan
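If a 0 is preferable to NaN for pairs that never worked together, the pivot can be filled after building it. A minimal end-to-end sketch of this answer's approach (using the question's sample data, with keyword arguments for pivot, which recent pandas versions require):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['S2', 'S2', 'S5', 'S5'],
                   'B': ['S3', 'S4', 'S2', 'S4'],
                   'projects': [5, 2, 1, 1]})

# stack the original rows with a column-swapped copy
a = df.values
b = a[:, [1, 0, 2]].copy()
d = pd.DataFrame(np.vstack((a, b)), columns=['id_', 'StudentB', 'projects'])
d['projects'] = d['projects'].astype(int)  # vstack produced an object column

# keyword arguments; fill missing pairs with 0 instead of NaN
lookp = d.pivot(index='id_', columns='StudentB', values='projects').fillna(0).astype(int)

print(lookp.loc['S2', 'S3'])  # 5
print(lookp.loc['S3', 'S5'])  # 0
```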

Upvotes: 0

Amit Amola

Reputation: 2510

Will this work?

Let me know if this sample works the way you want:

m=pd.DataFrame({'A':("S2","S2","S5","S5",'S2'),'B':("S3","S4","S2","S4",'S5'), 'projects':(5,2,1,1,6)})

This would be somewhat like:

    A   B  projects
0  S2  S3         5
1  S2  S4         2
2  S5  S2         1
3  S5  S4         1
4  S2  S5         6

Now, what I assumed is that whenever, say, S2 and S5 come together, in either column, they should be counted the same as the reverse order. Taking this liberty, I performed some analysis and got this:

f = np.sort(m.iloc[:, 0:2].values)  # sort each pair so order doesn't matter
pd.concat((pd.DataFrame(f), m[['projects']]), axis=1).groupby([0, 1])['projects'].sum()

Output I got was:

0   1 
S2  S3    5
    S4    2
    S5    7
S4  S5    1

It renames the columns as 0 and 1, which can be changed with set_axis. The main point is: is this how you want it? That S2 and S5, no matter the order, have their sum in the output?
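One way to avoid the 0/1 labels entirely is to name the columns when building the intermediate frame. A small sketch of the same computation (the names studentA/studentB are my own choice, not from the answer):

```python
import numpy as np
import pandas as pd

m = pd.DataFrame({'A': ("S2", "S2", "S5", "S5", 'S2'),
                  'B': ("S3", "S4", "S2", "S4", 'S5'),
                  'projects': (5, 2, 1, 1, 6)})

# sort each pair so (S2, S5) and (S5, S2) land in the same group
f = np.sort(m.iloc[:, 0:2].values)
pairs = pd.DataFrame(f, columns=['studentA', 'studentB'])  # name columns up front

out = (pd.concat((pairs, m[['projects']]), axis=1)
         .groupby(['studentA', 'studentB'])['projects'].sum()
         .reset_index())
print(out)
```

Here the (S2, S5) pair sums to 7 (1 + 6), as in the answer's output, but the result carries readable column names.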

Upvotes: 0

YOLO
YOLO

Reputation: 21709

You can do:

# sort the students - create a new data frame
df1 = df[['A', 'B']].copy()   # .copy() avoids a SettingWithCopyWarning below
df1.values.sort(axis=1)       # sort each row in place so (B, A) becomes (A, B)
df1['projects'] = df['projects']

# now groupby
df1.groupby(['A', 'B'])['projects'].sum().reset_index()

    A   B   projects
0   S2  S3  5
1   S2  S4  2
2   S2  S5  1
3   S4  S5  1
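This grouped result holds each pair only once. If the questioner's two-directional layout is needed (so that filtering on the first column finds every partner of a student), one option is to concatenate a column-swapped copy of the grouped frame. A hedged sketch, not part of the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['S2', 'S2', 'S5', 'S5'],
                   'B': ['S3', 'S4', 'S2', 'S4'],
                   'projects': [5, 2, 1, 1]})

# sort each pair explicitly (avoids relying on an in-place .values sort)
pair = pd.DataFrame(np.sort(df[['A', 'B']].values, axis=1), columns=['A', 'B'])
pair['projects'] = df['projects']
grouped = pair.groupby(['A', 'B'])['projects'].sum().reset_index()

# concatenate a column-swapped copy so every student appears in column A
swapped = grouped.rename(columns={'A': 'B', 'B': 'A'})
both = pd.concat([grouped, swapped], ignore_index=True)[['A', 'B', 'projects']]

# filtering column A for one student now lists all of their partners in B
print(both.loc[both['A'] == 'S2', 'B'].tolist())
```

This keeps the groupby aggregation (so duplicate pairs are summed) while restoring the questioner's ability to look up any student in a single column.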

Upvotes: 1
