Reputation: 427
I have this dataframe called data:
Subjects Professor StudentID
8 Chemistry Jane 999
1 Chemistry Jane 3455
0 Chemistry Joseph 1234
2 History Jane 3455
6 History Smith 323
7 History Smith 999
3 Mathematics Doe 56767
10 Mathematics Einstein 3455
5 Physics Einstein 2834
9 Physics Smith 323
4 Physics Smith 999
I want to run this query "Professors with at least 2 classes with 2 or more of the same students". Desired Output
Smith: Physics, History, 323, 999
I am familiar with SQL and could have done this easily, but I am still beginner in Python. How to achieve this output in Python? Another line of thought is to convert this dataframe into SQL database and have a SQL interface through python to run queries. Is there a way to accomplish that?
Upvotes: 1
Views: 114
Reputation: 862791
Solution with filter
and value_counts
:
df1 = df.groupby('Professor').filter(lambda x: (len(x.Subjects) > 1) &
((x.StudentID.value_counts() > 1).sum() > 1))
print (df1)
Subjects Professor StudentID
6 History Smith 323
7 History Smith 999
9 Physics Smith 323
4 Physics Smith 999
and with duplicated
:
df1 = df.groupby('Professor').filter(lambda x: (len(x.Subjects) > 1) &
(x.StudentID.duplicated().sum() > 1))
print (df1)
Subjects Professor StudentID
6 History Smith 323
7 History Smith 999
9 Physics Smith 323
4 Physics Smith 999
EDIT by comment:
You can return custom output from custom function and then remove NaN
rows by Series.dropna
:
df.StudentID = df.StudentID.astype(str)
def f(x):
if (len(x.Subjects) > 1) & (x.StudentID.duplicated().sum() > 1):
return ', '.join((x.Subjects.unique().tolist() + x.StudentID.unique().tolist()))
df1 = df.groupby('Professor').apply(f).dropna()
df1 = df1.index.to_series() + ': ' + df1
print (df1)
Professor
Smith Smith: History, Physics, 323, 999
dtype: object
Upvotes: 1
Reputation: 294338
students_and_subjects = df.groupby(
['Professor', 'Subjects']
).StudentID.nunique().ge(2) \
.groupby(level='Professor').sum().ge(2)
df[df.Professor.map(students_and_subjects)]
Upvotes: 2