Reputation: 153
This is more of a design question with Python.
I need to parse and extract data from several log files into pandas DataFrames. From these DataFrames I need to create reports (as CSV, Excel and so on).
One way to design this is to create a file with two functions: 1. a function that extracts data from a log file (regex is fine), 2. a function that runs a pandas query, something like this:
def get_top1000(group):
    # keep the 1000 rows with the most births within each group
    return group.sort_values(by='births', ascending=False)[:1000]

grouped = names.groupby(['year', 'sex'])
top1000 = grouped.apply(get_top1000)
Then my class could collect all these queries and produce the reports from them. How can this be implemented correctly in Python?
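For context, here is a rough sketch of the design I have in mind; the log pattern, column names and output paths are only placeholders:

import re
import pandas as pd

LINE_RE = re.compile(r'(?P<timestamp>\S+) (?P<level>\w+) (?P<message>.*)')  # placeholder pattern

def parse_log(path):
    # one row per log line that matches the pattern
    rows = []
    with open(path) as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m:
                rows.append(m.groupdict())
    return pd.DataFrame(rows)

def count_by_level(df):
    # example query: number of messages per log level
    return df.groupby('level').size().reset_index(name='count')

class ReportBuilder:
    def __init__(self, queries):
        self.queries = queries  # list of (name, query function) pairs

    def run(self, df, out_dir='.'):
        for name, query in self.queries:
            query(df).to_csv('{}/{}.csv'.format(out_dir, name), index=False)

builder = ReportBuilder([('levels', count_by_level)])
builder.run(parse_log('app.log'))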
Upvotes: 2
Views: 1593
Reputation: 863166
You need SeriesGroupBy.nlargest:
df = names.groupby(['year', 'sex'])['births'].nlargest(1000)
Sample:
import pandas as pd

names = pd.DataFrame({'year': [2000, 2000, 2000, 2000, 2000],
                      'sex': ['M', 'M', 'F', 'F', 'F'],
                      'births': [7, 8, 9, 1, 2]})
print (names)
   births sex  year
0       7   M  2000
1       8   M  2000
2       9   F  2000
3       1   F  2000
4       2   F  2000
df = names.groupby(['year', 'sex'])['births'] \
          .nlargest(1) \
          .reset_index(level=2, drop=True) \
          .reset_index()
print (df)
   year sex  births
0  2000   F       9
1  2000   M       8
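The two reset_index calls only turn the MultiIndex (year, sex and the original row label) back into ordinary columns. As a side note, the same top rows can also be selected, keeping the original index and all columns, by sorting once and then taking the head of each group:

df = names.sort_values('births', ascending=False) \
          .groupby(['year', 'sex']) \
          .head(1)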
If there are other columns in your data, first call set_index with these columns:
names = pd.DataFrame({'year': [2000, 2000, 2000, 2000, 2000],
                      'sex': ['M', 'M', 'F', 'F', 'F'],
                      'births': [7, 8, 9, 1, 2],
                      'val': [3, 2, 4, 5, 6]})
print (names)
   births sex  val  year
0       7   M    3  2000
1       8   M    2  2000
2       9   F    4  2000
3       1   F    5  2000
4       2   F    6  2000
df = names.set_index('val') \
          .groupby(['year', 'sex'])['births'] \
          .nlargest(1) \
          .reset_index()
print (df)
   year sex  val  births
0  2000   F    4       9
1  2000   M    2       8
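And since you need CSV and Excel reports, the resulting DataFrame can be written out directly; the file names here are only examples:

df.to_csv('top_births.csv', index=False)
df.to_excel('top_births.xlsx', index=False)  # needs an Excel writer such as openpyxl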
Upvotes: 2