r squared
r squared

Reputation: 21

Trying to find the maximum value of EACH group in my dictionary in Python

I've got a csv with 10+ columns and it's grouped based on an index number. For example,

index othercolumn othercolumn2 sample hits othercolumn3
1     cccc        bbbb         dog    4    aaaa   
1     cccc        bbbb         cat    1    aaaa   
1     cccc        bbbb         cat    2    aaaa   

2     cccc        bbbb         rat    1    aaaa   
2     cccc        bbbb         dog    1    aaaa   

3     cccc        bbbb         bird   1    aaaa   
3     cccc        bbbb         rat    42   aaaa   
3     cccc        bbbb         cat    3    aaaa  

Is it possible to find the maximum number of hits for each "group" (by index)? I'm not too sure what to do in a situation where there is no highest hit, like sample 2, but that's not too important for now. For example, the desired output would be something like,

For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.       
For index 3, the highest hits are 42 for sample rat.

So far, I've used defaultdict to create a dictionary of lists for each group, or index. But I can't seem to get the highest number of hits and cleanly print that out. So far, this is what I have.

from collections import defaultdict
import csv

groups = defaultdict(list)

with open('data.csv') as inputfile:
    reader = csv.reader(inputfile)
    next(reader, None)  # skip the header row

    for row in reader:
        groups[row[1]].append([row[17], row[18]]) #row 1 is index, row 17 is my sample column, 18 is the hits column
        
print(groups)

Would appreciate any help!!

Upvotes: 1

Views: 295

Answers (2)

codezero
codezero

Reputation: 101

You can use Pandas 'group_by' method along with the max function to calculate the required output.

Hope the following code will help you.

import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
c =np.array(data.groupby("index").max())
for i in range(len(c)):
    print('For index '+str(i+1)+', the highest hits are '+str(c[i][3])+' for sample '+c[i][2]+'.')

output:

For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.
For index 3, the highest hits are 42 for sample rat.

Upvotes: 1

Gustavo Zantut
Gustavo Zantut

Reputation: 751

Hope it helps you, not that beautiful bot works!

import pandas as pd
# assign data of lists.
data = {'idx': ['1', '1', '1', '2',"2","2","2","3","3","3","8","8"],'sample': ['dog', 'cat', 'dog', 'cat',"cat","dog","fish","dog","fish","ostrich","dog","cat"], 'hits': [1, 2, 3, 1,1,2,2,2,3,42,0,55]}
# Create DataFrame.
df = pd.DataFrame(data)
df["key"] = df["idx"].astype(str)+df["hits"].astype(str)
# Print the output.
df.head(10)

idx     sample  hits    key
0   1   dog     1   11
1   1   cat     2   12
2   1   dog     3   13
3   2   cat     1   21
4   2   cat     1   21
5   2   dog     2   22
6   2   fish    2   22
7   3   dog     2   32
8   3   fish    3   33
9   3   ostrich     42  342


max_idx = df.groupby(["idx"]).max("hits")
max_idx = pd.DataFrame(max_idx)
max_idx
    hits
idx     
1   3
2   2
3   42
8   55

max_idx.reset_index(level=0, inplace=True)
max_idx["key"] = max_idx["idx"].astype(str)+max_idx["hits"].astype(str)
df_max = df.loc[(df["hits"].isin(max_idx["hits"])) & (df["idx"].isin(max_idx["idx"]))& (df["key"].isin(max_idx["key"]))]
df_max
    idx     sample  hits    key
2   1   dog     3   13
5   2   dog     2   22
6   2   fish    2   22
9   3   ostrich     42  342
11  8   cat     55  855

for i, j, k in zip(df_max["idx"],df_max["hits"],df_max["sample"]):
    print("For index ", i," the highest hits are ", j," for sample", k,"")
For index  1  the highest hits are  3  for sample dog 
For index  2  the highest hits are  2  for sample dog 
For index  2  the highest hits are  2  for sample fish 
For index  3  the highest hits are  42  for sample ostrich 
For index  8  the highest hits are  55  for sample cat 

Upvotes: 0

Related Questions