Reputation: 21
I've got a csv with 10+ columns and it's grouped based on an index number. For example,
index othercolumn othercolumn2 sample hits othercolumn3
1 cccc bbbb dog 4 aaaa
1 cccc bbbb cat 1 aaaa
1 cccc bbbb cat 2 aaaa
2 cccc bbbb rat 1 aaaa
2 cccc bbbb dog 1 aaaa
3 cccc bbbb bird 1 aaaa
3 cccc bbbb rat 42 aaaa
3 cccc bbbb cat 3 aaaa
Is it possible to find the maximum number of hits for each "group" (by index)? I'm not too sure what to do in a situation where there is no highest hit, like sample 2, but that's not too important for now. For example, the desired output would be something like,
For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.
For index 3, the highest hits are 42 for sample rat.
So far, I've used defaultdict to create a dictionary of lists for each group, or index. But I can't seem to get the highest number of hits and cleanly print that out. So far, this is what I have.
from collections import defaultdict
import csv
groups = defaultdict(list)
with open('data.csv') as inputfile:
reader = csv.reader(inputfile)
next(reader, None) # skip the header row
for row in reader:
groups[row[1]].append([row[17], row[18]]) #row 1 is index, row 17 is my sample column, 18 is the hits column
print(groups)
Would appreciate any help!!
Upvotes: 1
Views: 295
Reputation: 101
You can use Pandas 'group_by' method along with the max function to calculate the required output.
Hope the following code will help you.
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
c =np.array(data.groupby("index").max())
for i in range(len(c)):
print('For index '+str(i+1)+', the highest hits are '+str(c[i][3])+' for sample '+c[i][2]+'.')
output:
For index 1, the highest hits are 4 for sample dog.
For index 2, the highest hits are 1 for sample rat.
For index 3, the highest hits are 42 for sample rat.
Upvotes: 1
Reputation: 751
Hope it helps you, not that beautiful bot works!
import pandas as pd
# assign data of lists.
data = {'idx': ['1', '1', '1', '2',"2","2","2","3","3","3","8","8"],'sample': ['dog', 'cat', 'dog', 'cat',"cat","dog","fish","dog","fish","ostrich","dog","cat"], 'hits': [1, 2, 3, 1,1,2,2,2,3,42,0,55]}
# Create DataFrame.
df = pd.DataFrame(data)
df["key"] = df["idx"].astype(str)+df["hits"].astype(str)
# Print the output.
df.head(10)
idx sample hits key
0 1 dog 1 11
1 1 cat 2 12
2 1 dog 3 13
3 2 cat 1 21
4 2 cat 1 21
5 2 dog 2 22
6 2 fish 2 22
7 3 dog 2 32
8 3 fish 3 33
9 3 ostrich 42 342
max_idx = df.groupby(["idx"]).max("hits")
max_idx = pd.DataFrame(max_idx)
max_idx
hits
idx
1 3
2 2
3 42
8 55
max_idx.reset_index(level=0, inplace=True)
max_idx["key"] = max_idx["idx"].astype(str)+max_idx["hits"].astype(str)
df_max = df.loc[(df["hits"].isin(max_idx["hits"])) & (df["idx"].isin(max_idx["idx"]))& (df["key"].isin(max_idx["key"]))]
df_max
idx sample hits key
2 1 dog 3 13
5 2 dog 2 22
6 2 fish 2 22
9 3 ostrich 42 342
11 8 cat 55 855
for i, j, k in zip(df_max["idx"],df_max["hits"],df_max["sample"]):
print("For index ", i," the highest hits are ", j," for sample", k,"")
For index 1 the highest hits are 3 for sample dog
For index 2 the highest hits are 2 for sample dog
For index 2 the highest hits are 2 for sample fish
For index 3 the highest hits are 42 for sample ostrich
For index 8 the highest hits are 55 for sample cat
Upvotes: 0