Nagri
Nagri

Reputation: 3136

query a csv using pandas

I have a csv file like this:

year    month   Company A   Company B   Company C
1990     Jan    10  15  20
1990    Feb 11  14  21
1990    mar 13  8   23
1990    april   12  22  19
1990    may 15  12  18
1990    june    18  13  13
1990    june    12  14  15
1990    july    12  14  16
1991     Jan    11  16  13
1991    Feb 14  17  11
1991    mar 23  13  12
1991    april   23  21  10
1991    may 22  22  9
1991    june    24  20  32
1991    june    12  14  15
1991    july    21  14  16
1992     Jan    10  13  26
1992    Feb 9   11  19
1992    mar 23  12  18
1992    april   12  10  21
1992    may 17  9   10
1992    june    15  42  9
1992    june    16  9   26
1992    july    15  26  19
1993     Jan    18  19  20
1993    Feb 19  18  21
1993    mar 20  21  23
1993    april   21  10  19
1993    may 13  9   14
1993    june    14  23  23
1993    june    15  21  23
1993    july    16  10  22

I want to find out for each company the month and year where they had the highest number of sale for ex: for company A in year 1990 they had highest sale of 18. I want to do this using pandas. but to understand how to proceed with this. pointers needed please.

ps: here is what I have done till now.

import pandas as pd
df = pd.read_csv('SAMPLE.csv')
num_of_rows = len(df.index)
years_list = []
months_list = []
company_list = df.columns[2:]
for each in df.columns[2:]:
    each = []

for i in range(0,num_of_rows):
    years_list.append(df[df.columns[0]][i])
    months_list.append(df[df.columns[1]][i])
years_list = list(set(years_list))
months_list = list(set(months_list))
for each in years_list:
    for c in company_list:
        print df[(df.year == each)][c].max()

I am getting the biggest number for a year for a company but how to get the month and year also I dont know.

Upvotes: 0

Views: 4474

Answers (1)

EdChum
EdChum

Reputation: 394041

Use a combination of idxmax() and loc to filter the dataframe:

In [36]:

import pandas as pd
import io
temp = """year month Company_A Company_B Company_C
1990 Jan 10 15 20
1990 Feb 11 14 21
1990 mar 13 8 23
1990 april 12 22 19
1990 may 15 12 18
1990 june 18 13 13
1990 june 12 14 15
1990 july 12 14 16
1991 Jan 11 16 13
1991 Feb 14 17 11
1991 mar 23 13 12
1991 april 23 21 10
1991 may 22 22 9
1991 june 24 20 32
1991 june 12 14 15
1991 july 21 14 16
1992 Jan 10 13 26
1992 Feb 9 11 19
1992 mar 23 12 18
1992 april 12 10 21
1992 may 17 9 10
1992 june 15 42 9
1992 june 16 9 26
1992 july 15 26 19
1993 Jan 18 19 20
1993 Feb 19 18 21
1993 mar 20 21 23
1993 april 21 10 19
1993 may 13 9 14
1993 june 14 23 23
1993 june 15 21 23
1993 july 16 10 22"""
df = pd.read_csv(io.StringIO(temp),sep='\s+')
# the call to .unique() is because the same row for A and C appears twice
df.loc[df[['Company_A', 'Company_B', 'Company_C']].idxmax().unique()]

Out[36]:
    year month  Company_A  Company_B  Company_C
13  1991  june         24         20         32
21  1992  june         15         42          9

Upvotes: 1

Related Questions