How to get median line-by-line?

Question

I can use groupby to get the overall medians for a document, e.g. "print(df.groupby(['Key']).median())". But I want to learn the appropriate way to do it line by line, checking on each row whether the group key has changed. Below is one approach that is very clunky and un-Pythonic.

csv:

    A,1
    A,2
    A,3
    A,4
    A,5
    A,6
    A,7
    B,8
    B,9
    B,10
    B,11
    B,12
    B,13
    B,14
    B,15
    B,16
    B,17

import pandas as pd
import numpy as np
import statistics
df = pd.read_csv(r"C:\Users\mmcgown\Downloads\PythonMedianTest.csv",names=['Key','Values'])
rows = len(df.iloc[:,0])
i=0
med=[]
while i < rows:
    if i == 0 or df.iloc[(i-1,0)]==df.iloc[(i,0)]:
        med.append(df.iloc[i,1])
        if i==(rows-1):
            print(f"The median of {df.iloc[(i,0)]} is {statistics.median(med)}")
    elif df.iloc[(i-1,0)]!=df.iloc[(i,0)]:
        print(f"The median of {df.iloc[(i-1,0)]} is {statistics.median(med)}")
        med = []
    i += 1

Output:

The median of A is 4
The median of B is 13

I get the same thing as groupby, save for what looks like some rounding error. But I want to do it in the most concise, Pythonic way, probably using a list comprehension.

jottbe · Accepted Answer

A proposal for a more pythonic version could look like this:

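med = []
rows, cols = df.shape
last_group = None
group_field = 'Key'
med_field = 'Values'
for i, row in df.iterrows():
    if last_group is None or last_group == row[group_field]:
        # first row, or same group as the previous row: collect the value
        med.append(row[med_field])
    else:
        # the group changed: report the finished group, then start the new
        # group with the current row's value so that it is not lost
        print(f"The median of {last_group} is {statistics.median(med)}")
        med = [row[med_field]]
    last_group = row[group_field]
# the last group is never followed by a key change, so report it here
if med:
    print(f"The median of {last_group} is {statistics.median(med)}")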

I tried to avoid the iloc calls with positional indexes, which are not so easy to read. To be honest, at first I didn't understand what you were comparing. You also don't need the elif in your case. A plain else is enough, because the elif condition is just the negation of part of the if condition. Then I noticed a difference between the median your version computes and the one mine computes. If I am not mistaken, you throw away the very first value for B, right? When the key changes, your elif branch prints and resets med, but it never appends the value of the current row.
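A quick way to check that suspicion is to compute the median of the B values from the CSV in the question twice, once with and once without the first value. This is only a small sketch; the list b_values is typed in by hand from the sample data and the variable names are made up for illustration.

```python
import statistics

# B values copied by hand from the sample CSV in the question
b_values = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

# median over all ten values (what groupby reports for B)
median_all = statistics.median(b_values)

# median when the first B value is dropped (what the while loop computes)
median_without_first = statistics.median(b_values[1:])

print(median_all)            # 12.5
print(median_without_first)  # 13
```

So the 13 in the question's output would be explained by the dropped value rather than by rounding.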

And if you want to get the number of rows of a DataFrame, you could use:

rows, cols = df.shape

instead of calling len. I think that makes it more obvious to the reader what the code does.
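If the goal is mainly conciseness, the same "collect values until the key changes" logic could also be sketched with itertools.groupby from the standard library, which groups consecutive items that share a key. This is not the approach from the code above, just an alternative under the assumption that the data is available as (key, value) pairs. The function name run_medians and the hand-built pairs list are hypothetical and only stand in for the CSV from the question.

```python
import statistics
from itertools import groupby
from operator import itemgetter


def run_medians(pairs):
    """Yield (key, median) for each run of consecutive pairs with the same key."""
    for key, run in groupby(pairs, key=itemgetter(0)):
        # run iterates over the (key, value) pairs of the current run only
        yield key, statistics.median(value for _, value in run)


# hand-built stand-in for the sample CSV: A,1 .. A,7 and B,8 .. B,17
pairs = [("A", v) for v in range(1, 8)] + [("B", v) for v in range(8, 18)]

results = list(run_medians(pairs))
for key, med in results:
    print(f"The median of {key} is {med}")
```

With the DataFrame from the question, the pairs could be produced with zip(df['Key'], df['Values']). Note that itertools.groupby starts a new group whenever the key changes, so a key that reappears later in the file would be reported a second time, just like in the loop versions.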
