user3241376
user3241376

Reputation: 417

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:

A       B

1900    10
1901    2
1903    5
1908    8
1910    25
1925    3
1926    4
1928    1
1950    10

etc, about 15000 lines.

When making a distribution diagram based on this data, it's too many points on an axe, not very pretty. I want to group rows by blocks of 25 years, so that at the end I would have less points at the axe. So, for example, from 1900 till 1925 I would have a sum of produced items, 1 row in A column and 1 row in B column:

1925  53
1950  15

So far I only figured how to convert the data in csv file to int:

o=open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)


def int_wrapper(mydata):
    for v in reader:
        yield map(int, v)

reader = int_wrapper(mydata) 

Can't find how to do it further...

Upvotes: 1

Views: 214

Answers (3)

Mike
Mike

Reputation: 7203

You could create a dummy column and group by it after doing some integer division:

df['temp'] = df['A'] // 25
>>> df
      A   B  temp
0  1900  10    76
1  1901   2    76
2  1903   5    76
3  1908   8    76
4  1910  25    76
5  1925   3    77
6  1926   4    77
7  1928   1    77
8  1950  10    78

>>> df.groupby('temp').sum()
         A   B
temp          
76    9522  50
77    5779   8
78    1950  10

My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.

Upvotes: 0

kundan
kundan

Reputation: 1288

Here is my approach . Its definitely not the most engaging python code, but could be a way to achieve the desired output.

if __name__ == '__main__':   

    o=open('dates_dist.csv', 'rU')
    lines = o.read().split("\n") # Create a list having each line of the file

    out_dict = {}
    curr_date  = 0;
    curr_count = 0
    chunk_sz   =  25; #years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
        curr_count = 0

        # Iterate over each line of the file
        for line in lines:
            # Split at comma to get the year and the count. 
            # line_split[0] will be the year and line_split[1] will be the count.
            line_split = line.split(",")
            curr_year = int(line_split[0])
            time_delta = curr_year-start_year

            if time_delta<chunk_sz or time_delta == chunk_sz:
                curr_count = curr_count + int(line_split[1])
            else:
                out_dict[start_year+chunk_sz] = curr_count
                start_year = start_year+chunk_sz
                curr_count = int(line_split[1])

            #print curr_year , curr_count    

        out_dict[start_year+chunk_sz] = curr_count
    print out_dict        

Upvotes: 0

unutbu
unutbu

Reputation: 879461

You could use itertools.groupby:

import itertools as IT
import csv

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)


with open('data', 'rU') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
        year = key*25
        total = sum(row[1] for row in group)
        print(year, total)

yields

(1900, 10)
(1925, 43)
(1950, 15)

Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].


The expression row[0]//25 takes the year and integer divides by 25. This number will be the same for all numbers in the range [1900, 1925). To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.

Upvotes: 3

Related Questions