Peter S

Reputation: 575

Sorting a multi-column .csv file in Python

My data looks like this:

549  648.077  0.01
552  648.141  0.45
554  647.167  0.1
572  648.141  0.3
530  630.213  0.69
560  670.312  0.70

There are a few thousand lines in the file.

The 1st column values range from 0-1100.

The 2nd column values range from 600-700.

The 3rd column values range from 0-1.

I need to plot the data and therefore need to sort and modify it:

I need to split the 3rd column values (normal range 0.0-1.0) into segments 0.0-0.20, 0.21-0.40, 0.41-0.60, 0.61-0.80, 0.81-1.00.

Next I need to split the 1st column values (normal range 0-1100) into segments like 0-10, 11-20, 21-30 and so on up to 1100. What I want to do is find all 2nd column values within a region, e.g. 0.0-0.20 and 0-10, then 0.0-0.20 and 11-20, then 0.0-0.20 and 21-30, and so on.

When found, I want to add them all together and divide by the number of appearances to get the mean value: so for the region 0.0-0.20 and 0-10 I want one value. I'm fairly new to Python and I think this is some kind of approach:

import csv

data_list = []
with open("table.csv") as csv_file:
    data_reader = csv.reader(csv_file, dialect='excel-tab')
    for row in data_reader:
        # keep only rows whose 3rd column is in the 0.0-0.20 segment
        if float(row[2]) <= 0.20:
            if float(row[0]) <= 10:
                print(row)
            if 10 < float(row[0]) <= 20:
                print(row)

That should work (without the print, of course) to get the values, repeated then for if 20 < float(row[0]) <= 30: and so on. That should bring me the values I want, but is there an easy way to set up a range going from 0 to 1100 in 10-unit steps?
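To make it concrete, this is roughly the loop I imagine instead of writing an if-branch per 10-unit segment (my own untested guess; a value of exactly 0 would slip through here):

# my untested guess: generate the segment boundaries with range()
# instead of writing an if-branch for every 10-unit segment by hand
value = float(row[0])
for upper in range(10, 1101, 10):
    if upper - 10 < value <= upper:
        print(row)
        break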

P.S.: I'm aware that I've given a lot of info for a relatively short question; that's because I don't really know whether Python is the best way to do this, or whether my approach is reasonable. Maybe I should go with pandas, but I only just installed it. So if anyone knows an easier (maybe not even coding-related) way to solve a problem like this, I'd really appreciate it.

Upvotes: 1

Views: 79

Answers (1)

Ethan Furman

Reputation: 69041

You need a sorting key to put your data in the correct order; then you can cycle through it to group it and get your sums. itertools.groupby is excellent for this type of thing:

import csv
from itertools import groupby

data_list = []
with open("table.csv") as csv_file:
    data_reader = csv.reader(csv_file, dialect='excel-tab')
    for row in data_reader:
        # read and convert data
        data_list.append((int(row[0]), float(row[1]), float(row[2])))

def classify(row):
    # returns the (third-column segment, first-column segment) a row falls in
    # i.e. (549, 648.077, 0.01) -> (0.2, 550)
    #      (572, 648.141, 0.3)  -> (0.4, 580)
    first, second, third = row
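    # each loop below finds the smallest segment boundary the value fits
    # under; the for-else falls back to the top boundary when none matches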
    for level3 in (0.2, 0.4, 0.6, 0.8):
        if third <= level3:
            break
    else:
        level3 = 1.0
    for level1 in range(10, 1100, 10):
        if first <= level1:
            break
    else:
        level1 = 1100
    return level3, level1

data_list.sort(key=classify)

plot_values = []

for key, group in groupby(data_list, classify):
    group = list(group)
    number_of_elements = len(group)
    # mean of the 2nd column values within this segment pair
    mean = sum(r[1] for r in group) / number_of_elements
    plot_values.append((key, mean))

print('\n'.join(['%-10s -> %s' % (k, m) for k, m in plot_values]))

When run against the sample data you provided:

(0.2, 550) -> 648.077
(0.2, 560) -> 647.167
(0.4, 580) -> 648.141
(0.6, 560) -> 648.141
(0.8, 530) -> 630.213
(0.8, 560) -> 670.312
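Since you mention pandas in your postscript: the same binning and averaging can be done there with pd.cut plus groupby. A minimal sketch, assuming the same tab-separated table.csv (the column names are made up for illustration):

import pandas as pd

df = pd.read_csv('table.csv', sep='\t', header=None,
                 names=['first', 'second', 'third'])

# bin the 3rd column into five 0.2-wide segments and the 1st column
# into 10-unit segments, then average the 2nd column per segment pair
df['third_bin'] = pd.cut(df['third'], bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                         include_lowest=True)
df['first_bin'] = pd.cut(df['first'], bins=list(range(0, 1101, 10)),
                         include_lowest=True)
plot_values = (df.groupby(['third_bin', 'first_bin'], observed=True)['second']
                 .mean())
print(plot_values)

Each entry of plot_values is then the mean of the 2nd column for one (third-segment, first-segment) pair, ready for plotting.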

Upvotes: 1
