thaking

Reputation: 3635

Python "input data"

I have a *.data file that contains data in this format:

2.5,10,U1
3,4.5,U1
3,9,U1
3.5,5.5,U1
3.5,8,U1
4,7.5,U1
4.5,3.5,U1
4.5,4.5,U1
4.5,6,U1
5,5,U1
5,7,U1
7,6.5,U1
3.5,9.5,U2
3.5,10.5,U2
4.5,8,U2
4.5,10.5,U2
5,9,U2
5.5,5.5,U2
5.5,7.5,U2

This example has two classes, U1 and U2, with two values per row (my real data has other types and more classes; this is just an example). I need to read the data and separate it by class, in this case into U1 and U2. Then, for every class, I need to put 2/3 of its rows into one variable (learning_set) and the remaining 1/3 into another (test_set).

I started with this code:

data = open('set.data', 'rt')                             
data_list=[]                                                   
border=2./3                                                  
data_list = [line.strip().split(',') for line in data]

learning_set=data_list[:int(round(len(data_list)*border))]
test_set=data_list[int(round(len(data_list)*border)):]

But this takes 2/3 and 1/3 of the whole data set, not of each class.

Many thanks for your help.

Upvotes: 0

Views: 649

Answers (5)

Rob Cowie

Reputation: 22619

For what it's worth (and because I've typed it out already), I'd accomplish this with something like...

from itertools import groupby
from operator import attrgetter
from collections import namedtuple

row_container = namedtuple('row', 'val1,val2,klass')

def process_row(row):
    """Return a named tuple"""
    return row_container(float(row[0]), float(row[1]), row[2])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]


data = open('test.csv', 'rt')

## Parse & process each line
data = (row.strip().split(',') for row in data)
data = (process_row(row) for row in data)

## Sort & group the data by class
sorted_data = sorted(data, key=attrgetter('klass'))
grouped_data = groupby(sorted_data, attrgetter('klass'))

## For each class, create learning and test sets
final_data = {}
for klass, class_rows in grouped_data:
    learning_set, test_set = bisect_list(list(class_rows), 0.66)
    final_data[klass] = dict(learning=learning_set, test=test_set)

The method is similar to the other answers already provided, but uses namedtuple. bisect_list() is lifted from @senderle's answer.
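To make the shape of final_data concrete, here is a minimal, self-contained sketch of the same sort, group and split steps, run on six rows copied from the question instead of a file. The names Row, sample_lines and parse are illustrative and not part of the answer above.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Hypothetical names for illustration: Row, sample_lines, parse
Row = namedtuple('Row', 'val1 val2 klass')

# Six rows copied from the question, standing in for the file contents
sample_lines = ['2.5,10,U1', '3,4.5,U1', '3,9,U1',
                '3.5,9.5,U2', '3.5,10.5,U2', '4.5,8,U2']

def parse(line):
    """Turn 'x,y,label' into a Row with float coordinates."""
    x, y, label = line.strip().split(',')
    return Row(float(x), float(y), label)

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]

# groupby only merges adjacent rows, so sort by class first
rows = sorted((parse(line) for line in sample_lines), key=attrgetter('klass'))

final_data = {}
for klass, class_rows in groupby(rows, attrgetter('klass')):
    learning, test = bisect_list(list(class_rows), 0.66)
    final_data[klass] = dict(learning=learning, test=test)

# Each class maps to its own learning/test lists of Row tuples
for klass in sorted(final_data):
    print(klass, len(final_data[klass]['learning']), len(final_data[klass]['test']))
```

Individual fields are then available by name, for example final_data['U1']['learning'][0].val1.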

Upvotes: 2

senderle

Reputation: 151157

Ah, you want itertools.groupby:

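import itertools
# groupby only groups adjacent rows, so sort by class label first
data_list.sort(key=lambda x: x[-1])
# Materialize each group immediately: a group iterator is exhausted
# as soon as groupby advances to the next key
class_dict = dict((key, list(group))
                  for key, group in itertools.groupby(data_list, key=lambda x: x[-1]))
class_names = list(class_dict.keys())
class_lists = list(class_dict.values())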

Then just slice each list in class_lists appropriately and extend learning_set and test_set with the results.

Here's a full solution:

data_list = [line.strip().split(',') for line in data]
data_list.sort(key=lambda x: x[-1])

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]

learning_set, test_set = [], []
for key, group in itertools.groupby(data_list, key=lambda x: x[-1]):
    l, t = bisect_list(list(group), 0.66)
    learning_set.extend(l)
    test_set.extend(t)
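One detail worth checking is where the cut lands. The sample in the question has 12 U1 rows and 7 U2 rows; int(0.66 * n) truncates, while the question's own int(round(n * 2./3)) rounds to the nearest integer. The sketch below compares the two. cut_truncate and cut_round are names made up for this illustration.

```python
def cut_truncate(n, fraction=0.66):
    """Split index as computed by bisect_list: truncate toward zero."""
    return int(fraction * n)

def cut_round(n, fraction=2.0 / 3):
    """Split index as computed in the question: round to nearest."""
    return int(round(n * fraction))

# Class sizes in the question's sample data: 12 rows of U1, 7 rows of U2
for label, n in [('U1', 12), ('U2', 7)]:
    print(label, n, cut_truncate(n), cut_round(n))
```

For the sample this gives 7 versus 8 learning rows for U1 and 4 versus 5 for U2, so passing 0.66 to bisect_list yields slightly smaller learning sets than a rounded two-thirds split.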

Upvotes: 2

Howard

Reputation: 39217

You can filter your list after reading into two distinct subsets:

data_list_1 = [(x,y,c) for (x,y,c) in data_list if c=='U1']
data_list_2 = [(x,y,c) for (x,y,c) in data_list if c=='U2']

Afterwards you can construct the learning set and test set as before, but from the filtered lists, e.g.

learning_set = data_list_1[:int(round(len(data_list_1)*border))] + data_list_2[:int(round(len(data_list_2)*border))]

and the same for test_set.
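For completeness, the matching test_set takes the tail of each filtered list. A minimal sketch, assuming the two classes U1 and U2 from the question and a small in-memory data_list standing in for the parsed file; cut_1 and cut_2 are helper names introduced here.

```python
border = 2.0 / 3

# A few parsed rows from the question, standing in for the full file
data_list = [['2.5', '10', 'U1'], ['3', '4.5', 'U1'], ['3', '9', 'U1'],
             ['3.5', '9.5', 'U2'], ['3.5', '10.5', 'U2'], ['4.5', '8', 'U2']]

data_list_1 = [(x, y, c) for (x, y, c) in data_list if c == 'U1']
data_list_2 = [(x, y, c) for (x, y, c) in data_list if c == 'U2']

# Compute each class's split index once and reuse it for both sets
cut_1 = int(round(len(data_list_1) * border))
cut_2 = int(round(len(data_list_2) * border))

learning_set = data_list_1[:cut_1] + data_list_2[:cut_2]
test_set = data_list_1[cut_1:] + data_list_2[cut_2:]
```

Reusing the same index for both slices guarantees that every row ends up in exactly one of the two sets.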

Update: If you don't know the classes beforehand, you can use the following code to detect all classes first and then loop over them.

classes = set([t[-1] for t in data_list])

learning_set = []
test_set = []

for cl in classes:
    data_list_filtered = [t for t in data_list if t[-1]==cl]

    learning_set += data_list_filtered[:int(round(len(data_list_filtered)*border))]
    test_set += data_list_filtered[int(round(len(data_list_filtered)*border)):]

Upvotes: 2

matchew

Reputation: 19675

Consider using a dict/hash instead of a list.

I'd write more, but I am having trouble understanding what you want to do afterwards.
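A minimal sketch of what the dict suggestion could look like, assuming the class label is the last field of each row as in the question. The names by_class and sample_rows are illustrative only.

```python
# Rows as produced by line.strip().split(','), copied from the question
sample_rows = [['2.5', '10', 'U1'], ['3.5', '9.5', 'U2'],
               ['3', '4.5', 'U1'], ['4.5', '8', 'U2']]

# Map each class label to the list of rows that carry it
by_class = {}
for row in sample_rows:
    by_class.setdefault(row[-1], []).append(row)

for label in sorted(by_class):
    print(label, by_class[label])
```

Each per-class list can then be sliced into learning and test parts independently of the other classes.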

Upvotes: 1

MRAB

Reputation: 20664

I would use a defaultdict to collect the entries into separate lists.

from collections import defaultdict

data = open(r'C:\Documents and Settings\Administrator\Desktop\set.data', 'r')
data_lists = defaultdict(list)
border = 2.0 / 3
for line in data:
    entries = line.strip().split(',')
    data_lists[entries[-1]].append(entries[ : -1])

learning_sets = {}
test_sets = {}
for cls, values in data_lists.items():
    pos = int(round(len(values) * border))
    learning_sets[cls] = values[ : pos]
    test_sets[cls] = values[pos : ]

# %-formatting inside print(...) runs unchanged on Python 2 and Python 3
for cls in learning_sets:
    print("for class %s" % cls)
    print("\tlearning set is %s" % (learning_sets[cls],))
    print("\ttest set is %s" % (test_sets[cls],))
    print("")

Upvotes: 1
