Reputation: 3635
I have a *.data file that contains data in this format:
2.5,10,U1
3,4.5,U1
3,9,U1
3.5,5.5,U1
3.5,8,U1
4,7.5,U1
4.5,3.5,U1
4.5,4.5,U1
4.5,6,U1
5,5,U1
5,7,U1
7,6.5,U1
3.5,9.5,U2
3.5,10.5,U2
4.5,8,U2
4.5,10.5,U2
5,9,U2
5.5,5.5,U2
5.5,7.5,U2
In this data (this is just an example with only 2 classes; my real data comes in different types) there are 2 classes, U1 and U2, and every row has 2 values followed by its class. I need to read the data and separate it by class, in this case into U1 and U2. Then, from every class, I need to put 2/3 of the rows into one variable (learning_set) and the remaining 1/3 into another (test_set).
I started with this code:
data = open('set.data', 'rt')
data_list=[]
border=2./3
data_list = [line.strip().split(',') for line in data]
learning_set=data_list[:int(round(len(data_list)*border))]
test_set=data_list[int(round(len(data_list)*border)):]
But this takes 2/3 and 1/3 of the whole data set, not of each class.
Many thanks for your help.
Upvotes: 0
Views: 649
Reputation: 22619
For what it's worth (and because I've typed it out already), I'd accomplish this with something like...
from itertools import groupby
from operator import attrgetter
from collections import namedtuple
row_container = namedtuple('row', 'val1,val2,klass')
def process_row(row):
    """Return a named tuple"""
    return row_container(float(row[0]), float(row[1]), row[2])
def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]
data = open('test.csv', 'rt')
## Parse & process each line
data = (row.strip().split(',') for row in data)
data = (process_row(row) for row in data)
## Sort & group the data by class
sorted_data = sorted(data, key=attrgetter('klass'))
grouped_data = groupby(sorted_data, attrgetter('klass'))
## For each class, create learning and test sets
final_data = {}
for klass, class_rows in grouped_data:
    learning_set, test_set = bisect_list(list(class_rows), 0.66)
    final_data[klass] = dict(learning=learning_set, test=test_set)
The method of operation is similar to the other answers already provided, except that it uses namedtuple. bisect_list() is lifted from @senderle's answer.
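To try this on the sample from the question without a file on disk, the same pipeline can be fed an in-memory list of lines. This is only a sketch: `SAMPLE_LINES`, `Row` and `split_by_class` are names made up here, and the rows are simply copied from the question.

```python
from itertools import groupby
from operator import attrgetter
from collections import namedtuple

# Hypothetical container name; mirrors the named tuple used in the answer.
Row = namedtuple('Row', 'val1,val2,klass')

# Rows copied from the question (assumed to match the real file layout).
SAMPLE_LINES = [
    "2.5,10,U1", "3,4.5,U1", "3,9,U1", "3.5,5.5,U1", "3.5,8,U1",
    "4,7.5,U1", "4.5,3.5,U1", "4.5,4.5,U1", "4.5,6,U1", "5,5,U1",
    "5,7,U1", "7,6.5,U1", "3.5,9.5,U2", "3.5,10.5,U2", "4.5,8,U2",
    "4.5,10.5,U2", "5,9,U2", "5.5,5.5,U2", "5.5,7.5,U2",
]

def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]

def split_by_class(lines, fraction=0.66):
    # Parse each line into a named tuple with float values.
    fields = (line.strip().split(',') for line in lines)
    rows = [Row(float(f[0]), float(f[1]), f[2]) for f in fields]
    # groupby only groups adjacent rows, so sort by class first.
    rows.sort(key=attrgetter('klass'))
    result = {}
    for klass, class_rows in groupby(rows, attrgetter('klass')):
        learning, test = bisect_list(list(class_rows), fraction)
        result[klass] = dict(learning=learning, test=test)
    return result

final_data = split_by_class(SAMPLE_LINES)
# Named fields make the rows easy to read back.
first_u1 = final_data['U1']['learning'][0]
print(first_u1.klass, first_u1.val1, first_u1.val2)
```

Because the sort is stable, rows within each class keep the order they had in the input.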
Upvotes: 2
Reputation: 151157
Ah, you want itertools.groupby:
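import itertools
# groupby only groups adjacent items, so sort by class first
data_list.sort(key=lambda x: x[-1])
# materialize each group right away; a group iterator is invalidated once groupby advances
class_dict = {key: list(group) for key, group in itertools.groupby(data_list, key=lambda x: x[-1])}
class_names = list(class_dict.keys())
class_lists = list(class_dict.values())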
Then just slice each list in class_lists appropriately and extend learning_set and test_set with the results.
Here's a full solution:
data_list = [line.strip().split(',') for line in data]
data_list.sort(key=lambda x: x[-1])
def bisect_list(split_list, fraction):
    split_index = int(fraction * len(split_list))
    return split_list[:split_index], split_list[split_index:]
learning_set, test_set = [], []
for key, group in itertools.groupby(data_list, key=lambda x: x[-1]):
    l, t = bisect_list(list(group), 0.66)
    learning_set.extend(l)
    test_set.extend(t)
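One detail that may matter: bisect_list truncates with int() and uses 0.66, whereas the question rounds with int(round(...)) and uses 2/3, so the two can put the split at different indices. A small sketch of the difference for the class sizes in the sample data (12 rows of U1, 7 rows of U2); `split_index_truncated` and `split_index_rounded` are illustrative names, not part of the answer.

```python
def split_index_truncated(n, fraction=0.66):
    # Same arithmetic as bisect_list: truncate toward zero.
    return int(fraction * n)

def split_index_rounded(n, fraction=2.0 / 3):
    # Same arithmetic as the question's code: round to the nearest integer.
    return int(round(n * fraction))

for n in (12, 7):
    print(n, split_index_truncated(n), split_index_rounded(n))
# 12 7 8
# 7 4 5
```

If the rounded behaviour is the one you want, replacing the body of bisect_list with the rounded calculation should be enough.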
Upvotes: 2
Reputation: 39217
You can filter your list after reading into two distinct subsets:
data_list_1 = [(x,y,c) for (x,y,c) in data_list if c=='U1']
data_list_2 = [(x,y,c) for (x,y,c) in data_list if c=='U2']
Afterwards you can construct the learning and test sets as before, but on the filtered lists, e.g.
learning_set = data_list_1[:int(round(len(data_list_1)*border))] + data_list_2[:int(round(len(data_list_2)*border))]
and the same for test_set.
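Spelled out, the matching test_set takes the remainder of each filtered list. A minimal sketch, using a few rows from the question as stand-in data and a hypothetical helper `split_index` to avoid repeating the index arithmetic:

```python
border = 2.0 / 3

# A few rows from the question, already split on commas (stand-in data).
data_list = [
    ['2.5', '10', 'U1'], ['3', '4.5', 'U1'], ['3', '9', 'U1'],
    ['3.5', '9.5', 'U2'], ['3.5', '10.5', 'U2'], ['4.5', '8', 'U2'],
]

data_list_1 = [(x, y, c) for (x, y, c) in data_list if c == 'U1']
data_list_2 = [(x, y, c) for (x, y, c) in data_list if c == 'U2']

def split_index(rows):
    # Hypothetical helper: index at which the learning part ends.
    return int(round(len(rows) * border))

# First 2/3 of each class goes to learning, the rest to test.
learning_set = data_list_1[:split_index(data_list_1)] + data_list_2[:split_index(data_list_2)]
test_set = data_list_1[split_index(data_list_1):] + data_list_2[split_index(data_list_2):]

print(len(learning_set), len(test_set))
```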
Update: If you don't know the classes beforehand, you can use the following code to detect all classes first and then loop over them.
classes = set([t[-1] for t in data_list])
learning_set = []
test_set = []
for cl in classes:
    data_list_filtered = [t for t in data_list if t[-1]==cl]
    learning_set += data_list_filtered[:int(round(len(data_list_filtered)*border))]
    test_set += data_list_filtered[int(round(len(data_list_filtered)*border)):]
Upvotes: 2
Reputation: 19675
Consider using a dict/hash instead of a list.
I'd write more, but I'm having trouble understanding what you want to do afterwards.
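For instance, the rows could be collected into a plain dict keyed by class name, which makes a per-class split straightforward. This is only a sketch of that idea; `group_by_class` is a made-up name and the rows are stand-ins taken from the question.

```python
def group_by_class(rows):
    """Map each class label (the last field) to the list of its value pairs."""
    groups = {}
    for row in rows:
        # setdefault creates the list the first time a class is seen.
        groups.setdefault(row[-1], []).append(row[:-1])
    return groups

rows = [['2.5', '10', 'U1'], ['3', '4.5', 'U1'], ['3.5', '9.5', 'U2']]
groups = group_by_class(rows)
print(groups['U1'])  # [['2.5', '10'], ['3', '4.5']]
print(groups['U2'])  # [['3.5', '9.5']]
```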
Upvotes: 1
Reputation: 20664
I would use a defaultdict to collect the entries into separate lists.
from collections import defaultdict
data = open(r'C:\Documents and Settings\Administrator\Desktop\set.data', 'r')
data_lists = defaultdict(list)
border = 2.0 / 3
for line in data:
    entries = line.strip().split(',')
    data_lists[entries[-1]].append(entries[:-1])
learning_sets = {}
test_sets = {}
for cls, values in data_lists.items():
    pos = int(round(len(values) * border))
    learning_sets[cls] = values[:pos]
    test_sets[cls] = values[pos:]
for cls in learning_sets:
    print("for class", cls)
    print("\tlearning set is", learning_sets[cls])
    print("\ttest set is", test_sets[cls])
    print()
Upvotes: 1