Reputation: 775
I have around 3000 objects, and each object has a count associated with it. I want to randomly divide these objects into training and testing data with a 70% training and 30% testing split. However, I want to split them based on the count associated with each object, not on the number of objects.
For example, assume my dataset contains 5 objects:
Obj 1 => 200
Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110
If I split them with a nearly 70%-30% ratio, my training set should be
Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110
and my testing set would be
Obj 1 => 200
If I split them again, I should get a different training and testing set nearing the 70-30 split ratio. I understand the above split does not give a pure 70-30 split, but as long as it is close, that's acceptable.
Are there any predefined methods/packages to do this in Python?
Upvotes: 1
Views: 2733
Reputation: 639
I do not know if there is a specific function in Python, but assuming there isn't, here is an approach.
Shuffle the objects:
from random import shuffle
values = [200, 40, 30, 110, 20]
shuffle(values)  # shuffle() works in place and returns None
Calculate each value's fraction of the total:
prob = [float(i)/sum(values) for i in values]
Loop until the cumulative fraction exceeds 0.7:
running = 0
index = len(prob)  # default: everything goes to training if 0.7 is never exceeded
for i in range(len(prob)):
    if running > 0.7:
        index = i - 1
        break
    running = running + prob[i]
Now, the objects before index are the training objects, and the objects from index onward are the testing objects.
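For completeness, a minimal sketch of that last step, assuming values has been shuffled and index was computed by the loop above:
training = values[:index]   # counts before index -> training set
testing = values[index:]    # remaining counts -> testing set
print(training, testing)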
Upvotes: 0
Reputation: 2711
Assuming I understand your question correctly, my suggestion would be this:
from random import shuffle

total = sum(obj.count for obj in obj_list)  # total "count" across all objects, O(n)
shuffle(obj_list)

running_sum = 0
i = 0
while running_sum < total * 0.3:
    running_sum += obj_list[i].count
    i += 1

training_data = obj_list[i:]   # roughly 70% of the total count
testing_data = obj_list[:i]    # roughly 30% of the total count
This entire operation is O(n); you're not going to get better time complexity than that. There are certainly ways to condense the loop and whatnot into one-liners, but I don't know of any builtin that accomplishes what you're asking with a single function, especially not when you want it to be "random" in the sense that you get a different training/testing set each time you split (as I understand the question).
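To show how this might be used, here is a sketch that wraps the same idea in a small function; the Item namedtuple and the sample data are hypothetical stand-ins for your actual objects:
from collections import namedtuple
from random import shuffle

Item = namedtuple("Item", ["name", "count"])  # hypothetical stand-in for the real objects

def split_by_count(objects, test_fraction=0.3):
    # Shuffle a copy, then move objects into the test set until their
    # counts reach roughly test_fraction of the grand total.
    objs = list(objects)
    shuffle(objs)
    total = sum(obj.count for obj in objs)
    running, i = 0, 0
    while running < total * test_fraction:
        running += objs[i].count
        i += 1
    return objs[i:], objs[:i]  # (training, testing)

items = [Item("Obj 1", 200), Item("Obj 2", 30), Item("Obj 3", 40),
         Item("Obj 4", 20), Item("Obj 5", 110)]
training, testing = split_by_count(items)
print(training)
print(testing)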
Upvotes: 2