HaggarTheHorrible

Reputation: 7423

Dato: What's the equivalent function for graphlab.random_split() in pandas?

I'm doing a course on Machine Learning on Coursera. In the course, it is emphasised that we use GraphLab from Dato. In one of the exercises, the instructor used graphlab.random_split() to split an SFrame, like this:

sales = graphlab.SFrame('home_data.gl/')
train_data, test_data = sales.random_split(.8,seed=0)

I've finished the first week of the course, and the quiz requires us to solve a problem using GraphLab and SFrame. I tried to install GraphLab, but it requires a 64-bit PC and mine is 32-bit. The instructor has given us the option to use Pandas if we prefer, so I've started using Pandas.

My problem is this: the instructor uses sales.random_split(.8,seed=0), which gives him train_data and test_data. He uses them for further analysis and arrives at an answer.

Now, if I don't use a pandas function that splits the data in exactly the same way, my answer will never match his and I can never pass this quiz. The pandas function I'm interested in using is:

train_data, test_data = pandas.DataFrame.sample(frac=0.8, random_state=0)

My question is this: will pandas.DataFrame.sample(frac=0.8, random_state=0) produce the same output as sales.random_split(.8, seed=0)?
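
For completeness, the full pattern I have in mind with pandas is something like the sketch below, since sample() returns only one frame and the test rows would come from dropping the sampled index (home_data.csv is just a placeholder for however I end up loading the data):

import pandas as pd

# Placeholder path: assuming the data is available as a CSV file.
sales = pd.read_csv('home_data.csv')

# 80% of the rows, reproducible via random_state; the rest become the test set.
train_data = sales.sample(frac=0.8, random_state=0)
test_data = sales.drop(train_data.index)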

I've written to the instructor and I'm waiting for his reply; in the meantime, if anyone can help me out, kindly do. Thank you.

Upvotes: 1

Views: 2023

Answers (4)

Antoni Stavrev

Reputation: 1

When you randomly split the data via the Dato library using a certain seed, it always splits the data set the same way. Thus you and the instructor will have exactly the same values split into the test and training sets.

If you use pandas to split your sets, you will not get the same split, and thus you will not be able to submit a correct answer.

Solution 1: Check the Coursera course test details. When Pandas can be used, the instructor should have already given you the data split into train and test sets, eliminating the need for you to do it yourself and giving you the same split as if you were using Dato's random split with a certain seed.

Solution 2: You can use an Amazon compute instance for the course, where the IPython notebook is uploaded along with the Dato library. The only trick here is to set up the Dato license under your own account.

Hope this helps!

Upvotes: 0

Queeq

Reputation: 111

I am trying to complete the same course using a Python 3 / scikit-learn / pandas combo. In this case it is possible to implement a dirty workaround: split the data in a separate script using SFrame and then pick it up from the main script:

import sframe

# Load the data and split it with the same seed the instructor uses,
# so the split matches the one produced by GraphLab exactly.
sf = sframe.SFrame.read_csv('../ml/home_data.csv')
train_data, test_data = sf.random_split(0.8, seed=0)

# Convert each SFrame to a pandas DataFrame and write it out to CSV
# so the main (pandas) script can pick the split up later.
df_train = train_data.to_dataframe()
df_test = test_data.to_dataframe()

df_train.to_csv('../ml/home_train_data.csv')
df_test.to_csv('../ml/home_test_data.csv')

Afterwards, simply do pandas.read_csv() for training and test data within the main script.
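
For instance, a minimal sketch of that read-back step, assuming the file paths from the helper script above (index_col=0 recovers the index column that to_csv writes by default):

import pandas as pd

# Pick up the pre-split data written by the SFrame helper script.
train_data = pd.read_csv('../ml/home_train_data.csv', index_col=0)
test_data = pd.read_csv('../ml/home_test_data.csv', index_col=0)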

In general, I made three inquiries to the instructors/mentors within the last two weeks, but they were silently ignored. So, de facto, it is barely possible to use alternative tools for this course, even though the opposite is claimed.

Upvotes: 1

drosa

Reputation: 1

This does not give an identical result, but a similar one from a probabilistic viewpoint:

import graphlab as gl
import pandas as pd
import numpy as np

seed=8
frac=0.8

df = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100)[::-1]})
sf = gl.SFrame({'a':np.arange(100), 'b':np.arange(100)[::-1]})

glTrain,glTest=sf.random_split(frac,seed=seed)
pdTrain=df.sample(frac=frac,random_state=seed)
pdTest=df.loc[df.index.difference(pdTrain.index),:]

print(len(glTrain),len(glTest))
print(len(pdTrain),len(pdTest))

# there is randomness for the split itself in the SFrame
# for pandas, a similar thing can be done with

import random
random.seed(seed)
stdFactor=1./10
pdFrac=max(0.,min(1.,random.gauss(frac,frac*stdFactor)))
pdTrain=df.sample(frac=pdFrac,random_state=seed)
pdTest=df.loc[df.index.difference(pdTrain.index),:]
print(len(glTrain),len(glTest))
print(len(pdTrain),len(pdTest))

# if you loop over many splits from "random_split" and save the values,
# you can calculate its variance and use it in "gauss" (if it is a gaussian, after all)

(74, 26)
(80, 20)
(74, 26)
(83, 17)
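
A minimal sketch of the idea in that last comment, assuming GraphLab is available: loop over many seeds, record the train fraction that random_split actually produces, and plug its empirical standard deviation into random.gauss instead of the guessed stdFactor.

import numpy as np
import graphlab as gl

sf = gl.SFrame({'a': np.arange(100), 'b': np.arange(100)[::-1]})
frac = 0.8

# Empirically estimate how much the realised train fraction varies.
fractions = []
for s in range(1000):  # 1000 seeds is an arbitrary choice
    glTrain, glTest = sf.random_split(frac, seed=s)
    fractions.append(float(len(glTrain)) / len(sf))

empiricalStd = np.std(fractions)
print(empiricalStd)  # use this as the sigma in random.gauss(frac, sigma)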

Upvotes: 0

Gustavo Bezerra

Reputation: 11054

The closest equivalent is probably sklearn.cross_validation.train_test_split. However, its behavior is NOT identical to SFrame.random_split. Quick check:

from __future__ import print_function
import numpy as np
import pandas as pd
import graphlab as gl
from sklearn.cross_validation import train_test_split

df = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100)[::-1]})
sf = gl.SFrame({'a':np.arange(100), 'b':np.arange(100)[::-1]})

train_pd, test_pd = train_test_split(df, test_size=0.8, random_state=0)
train_gl, test_gl = sf.random_split(0.8, seed=0)

frames = [train_pd, test_pd, train_gl, test_gl]

print(*[len(f) for f in frames], end='\n\n')
print(*[f.head(3) for f in frames], sep='\n\n')

Output:

20 80 86 14

     a   b
25  25  74
37  37  62
81  81  18

     a   b
26  26  73
86  86  13
2    2  97

+---+----+
| a | b  |
+---+----+
| 0 | 99 |
| 1 | 98 |
| 2 | 97 |
+---+----+
[3 rows x 2 columns]


+----+----+
| a  | b  |
+----+----+
| 12 | 87 |
| 15 | 84 |
| 25 | 74 |
+----+----+
[3 rows x 2 columns]
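
As a side note, in scikit-learn 0.18 and later the same function lives in sklearn.model_selection (the old sklearn.cross_validation module was deprecated and later removed), so the quick check above would start like this:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # new import path

df = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100)[::-1]})
# Same call as above; only the import path changes.
train_pd, test_pd = train_test_split(df, test_size=0.8, random_state=0)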

Upvotes: 2
