Reputation: 75
Can anyone explain these two different ways of calling train_test_split? I understand the first one. I saw the second one in someone else's code.
train_text, temp_text, train_labels, temp_labels = train_test_split(
    df['text'], df['spam'],
    random_state=42,
    test_size=0.3,
    stratify=df['spam'])
df_train, df_valid = model_selection.train_test_split(
    text, test_size=0.1,
    random_state=42, stratify=data.spam.values)
In the second example, why did the person unpack only 2 variables instead of 4?
Upvotes: 1
Views: 360
Reputation: 11807
why did the person take only 2 variables instead of 4
The reason is that train_test_split takes two kinds of parameters, *arrays and **options, and it returns a train split and a test split for every array passed in *arrays:
In the first example the *arrays are df['text'] and df['spam']. The remaining arguments (the **options) are not relevant to the question. So the function receives two arrays, df['text'] and df['spam'], and produces a train split and a test split for each of them, which gives 4 results.
In the second example only one array is provided, text, so the function returns only two results: the train and test splits of text.
The documentation of the function states that it returns:
splitting: list, length=2 * len(arrays) List containing train-test split of inputs.
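To make the 2 * len(arrays) rule concrete, here is a minimal sketch. The toy texts and labels are made up for illustration and are not the asker's data; it only assumes scikit-learn is installed.

```python
from sklearn.model_selection import train_test_split

# Toy data, invented for this example: 10 texts with 10 matching labels.
texts = [f"message {i}" for i in range(10)]
labels = [0, 1] * 5

# Two arrays passed in -> four results (train and test part of each array).
four = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# One array passed in -> two results (train and test part of that array).
# The labels are still used for stratification, but they are not split themselves.
two = train_test_split(texts, test_size=0.3, random_state=42, stratify=labels)

print(len(four), len(two))
```

Note that in the one-array form the labels are not returned, so they would have to be part of the object being split (for example a whole DataFrame) if they are needed afterwards.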
Upvotes: 1
Reputation: 1
According to this, if you use stratify, the data is split in a stratified fashion, using the value of stratify as the class labels. This keeps the class distribution roughly the same in each split. Since stratify is not None in both the first and the second example, the data is stratified in both.
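As a rough illustration of what stratification does, the sketch below splits an imbalanced toy label list (80 zeros and 20 ones, invented for this example) and compares the share of the minority class in each part. It assumes only that scikit-learn is installed.

```python
from sklearn.model_selection import train_test_split

# Toy imbalanced labels, invented for this example: 80% class 0, 20% class 1.
y = [0] * 80 + [1] * 20
X = list(range(len(y)))

# Passing stratify=y asks for the class proportions of y to be kept in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Share of class 1 in each part; both should be close to the overall 0.2.
train_share = sum(y_train) / len(y_train)
test_share = sum(y_test) / len(y_test)
print(train_share, test_share)
```

Without stratify, the split is purely random, so the minority share in a small test set can drift noticeably from the overall value.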
Upvotes: 0