Imagine I have 4 dataframes with different numbers of rows but the same number of columns, like this: df1 (200 rows, 4 columns), df2 (100, 4), df3 (300, 4) and df4 (250, 4).
I would like to do supervised classification across these dataframes (always using 3 for training and 1 for test/validation) and discover which combination gives me the best accuracy score. This is a small example of a much larger volume of data, and I would like to automate the process with cross-validation.
I thought I could create a new column in each dataframe holding its name and then concat all of them, and then maybe build a mask that separates the training and test sets based on that column. But I still do not know how to do the cross-validation between them.
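For the tagging and concatenating part, this is roughly what I have in mind (just a sketch; df1 through df4 stand for my real dataframes):

import pandas as pd

# Sketch: tag each dataframe with its name, then stack them all
dfs = {"df1": df1, "df2": df2, "df3": df3, "df4": df4}
tagged = []
for name, frame in dfs.items():
    frame = frame.copy()
    frame["name"] = name  # new column identifying the source dataframe
    tagged.append(frame)
concatenated_dfs = pd.concat(tagged, ignore_index=True)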
The concatenated dataframe would look like this:
concatenated_dfs:

     feat1  feat2  feat3  feat4  name
0        4      6     57     78   df1
1        1      2     50     78   df1
2        1      1     57     78   df1
..     ...    ...    ...    ...   ...
849      3     10     50     80   df4
Could anyone show me how to do that with some code? You can use any scikit-learn classification algorithm you want. Thanks!
Upvotes: 0
You can use scikit-learn's cross_val_score with a custom iterable that generates the indices for the training-test splits in your data. Here is an example:
# Setup - creating fake data to match your description
import numpy as np
import pandas as pd

counts = [200, 100, 300, 250]  # row counts of df1, df2, df3, df4
df = pd.DataFrame(data={"name": [x for l in [[f"df{i}"] * c for i, c in enumerate(counts, 1)] for x in l]})
for i in range(1, 5):
    df[f"feat{i}"] = np.random.randn(len(df))
X = df[[c for c in df.columns if c != "name"]]
y = np.random.randint(0, 2, len(df))
# Iterable to generate the training-test splits:
indices = list()
for name in df["name"].unique():
    train = df.loc[df["name"] != name].index
    test = df.loc[df["name"] == name].index
    indices.append((train, test))
# Example model - logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# Using cross_val_score with the custom indices:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=indices)
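If you want to see which held-out dataframe each score corresponds to, one simple way is to zip the scores with the splits (cross_val_score returns one score per split, in the same order as the indices built above):

# Pair each accuracy score with the dataframe that served as the test set
for (train_idx, test_idx), score in zip(indices, scores):
    held_out = df.loc[test_idx, "name"].iloc[0]
    print(f"held out {held_out}: accuracy = {score:.3f}")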
Upvotes: 2