How to generate cross validation over different dataframes for supervised classification?

Question

Imagine I have 4 dataframes with different numbers of rows but the same number of columns, like this: df1 (200 rows, 4 columns), df2 (100, 4), df3 (300, 4) and df4 (250, 4).

I would like to run a supervised classification across these dataframes (always using 3 for training and 1 for test/validation) and find out which combination gives me the best accuracy score. This is a small example of a larger volume of data, so I would like to automate the process with cross-validation.

I thought I could add a new column to each dataframe holding its name and then concatenate all of them. Then, maybe, I could create a mask on that new column to separate the training and test sets. But I still do not know how to run the cross-validation across them.

The dataframes would be like this:

concatenated_dfs:

     feat1    feat2    feat3    feat4    name
0      4        6        57       78      df1
1      1        2        50       78      df1
2      1        1        57       78      df1
.      .        .        .         .       .
.      .        .        .         .       .
.      .        .        .         .       .
849    3        10       50       80      df4

Could anyone show me how to do that with some code? You can use any scikit-learn classification algorithm you like. Thanks!

Toby Petty · Accepted Answer

You can use scikit-learn's cross_val_score with a custom iterable that supplies the training and test indices for each split of your data. Here is an example:

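# Setup - creating fake data to match your description
import numpy as np
import pandas as pd

counts = [200, 100, 300, 250]  # rows in df1..df4, as given in the question
# One "name" label per row: "df1" repeated 200 times, "df2" 100 times, and so on
df = pd.DataFrame(data={"name": np.repeat([f"df{i}" for i in range(1, 5)], counts)})
for i in range(1, 5):
    df[f"feat{i}"] = np.random.randn(len(df))
X = df[[c for c in df.columns if c != "name"]]
# Random binary target, since the question does not show a label column
y = np.random.randint(0, 2, len(df))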

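# Iterable to generate the training-test splits.
# cross_val_score expects positional row indices, so use np.flatnonzero
# rather than index labels (which may repeat after a concat):
names = df["name"].to_numpy()
indices = []
for name in df["name"].unique():
    train = np.flatnonzero(names != name)
    test = np.flatnonzero(names == name)
    indices.append((train, test))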

# Example model - logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

# Using cross-val score with the custom indices:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=indices)
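
As a possible alternative to building the splits by hand, scikit-learn's LeaveOneGroupOut splitter can hold out one dataframe at a time when the "name" column is passed as groups. The sketch below is only an illustration under stated assumptions: the four dataframes are random stand-ins with the sizes from the question, and the "label" target column is hypothetical, since the question does not show one. It also pairs each score with the held-out dataframe so the best combination can be read off directly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for df1..df4 (sizes taken from the question).
# The "label" column is an assumption; replace it with your real target.
sizes = {"df1": 200, "df2": 100, "df3": 300, "df4": 250}
feature_cols = [f"feat{i}" for i in range(1, 5)]
frames = {}
for name, n_rows in sizes.items():
    frame = pd.DataFrame(rng.normal(size=(n_rows, 4)), columns=feature_cols)
    frame["label"] = rng.integers(0, 2, size=n_rows)
    frames[name] = frame

# Tag each dataframe with its name, then stack them with a fresh 0..N-1 index.
df = pd.concat(
    [frame.assign(name=name) for name, frame in frames.items()],
    ignore_index=True,
)

X = df[feature_cols]
y = df["label"]
groups = df["name"]

# Each split trains on three dataframes and tests on the remaining one.
logo = LeaveOneGroupOut()
scores = cross_val_score(
    LogisticRegression(), X, y, groups=groups, cv=logo, scoring="accuracy"
)

# Recover which dataframe was held out in each split, in the splitter's own order.
held_out = [groups.iloc[test_idx].iloc[0] for _, test_idx in logo.split(X, y, groups)]
results = pd.Series(scores, index=held_out, name="accuracy")
print(results.sort_values(ascending=False))
```

Each entry of results is the accuracy obtained when that dataframe is the test set and the other three are used for training. With random data like this the scores should hover around chance; with real data the top row points to the best train/test combination.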
