Reputation: 2005

Add columns of mismatched length to a dataframe in julia

I am trying to add columns of mismatched length (number of rows) to a DataFrame, but it throws this error:

DimensionMismatch("length of new column Target which is 60000 must match the number of rows in data frame (47040000)")

My code snippet is:

df = DataFrame(:Feature => train_x, :Target => train_y)

#train_x has 47040000 rows
#train_y has 60000 rows

Please suggest a solution for this problem. Thank you in advance.

Upvotes: 6

Answers (2)

Przemyslaw Szufel

Reputation: 42264

Since a DataFrame is actually a set of columns, this is possible:

df = DataFrame(x=Int[],y=Int[])
append!(df.x,[1,2])
append!(df.y,[1,2,3])

However, since such a data frame does not make sense, you will not be able to work with it via the standard DataFrames API (it will be seen as a corrupt DataFrame):

julia> df[1,:]
DataFrameRowError showing value of type DataFrameRow{DataFrame,DataFrames.Index}:
ERROR: AssertionError: Data frame is corrupt: length of column :y (3) does not match length of column 1 (2). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

Upvotes: 4

Nils Gudat

Reputation: 13800

Are you sure this is what you're trying to do? Normally one would expect that there are as many rows of features as there are rows of the target column, so this error might point to a conceptual issue in your code.
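The numbers in the error message are consistent with that reading: 47040000 = 60000 × 784, so train_x may simply be a flattened array holding 784 feature values per target. That is only a guess about the data, not something the question confirms. Under that assumption, a minimal sketch (with small, hypothetical stand-in vectors, and assuming each sample's values are stored contiguously) would regroup the features so that there is one row per target:

```julia
using DataFrames

# Hypothetical stand-ins: 3 targets with 4 feature values each.
train_y = [0, 1, 0]
train_x = collect(1.0:12.0)   # flat vector of length 3 * 4

# Number of feature values per target; must divide evenly.
n_per_row = div(length(train_x), length(train_y))
@assert n_per_row * length(train_y) == length(train_x)

# Assumption: the values belonging to one sample are contiguous in train_x,
# so each column of the reshaped matrix is one sample.
features = reshape(train_x, n_per_row, length(train_y))

# One feature vector per row, so both columns now have the same length.
df = DataFrame(:Feature => [features[:, j] for j in 1:length(train_y)],
               :Target => train_y)
```

If the values are laid out in a different order, the reshape (and possibly a permutedims) would need to be adjusted accordingly.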

If you absolutely have to do this though, I see two options:

pad out the shorter vector with missing or some value of your choice, so :Target => [train_y; fill(missing, length(train_x) - length(train_y))]. Here I'm padding at the end of the vector, which might or might not be appropriate in your case
perform a leftjoin of a DataFrame with your train_x column onto a DataFrame with your train_y column. For this you will need an index column in both DataFrames that describes how the rows of y match those of x. If you just add a running index (1:length(train_x) and 1:length(train_y), respectively) to both DataFrames, the result will be the same as padding the end of train_y with missing
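Both options can be sketched on small stand-in vectors. The data and the column name idx are made up for illustration, and the running index assumes that row i of train_y belongs to row i of train_x:

```julia
using DataFrames

# Hypothetical stand-ins for the question's train_x and train_y.
train_x = collect(1:6)
train_y = [10, 20]

# Option 1: pad the shorter vector with `missing` at the end.
padded_y = [train_y; fill(missing, length(train_x) - length(train_y))]
df_pad = DataFrame(:Feature => train_x, :Target => padded_y)

# Option 2: add a running index to both data frames and left-join on it.
df_x = DataFrame(:idx => 1:length(train_x), :Feature => train_x)
df_y = DataFrame(:idx => 1:length(train_y), :Target => train_y)
df_join = leftjoin(df_x, df_y, on = :idx)
sort!(df_join, :idx)  # do not rely on the row order produced by the join
```

In both results the Target column allows missing values, and the rows of train_x without a matching target hold missing.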

Upvotes: 5
