Nathan

Reputation: 1

How to repeat pandas dataframe records based on column value

I'm trying to duplicate rows of a pandas DataFrame (pandas v0.23.4, Python v3.7.1) based on an int value in one of the columns. I'm applying code from this question to do that, but I'm running into the following data type casting error: TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'. I don't understand why this code is attempting to cast to int32.

Starting with this,

import pandas as pd

dummy_dict = {'c1': ['a','b','c'],
              'c2': [0,1,2],
              'c3': ['textA','textB','textC']}
dummy_df = pd.DataFrame(dummy_dict)

    c1  c2  c3
0   a   0   textA
1   b   1   textB
2   c   2   textC

I'm doing this

dummy_df_test = dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2']))

I want this at the end. However, I'm getting the above error instead.

    c1  c2  c3
0   a   0   textA
1   b   1   textB
2   c   2   textC
3   c   2   textC

Upvotes: 0

Views: 186

Answers (3)

anky

Reputation: 75150

Just a workaround:

pd.concat([dummy_df[dummy_df.c2.eq(0)],dummy_df.loc[dummy_df.index.repeat(dummy_df.c2)]])

Another fantastic suggestion, courtesy of @Wen:

dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2'].clip(lower=1)))

  c1  c2
0  a   0
1  b   1
2  c   2
2  c   2
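Putting the clip suggestion together with the question's data, a minimal sketch might look like the following. The c3 column is taken from the table printed in the question, and the final reset_index call is an assumption added here to match the 0..3 index shown in the desired output; it is not part of the original suggestion.

```python
import pandas as pd

# Data as shown in the question (c3 taken from the printed table).
dummy_df = pd.DataFrame({
    'c1': ['a', 'b', 'c'],
    'c2': [0, 1, 2],
    'c3': ['textA', 'textB', 'textC'],
})

# Repeat each row c2 times; clip(lower=1) keeps rows where c2 == 0
# once instead of dropping them.
counts = dummy_df['c2'].clip(lower=1)
result = dummy_df.loc[dummy_df.index.repeat(counts)]

# Assumption: renumber the rows 0..n-1, as in the desired output.
result = result.reset_index(drop=True)
print(result)
```

Without the clip, a count of 0 removes the row entirely, which is why the plain index.repeat call would not reproduce the row for 'a' even when it runs without the casting error.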

Upvotes: 2

prosti

Reputation: 46479

You can do this with the concat function. In the first attempt (df2) all rows are duplicated; in the second attempt (df3) only the row with index 2 is duplicated.

df = dummy_df  # the DataFrame from the question
print(df)

df2 = pd.concat([df]*2, ignore_index=True)
print(df2)

df3 = pd.concat([df, df.iloc[[2]]])
print(df3)

  c1  c2     c3
0  a   0  textA
1  b   1  textB
2  c   2  textC
  c1  c2     c3
0  a   0  textA
1  b   1  textB
2  c   2  textC
3  a   0  textA
4  b   1  textB
5  c   2  textC
  c1  c2     c3
0  a   0  textA
1  b   1  textB
2  c   2  textC
2  c   2  textC

If you plan to reset the index at the end:

df3 = df3.reset_index(drop=True)

Upvotes: 0

Chris

Reputation: 16172

I believe the answer as to why it's happening can be found here: https://github.com/numpy/numpy/issues/4384

As highlighted in the original comment, specifying the dtype of the repeat counts as int32 should solve the problem.
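A sketch of what that cast could look like with the question's data is below. It assumes the error comes from NumPy refusing to "safely" cast int64 repeat counts to a 32-bit platform integer, which has not been confirmed for the asker's setup. On platforms where the original code already runs, the cast should simply be harmless.

```python
import numpy as np
import pandas as pd

# Data as shown in the question (c3 taken from the printed table).
dummy_df = pd.DataFrame({
    'c1': ['a', 'b', 'c'],
    'c2': [0, 1, 2],
    'c3': ['textA', 'textB', 'textC'],
})

# Cast the repeat counts to int32 explicitly before passing them to repeat.
repeats = dummy_df['c2'].astype(np.int32)
result = dummy_df.loc[dummy_df.index.repeat(repeats)]
print(result)
```

Note that this only addresses the dtype; the row with c2 == 0 is still repeated zero times and so does not appear in the result.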

Upvotes: 0
