Fluxy

Reputation: 2978

How to duplicate rows of a DataFrame based on values of a specific column?

I have the following DataFrame df:

import pandas as pd

d = {'1': ['25', 'AAA', 2], '2': ['30', 'BBB', 3], '3': ['5', 'CCC', 2],
     '4': ['300', 'DDD', 2], '5': ['30', 'DDD', 3], '6': ['100', 'AAA', 3]}

columns = ['Price', 'Name', 'Class']

df = pd.DataFrame.from_dict(data=d, orient='index')
df.columns = columns
I want to duplicate rows based on the values of the column Class. In particular, I want to randomly select rows where Class is equal to 3 and duplicate them. For example, the current df has 3 rows with Class equal to 3. How can I create N duplicates, where N is configurable? For example:

N = 2
target_column = "Class"
target_value = 3
new_df = create_duplicates(df, target_column, target_value, N)

I was thinking of using a for loop and, at each iteration where Class is equal to 3, generating a random number. If it is greater than 0.5, the row is added to a list of selected rows. This continues until the list contains N rows, and then these N rows are appended to df.

Is there a shorter, more elegant way to do this? Maybe a built-in pandas function?

Upvotes: 0

Views: 1412

Answers (1)

james hendricks

Reputation: 134

I think the script below will do what you need: it appends n copies of every row where Class is 3. I lifted the repetition part from: Repeat Rows in Data Frame n Times

n = 3
matches = df[df['Class'] == 3]
# Repeat each matching row n times, then append the copies to df
pd.concat([df, matches.loc[matches.index.repeat(n)]]).sort_values('Name')
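If you want N randomly chosen matching rows rather than n copies of every match, here is a minimal sketch built on DataFrame.sample. The function name and argument order come from the question; the random_state parameter is my own addition for reproducibility, and I am assuming N does not exceed the number of matching rows.

```python
import pandas as pd

def create_duplicates(df, target_column, target_value, n, random_state=None):
    """Append n randomly chosen rows where target_column == target_value."""
    matches = df[df[target_column] == target_value]
    # Sampling without replacement: each matching row is picked at most once,
    # so n must not exceed len(matches) (an assumption of this sketch).
    picked = matches.sample(n=n, random_state=random_state)
    return pd.concat([df, picked])

d = {'1': ['25', 'AAA', 2], '2': ['30', 'BBB', 3], '3': ['5', 'CCC', 2],
     '4': ['300', 'DDD', 2], '5': ['30', 'DDD', 3], '6': ['100', 'AAA', 3]}
df = pd.DataFrame.from_dict(data=d, orient='index')
df.columns = ['Price', 'Name', 'Class']

new_df = create_duplicates(df, "Class", 3, 2, random_state=0)
```

If N can be larger than the number of matching rows, passing replace=True to sample should allow the same row to be drawn more than once. The appended rows keep their original index labels, so call reset_index(drop=True) on the result if you need a unique index.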

Upvotes: 1
