Reputation: 97
I am trying to create a new pandas dataframe based on conditions. This is the original dataframe:
       topic1  topic2
name1       1       4
name2       4       4
name3       4       3
name4       4       4
name5       2       4
I want to select arbitrary rows so that topic1 == 4 appears 2 times and topic2 == 4 appears 3 times in the new dataframe. Once this is fulfilled, I want to stop the code.
bucket1_topic1 = 2
bucket1_topic2 = 3
I wrote this fairly convoluted starter that is almost working, but I am having trouble with rows that fulfil the conditions for both topic1 and topic2. What is a more efficient and correct way to do this?
rows_list = []
counter1 = 0
counter2 = 0
for index, row in data.iterrows():
    if counter1 < bucket1_topic1:
        if row.topic1 == 4:
            counter1 += 1
            rows_list.append([row[1], row.topic1, row.topic2])
    if counter2 < bucket1_topic2:
        if row.topic2 == 4 and row.topic1 != 4:
            counter2 += 1
            if [row[1], row.topic1, row.topic2] not in rows_list:
                rows_list.append([row[1], row.topic1, row.topic2])
Desired result, where topic1 == 4 appears twice and topic2 == 4 appears 3 times:
       topic1  topic2
name1       1       4
name2       4       4
name3       4       3
name5       2       4
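To make the target concrete, the two counts can be checked directly on the desired result. This is only a sketch: the frame is rebuilt by hand from the table above, and the names expected, topic1_hits and topic2_hits are made up for illustration.

```python
import pandas as pd

# Desired result from the table above, rebuilt by hand for illustration.
expected = pd.DataFrame(
    {"topic1": [1, 4, 4, 2], "topic2": [4, 4, 3, 4]},
    index=["name1", "name2", "name3", "name5"],
)

# Count how often the target value 4 occurs in each column.
topic1_hits = int((expected["topic1"] == 4).sum())
topic2_hits = int((expected["topic2"] == 4).sum())

print(topic1_hits, topic2_hits)  # 2 3
```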
Upvotes: 0
Views: 71
Reputation: 107587
Avoid looping and consider shuffling the rows with DataFrame.sample (where frac=1 returns 100% of the data frame's rows, in random order), then calculate running counts within each group using groupby().cumcount(). Finally, filter with logical subsetting:
df = (df.sample(frac=1)
        .assign(t1_grp = lambda x: x.groupby(["topic1"]).cumcount(),
                t2_grp = lambda x: x.groupby(["topic2"]).cumcount())
     )
final_df = df[(df["topic1"].isin([1,2,3])) |
              (df["topic2"].isin([1,2,3])) |
              ((df["topic1"] == 4) & (df["t1_grp"] < 2)) |
              ((df["topic2"] == 4) & (df["t2_grp"] < 3))]
final_df = final_df.drop(columns=["t1_grp", "t2_grp"])
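Note that the mask above joins its conditions with |, so a row can be kept through one condition even when the quota behind another is already full. For example, name3 (topic1 == 4, topic2 == 3) always passes the topic2 isin([1,2,3]) test, whatever its t1_grp is. If both counts should act as hard limits, a single greedy pass is one option. The following is a minimal sketch under that assumption, using the question's data; select_rows and quotas are hypothetical names, not part of pandas.

```python
import pandas as pd

# Data from the question.
data = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)

# Quotas from the question: column -> (value to count, how many to keep).
quotas = {"topic1": (4, 2), "topic2": (4, 3)}


def select_rows(df, quotas):
    """Greedy single pass (hypothetical helper, not a pandas API).

    A row is kept only if it matches at least one target value and none
    of the quotas it counts toward is already full.
    """
    counts = {col: 0 for col in quotas}
    keep = []
    for name, row in df.iterrows():
        # Columns in which this row carries the target value.
        hits = [col for col, (value, _) in quotas.items() if row[col] == value]
        if not hits:
            continue  # row matches no target value
        if any(counts[col] >= quotas[col][1] for col in hits):
            continue  # keeping it would overshoot a full quota
        for col in hits:
            counts[col] += 1
        keep.append(name)
        if all(counts[col] == limit for col, (_, limit) in quotas.items()):
            break  # every quota is met, stop scanning
    return df.loc[keep]


result = select_rows(data, quotas)
print(result)  # keeps name1, name2, name3, name5
```

For an arbitrary rather than first-come selection, pass data.sample(frac=1) instead of data. A greedy pass is not guaranteed to fill every quota for every row order, so it may be worth checking the counts on the result afterwards.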
Upvotes: 1