Reputation: 984
I am working with a dataset whose first column contains emotion (category) labels. Because the dataset is unbalanced, I need to extract the same number of rows for each category. That is, if there are 10 categories, I need to select only 100 rows from each of them, for a result of 1000 rows.
    def append_new_rows(df, new_df, s):
        c = 0
        for index, row in df.iterrows():
            if s == row[0]:
                if c <= 100:
                    new_df.append(row)
                    c += 1
        return df_2

    for s in sorted(list(set(df.category))):
        new_df = append_new_rows(df, new_df, s)
Input:

------------------------------------
| category | A   | B   | C   | D   |
------------------------------------
| happy    | ... | ... | ... | ... |
| ...      | ... | ... | ... | ... |
| sadness  | ... | ... | ... | ... |
------------------------------------

Expected output:

------------------------------------
| category | A   | B   | C   | D   |
------------------------------------
| happy    | ... | ... | ... | ... |
... 100 samples of happy
| ...      | ... | ... | ... | ... |
| sadness  | ... | ... | ... | ... |
... 100 samples of sadness
...
...
1000 sample rows
Upvotes: 0
Views: 49
Reputation: 78
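You are almost there. You need to assign the result of append back to the DataFrame and return that same DataFrame, something like this:

    def append_new_df(df, df_2, s, n):
        # Copy at most n rows whose first column equals s into df_2.
        c = 1
        for index, row in df.iterrows():
            if s == row[0]:
                if c <= n:
                    # append returns a new DataFrame, so reassign it.
                    # DataFrame.append was removed in pandas 2.0; there, use
                    # df_2 = pd.concat([df_2, row.to_frame().T]) instead.
                    df_2 = df_2.append(row)
                    c += 1
        return df_2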
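As an alternative to the row-by-row loop, the same selection can probably be done with a groupby. The following is only a minimal sketch, not a definitive implementation: the function name `balance_by_category`, its `column` and `random_state` parameters, and the tiny `demo` frame are all made up for illustration, and it assumes the label column is named `category` as in the question.

```python
import pandas as pd


def balance_by_category(df, column="category", n=100, random_state=None):
    """Return at most n rows for each distinct value of `column`.

    Hypothetical helper: with random_state=None the first n rows of each
    group are kept (like the loop above); otherwise rows are picked at random.
    """
    if random_state is None:
        # head(n) on a groupby keeps the first n rows of every group
        # and preserves the original row order.
        return df.groupby(column, sort=False).head(n)
    # Shuffle all rows once, then keep the first n of each group.
    # This also works when a group has fewer than n rows.
    shuffled = df.sample(frac=1, random_state=random_state)
    return shuffled.groupby(column, sort=False).head(n)


# Made-up data: 5 "happy" rows and 3 "sadness" rows.
demo = pd.DataFrame({
    "category": ["happy"] * 5 + ["sadness"] * 3,
    "A": range(8),
})

balanced = balance_by_category(demo, n=2)
# Each category now appears exactly twice.
print(balanced["category"].value_counts().to_dict())
```

With the data from the question, `balance_by_category(df, n=100)` should give up to 100 rows per category. The rows stay in their original order, so call `.sort_values("category")` on the result if they need to be grouped together as in the expected output.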
Upvotes: 2