Reputation: 55
To keep my machine learning algorithm from being biased toward overrepresented values, I want to reduce the frequency differences in my dataset, which is a pandas DataFrame.
For example, is there a way to keep at most 1250 rows of each unique value in column X?
Upvotes: 2
Views: 91
Reputation: 128
A solution assuming you may have an unknown number of unique values:
import pandas as pd

# Create a pandas DataFrame with the example value counts
d = {'X': 1500*["A"] + 3000*["B"] + 1300*["C"]}
df = pd.DataFrame(data=d)

# Split into a dictionary with one DataFrame per unique value
dfDict = dict(iter(df.groupby('X')))

# Keep only the first 1250 rows for each value, then recombine
for unique_val in dfDict:
    dfDict[unique_val] = dfDict[unique_val][:1250]
filtered = pd.concat(dfDict, ignore_index=True)
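With the example data above, each value should now appear at most 1250 times, which you can check with value_counts:

# Check the per-value counts after filtering
print(filtered['X'].value_counts().sort_index())
# A    1250
# B    1250
# C    1250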
Upvotes: 0
Reputation: 763
You can group the table by the column whose frequency you want to cap ("X" in your example) and take as many rows as you want per group with the head function (if a value occurs fewer times than the limit you give, head takes all of its rows):
df = df.groupby('X').head(1250)
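Note that head takes the first rows of each group in their original order. If you would rather draw a random subset per value (a minimal variation, assuming random undersampling fits your use case), shuffle the frame first:

# Shuffle the rows, then keep at most 1250 per value of X
df = df.sample(frac=1, random_state=42).groupby('X').head(1250)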
Upvotes: 1
Reputation: 4608
Can you try this (pd.concat expects a list of DataFrames):
df2 = pd.concat([df[df['X']=='A'][:1250], df[df['X']=='B'][:1250], df[df['X']=='C'][:1250]])
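Note that this only works when you know the unique values ('A', 'B', 'C') in advance; the groupby-based answers above handle an arbitrary set of values.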
Upvotes: 1