Reputation: 2870
I have a dataframe which has more than 10 million raws composed of about 30 columns.
The first column is ID
ID C
1 1
1 2
1 3
1 2
1 3
2 1
2 5
2 9
2 0
2 1
I would like to extract only the first four rows of each ID(they are the newest inputs as it is already sorted)
I am currently using the below code, but unfortunately it is so slow as it takes about two hours to process about 5% of the data and it may take a day or so to process the whole data.
df1 = pd.DataFrame() # an empty dataframe
for i in df.ID: # df is the dataframe which contains the data
df2 = df[df["ID"]== i]
df2 = df2[0:4] # take the first four rows
df_f = df1.append(df2)
Are there an effecient way to do the same thing in a shorter time.
Upvotes: 3
Views: 1020
Reputation: 214957
You need the head()
method:
df.groupby("ID").head(4)
Here is a revised version of your original code with run time testing against groupby().head()
method:
def loop():
df1 = pd.DataFrame() # an empty dataframe
for i in df.ID.drop_duplicates(): # df is the dataframe which contains the data
df2 = df[df["ID"]== i]
df2 = df2[0:4] # take the first four rows
df1 = pd.concat([df1, df2])
return df1
%timeit loop()
# 100 loops, best of 3: 1.99 ms per loop
%timeit df.groupby("ID").head(4)
# 1000 loops, best of 3: 485 µs per loop
Upvotes: 2