Natheer Alabsi
Natheer Alabsi

Reputation: 2870

How to subset a dataset in pandas dataframe?

I have a dataframe which has more than 10 million raws composed of about 30 columns.

The first column is ID

ID   C
1    1
1    2
1    3
1    2
1    3
2    1
2    5
2    9
2    0
2    1

I would like to extract only the first four rows of each ID(they are the newest inputs as it is already sorted)

I am currently using the below code, but unfortunately it is so slow as it takes about two hours to process about 5% of the data and it may take a day or so to process the whole data.

df1 = pd.DataFrame() # an empty dataframe
for i in df.ID:   # df is the dataframe which contains the data
    df2 = df[df["ID"]== i] 
    df2 = df2[0:4] # take the first four rows
    df_f = df1.append(df2) 

Are there an effecient way to do the same thing in a shorter time.

Upvotes: 3

Views: 1020

Answers (1)

akuiper
akuiper

Reputation: 214957

You need the head() method:

df.groupby("ID").head(4)

enter image description here

Here is a revised version of your original code with run time testing against groupby().head() method:

def loop():
    df1 = pd.DataFrame() # an empty dataframe
    for i in df.ID.drop_duplicates():   # df is the dataframe which contains the data
        df2 = df[df["ID"]== i] 
        df2 = df2[0:4] # take the first four rows
        df1 = pd.concat([df1, df2])
    return df1

%timeit loop()
# 100 loops, best of 3: 1.99 ms per loop

%timeit df.groupby("ID").head(4)
# 1000 loops, best of 3: 485 µs per loop

Upvotes: 2

Related Questions