TAmoel

Reputation: 23

Python: Select Rows by value in large dataframe

Given a data frame df:

 Column A: [0, 1, 3, 4, 6]

 Column B: [0, 0, 0, 0, 0]

The goal is to conditionally replace values in column B: if a value in column A exists in a set assignedToA, the corresponding value in column B is replaced with a constant b.

For example: if b=1 and assignedToA={1,4}, the result would be

Column A: [0, 1, 3, 4, 6]

Column B: [0, 1, 0, 1, 0]

My code for finding the matching A values and writing the new B values looks like this:

df.loc[df['A'].isin(assignedToA),'B']=b

This code works, but it is really slow for a huge dataframe. Do you have any advice on how to speed this up?


The dataframe df has around 5 million rows, and assignedToA contains at most 7 values.
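For reference, a minimal runnable version of the setup above (assuming pandas is imported as pd, and using the example values from the question) looks like this:

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

b = 1
assignedToA = {1, 4}

# replace B with b wherever A is in assignedToA
df.loc[df['A'].isin(assignedToA), 'B'] = b
print(df)
#    A  B
# 0  0  0
# 1  1  1
# 2  3  0
# 3  4  1
# 4  6  0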

Upvotes: 2

Views: 435

Answers (1)

jpp

Reputation: 164693

You may find a performance improvement by dropping down to numpy:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

def jp(df, vals, k):
    # operate on the underlying numpy array directly, bypassing index alignment
    B = df['B'].values
    B[np.in1d(df['A'], list(vals))] = k
    df['B'] = B
    return df

def original(df, vals, k):
    df.loc[df['A'].isin(vals), 'B'] = k
    return df

# scale the example frame up to 500,000 rows for timing
df = pd.concat([df]*100000)

%timeit jp(df, {1, 4}, 1)        # 8.55ms
%timeit original(df, {1, 4}, 1)  # 16.6ms
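As a further variant not benchmarked in the answer, you could also build the whole column in one pass with np.where; whether it beats the in1d approach on 5 million rows would need timing on the real data. A minimal sketch, using the same example frame:

import numpy as np
import pandas as pd

def with_where(df, vals, k):
    # keep the existing B value where A is not in vals, use k where it is
    mask = df['A'].isin(list(vals)).values
    df['B'] = np.where(mask, k, df['B'])
    return df

df = pd.DataFrame({'A': [0, 1, 3, 4, 6], 'B': [0, 0, 0, 0, 0]})
print(with_where(df, {1, 4}, 1))
#    A  B
# 0  0  0
# 1  1  1
# 2  3  0
# 3  4  1
# 4  6  0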

Upvotes: 2
