TAmoel

Reputation: 23

Python: Select Rows by value in large dataframe

Given a data frame df:

 Column A: [0, 1, 3, 4, 6]

 Column B: [0, 0, 0, 0, 0]

The goal is to conditionally replace values in column B: if a value in column A exists in a set assignedToA, the corresponding value in column B is replaced with a constant b.

For example: if b=1 and assignedToA={1,4}, the result would be

Column A: [0, 1, 3, 4, 6]

Column B: [0, 1, 0, 1, 0]

My code for finding the matching A values and writing the new B values looks like this:

df.loc[df['A'].isin(assignedToA),'B']=b

This code works, but it is really slow for a huge dataframe. Do you have any advice on how to speed this up?


The dataframe df has around 5 million rows, and assignedToA contains at most 7 values.
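For reference, a minimal runnable version of the setup above (assuming pandas is imported as pd, and using the example values from the question) looks like this:

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

b = 1
assignedToA = {1, 4}

# replace B with b wherever A is in assignedToA
df.loc[df['A'].isin(assignedToA), 'B'] = b
print(df)
#    A  B
# 0  0  0
# 1  1  1
# 2  3  0
# 3  4  1
# 4  6  0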

Upvotes: 2

Views: 435

Answers (1)

jpp

Reputation: 164693

You may find a performance improvement by dropping down to numpy:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

def jp(df, vals, k):
    # operate on the underlying numpy array directly, bypassing index alignment
    B = df['B'].values
    B[np.in1d(df['A'], list(vals))] = k
    df['B'] = B
    return df

def original(df, vals, k):
    df.loc[df['A'].isin(vals), 'B'] = k
    return df

# scale the example frame up to 500,000 rows for timing
df = pd.concat([df]*100000)

%timeit jp(df, {1, 4}, 1)        # 8.55ms
%timeit original(df, {1, 4}, 1)  # 16.6ms
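As a further variant not benchmarked in the answer, you could also build the whole column in one pass with np.where; whether it beats the in1d approach on 5 million rows would need timing on the real data. A minimal sketch, using the same example frame:

import numpy as np
import pandas as pd

def with_where(df, vals, k):
    # keep the existing B value where A is not in vals, use k where it is
    mask = df['A'].isin(list(vals)).values
    df['B'] = np.where(mask, k, df['B'])
    return df

df = pd.DataFrame({'A': [0, 1, 3, 4, 6], 'B': [0, 0, 0, 0, 0]})
print(with_where(df, {1, 4}, 1))
#    A  B
# 0  0  0
# 1  1  1
# 2  3  0
# 3  4  1
# 4  6  0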

Upvotes: 2
