Tom Malkin
Tom Malkin

Reputation: 2284

Fastest method of finding and replacing row-specific data in a pandas DataFrame

Given an example pandas DataFrame:

Index | sometext | a | ff |
    0     'asdff' 'b'  'g' 
    1     'asdff' 'c'  'hh'
    2     'aaf'   'd'  'i'

What would be the fastest way to replace all instances of the columns names in the [sometext] field with the data in that column, where the values to replace are row specific?

i.e. the desired result from the above input would be:

Index | sometext | a | ff |
    0     'bsdg'  'b'  'g' 
    1     'csdhh' 'c'  'hh'
    2     'ddf'   'd'  'i'

note: there is no chance the replacement values would include column names.

I have tried iterating over the rows but the execution time blows out as the length of the DataFrame and number of replacement columns increases.

the Series.str.replace method looks at single values as well so would need to be run over each row.

Upvotes: 1

Views: 1827

Answers (4)

Tom Malkin
Tom Malkin

Reputation: 2284

The fastest method I found was to use the apply function in tandem with a replacer function that uses the basic str.replace() method. It's very fast, even with a for loop inside it, and it also allows for a dynamic amount of columns:

def value_replacement(df_to_replace, replace_col):
    """ replace the <replace_col> column of a dataframe with the values in all other columns """

    cols = [col for col in df_to_replace.columns if col != replace_col]

    def replacer(rep_df):
        """ function to by used in the apply function """
        for col in cols:
            rep_df[replace_col] = \
                str(rep_df[replace_col]).replace(col.lower(), str(rep_df[col]))

        return rep_df[replace_col]

    df_to_replace[replace_col] = df_to_replace.apply(replacer, axis=1)

    return df_to_replace

Upvotes: -1

JohnE
JohnE

Reputation: 30424

This way seems quite fast. See below for a brief discussion.

import re

df['new'] = df['sometext']
for v in ['a','ff']:
   df['new'] = df.apply( lambda x: re.sub( v, x[v], x['new']), axis=1 )

Results:

  sometext  a  ff    new
0    asdff  b   g   bsdg
1    asdff  c  hh  csdhh
2      aaf  d   i    ddf

Discussion:

I expanded the sample to 15,000 rows and this was the fastest approach by around 10x or more compared to the existing answers (although I suspect there might be even faster ways).

The fact that you want to use the columns to make row specific substitutions is what complicates this answer (otherwise you would just do a simpler version of @wen's df.replace). As it is, that simple and fast approach requires further code in both my approach and wen's although I think they are more or less working the same way.

Upvotes: 2

neutralCreep
neutralCreep

Reputation: 77

I have the following:

d = {'sometext': ['asdff', 'asdff', 'aaf'], 'a': ['b', 'c', 'd'],  'ff':['g', 'hh', 'i']}
df = pd.DataFrame(data=d)

start = timeit.timeit()

def replace_single_string(row_label, original_column, final_column):
    result_1 =  df.get_value(row_label, original_column)
    result_2 = df.get_value(row_label, final_column)
    if 'a' in result_1:
        df.at[row_label, original_column] = result_1.replace('a', result_2)
    else:
        pass
    return df


for i in df.index.values:
    df = replace_single_string(i, 'sometext', 'a')

print df

end = timeit.timeit()
print end - start

This ran in 0.000404119491577 seconds in Terminal.

Upvotes: 0

BENY
BENY

Reputation: 323306

We can do this ..

df.apply(lambda x : pd.Series(x['sometext']).replace({'a':x['a'],'ff':x['ff']},regex=True),1)


Out[773]: 
       0
0   bsdg
1  csdhh
2    ddf

Upvotes: 2

Related Questions