Etienne Kintzler

Reputation: 682

Concatenating the values of two pandas Series (that are not of str type) is too slow (linear complexity)

I have a pandas.DataFrame with two (or more) Series that are not of type str (float, for instance). The output I want is a Series of type str, obtained by concatenating those float Series with a given separator (for instance "-").

The following function, build_df_ex, builds the example dataframe:

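import numpy as np
import pandas as pd

def build_df_ex(n):
    # s1 holds negative floats, s2 positive floats, both of length n
    df_ex = pd.DataFrame({"s1": -abs(np.random.rand(int(n))),
                          "s2": +abs(np.random.rand(int(n)))})
    return df_ex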

The function convert_to_str_and_add performs the desired concatenation:

def convert_to_str_and_add(df, sep="-"):
    df = df.astype(str)
    s = df.s1 + sep + df.s2
    return s

My main problem is that this function has linear complexity (see the graph below), which is prohibitive in my case. The main bottleneck is the conversion to str. I have tried going the numpy way but didn't see any gain in performance, probably because that is what pandas is already doing under the hood.

[Figure: time complexity of the operation]
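For anyone who wants to reproduce a curve like this, here is a rough timing sketch. The helper time_sizes and its repeat parameter are my own names for illustration, not part of pandas, and the two functions are restated so the snippet runs on its own:

```python
import timeit

import numpy as np
import pandas as pd


def build_df_ex(n):
    # Example frame: s1 negative floats, s2 positive floats.
    return pd.DataFrame({"s1": -np.random.rand(int(n)),
                         "s2": np.random.rand(int(n))})


def convert_to_str_and_add(df, sep="-"):
    # Convert every column to str, then concatenate with the separator.
    df = df.astype(str)
    return df["s1"] + sep + df["s2"]


def time_sizes(sizes, repeat=3):
    # Hypothetical helper: best-of-`repeat` wall time (seconds) per size.
    timings = {}
    for n in sizes:
        df = build_df_ex(n)
        runs = timeit.repeat(lambda: convert_to_str_and_add(df),
                             number=1, repeat=repeat)
        timings[n] = min(runs)
    return timings
```

Calling time_sizes([1e4, 1e5, 1e6]) returns a dict mapping each size to its best time; plotting time against size should show whether the growth is roughly proportional to n on your machine.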

Does anyone have a solution that would make this operation faster?

Thanks a lot

Upvotes: 0

Views: 49

Answers (1)

orlp

Reputation: 117771

You cannot escape linear performance: every value has to be converted to a string at least once, so the work grows with the number of rows whatever you do. Your only hope is to show more of what you plan to do with the result, to try and avoid extra work. What you have written is perfectly reasonable. You can try the following and see whether it has better performance (but I wouldn't be surprised if it doesn't):

df.apply(('{0[0]}' + sep + '{0[1]}').format, axis=1)
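Another variant that may be worth timing is a plain Python loop over the two columns. This is only a sketch: concat_listcomp is a name made up for illustration, and whether it beats astype(str) depends on your pandas version and data, so measure before relying on it. It is still linear in the number of rows.

```python
import pandas as pd


def concat_listcomp(df, sep="-"):
    # Hypothetical alternative: pull both columns out as Python floats
    # and format each pair with an f-string, one row at a time.
    left = df["s1"].tolist()
    right = df["s2"].tolist()
    values = [f"{a}{sep}{b}" for a, b in zip(left, right)]
    # Rebuild a Series on the original index so it lines up with df.
    return pd.Series(values, index=df.index)


if __name__ == "__main__":
    demo = pd.DataFrame({"s1": [-0.5, -0.25], "s2": [0.5, 0.75]})
    print(concat_listcomp(demo).tolist())
```

The loop formats each Python float with its default str representation, so check that the digits match what you need before swapping it in.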

Upvotes: 1
