Huahua

Reputation: 47

Is there a faster way in Python to split strings into sublists in a list with 1 million elements?

I'm trying to help my friend clean an order-list dataframe with one million elements.

[screenshot of the dataframe showing the product_name column]

You can see that the product_name column should contain lists, but the values are strings. So I want to split them into sublists.
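For context (the screenshot is not reproduced here), each value is a stringified list, something like:

raw_df['product_name'].iloc[0]
# "['C1', 'None', 'None']"  -- a string, not a list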

Here's my code:

order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))

But the 'split' part took a lot of time to process. I'm wondering, is there a faster way to deal with it?

Thanks~

Upvotes: 0

Views: 3476

Answers (4)

norok2

Reputation: 26896

EDIT

(I did not like my last answer, it was too confusing, so I reordered it and tested a little more systematically.)

Long story short:

For speed, just use:

def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')


df['product_name'].apply(str_to_list).to_list()
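A quick sanity check of what this produces (an illustrative input, not the OP's actual data):

str_to_list("['ciao', 'pippo', 'foo']")
# ['ciao', 'pippo', 'foo']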

Long story long:

Let's dissect your code:

order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))

What you would really like to have is a function, say str_to_list(), which converts your input string to a list.

For some reason, you do this in multiple steps, but that is not necessary. What you have so far can be rewritten as:

def str_to_list_OP(s):
    return s.replace('[', '').replace(']', '').replace('\'', '').split(', ')

If you can assume that [ and ] are always the first and last characters of your string, you can simplify this to:

def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

which should also be faster.

Alternative approaches would use regular expressions, e.g.:

def str_to_list_regex(s):
    regex = re.compile(r'[\[\]\']')
    return re.sub(regex, '', s).split(', ')

Note that all approaches so far use split(). This is quite a fast implementation that approaches C speed; hardly any pure-Python construct would beat it.
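As a rough illustration (timings are machine-dependent), you can check this with timeit:

import timeit

s = "ciao, pippo, foo, bar"
# one million split() calls; typically completes in a fraction of a second
print(timeit.timeit(lambda: s.split(', '), number=1_000_000))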

All these methods are quite unsafe, as they do not handle escaping properly; e.g., all of the above fail for the following valid Python literal:

['ciao', "pippo", 'foo, bar']

More robust alternatives in this scenario would be:

  1. ast.literal_eval, which works for any valid Python literal
  2. json.loads, which requires valid JSON strings, so it is not really an option here (single-quoted strings are not valid JSON).
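To make the trade-off concrete, here is a minimal demonstration of the failure mode and the robust alternative:

import ast

def str_to_list(s):  # the fast version from above
    return s[1:-1].replace('\'', '').split(', ')

s = "['ciao', \"pippo\", 'foo, bar']"
print(str_to_list(s))       # ['ciao', '"pippo"', 'foo', 'bar']  -- wrong
print(ast.literal_eval(s))  # ['ciao', 'pippo', 'foo, bar']      -- correct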

The speed of these solutions is compared here:

[benchmark plot: runtime of str_to_list_OP, str_to_list, str_to_list_regex, and ast.literal_eval vs. input size]

As you can see, safety comes at the price of speed.

These graphs were generated using these scripts, with the following:

def gen_input(n):
    return str([str(x) for x in range(n)])


def equal_output(a, b):
    return a == b


input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)  
funcs = str_to_list_OP, str_to_list, str_to_list_regex, ast.literal_eval 


runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

Now let's concentrate on the looping. What you do is explicit looping, and we know that Python is typically not terribly fast at that. However, looping in a comprehension can be faster because it can generate more optimized code. Another approach is a vectorized expression using Pandas primitives, either with apply() or with .str chaining.

The following timings show that the comprehension is fastest for smaller inputs, although the vectorized solution (using apply()) catches up and eventually surpasses it:

[benchmark plot: runtime of func_OP, func_QuangHoang, func_apply_df, and func_compr vs. input size]

The following test functions were used:

import pandas as pd


def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')


def func_OP(df):
    order_ls = df['product_name'].tolist()
    cln_order_ls = list()
    for i in order_ls:
        i = i.replace('[', '')
        i = i.replace(']', '')
        i = i.replace('\'', '')
        cln_order_ls.append(i)
    new_cln_order_ls = list()
    for i in cln_order_ls:
        new_cln_order_ls.append(i.split(', '))
    return new_cln_order_ls


def func_QuangHoang(df):
    return df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').to_list()


def func_apply_df(df):
    return df['product_name'].apply(str_to_list).to_list()
    
    
def func_compr(df):
    return [str_to_list(s) for s in df['product_name']]

with the following test code:

def gen_input(n):
    return pd.DataFrame(
        columns=('order_id', 'product_name'),
        data=[[i, "['ciao', 'pippo', 'foo', 'bar', 'baz']"] for i in range(n)])


def equal_output(a, b):
    return a == b


input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)  
funcs = func_OP, func_QuangHoang, func_apply_df, func_compr 


runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

again using the same base scripts as before.

Upvotes: 4

Andy L.

Reputation: 25259

Like anky_91, I was curious about the list comprehension, so I gave it a try. I run the list comprehension directly on the ndarray to save the time of calling tolist.

n = raw_df['product_name'].values
[x[1:-1].replace('\'', '').split(', ') for x in n]

Sample data:

In [1488]: raw_df.values
Out[1488]:
array([["['C1', 'None', 'None']"],
       ["['C1', 'C2', 'None']"],
       ["['C1', 'C1', 'None']"],
       ["['C1', 'C2', 'C3']"]], dtype=object)


In [1491]: %%timeit
      ...: n = raw_df['product_name'].values
      ...: [x[1:-1].replace('\'', '').split(', ') for x in n]
      ...:
16.2 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [1494]: %timeit my_func_2b(raw_df)
36.1 µs ± 489 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [1493]: %timeit my_func_2(raw_df)
39.1 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [1492]: %timeit raw_df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').tolist()
765 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, the list comprehension with chained replace and split is fastest, at twice the speed of the next contender. However, the time saved actually comes from operating on the ndarray directly without calling tolist. If I add tolist, the difference is not significant.

Upvotes: 0

Yash Thenuan

Reputation: 631

Try this:

import ast
raw_df['product_name'] = raw_df['product_name'].apply(ast.literal_eval)
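ast.literal_eval safely parses any valid Python literal, so it handles quoting and escaping that the plain replace/split approaches miss; as the benchmarks in norok2's answer show, this safety comes at the cost of speed.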

Upvotes: 1

Quang Hoang

Reputation: 150775

How about:

(df['product_name']
   .str[1:-1]
   .str.replace('\'','')
   .str.split(', ')
)
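If you need a plain Python list rather than a Series (as in the question), append .tolist() at the end:

(df['product_name']
   .str[1:-1]
   .str.replace('\'','')
   .str.split(', ')
   .tolist()
)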

Upvotes: 3
