Reputation: 47
I'm trying to help a friend clean an order-list dataframe with one million rows.
The product_name column should contain lists, but the values are strings, so I want to split each one into a sublist.
Here's my code:
order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))
But the 'split' part takes a lot of time to process. Is there a faster way to deal with it?
Thanks~
Upvotes: 0
Views: 3476
Reputation: 26896
(I did not like my last answer; it was too confused, so I reordered it and tested a little more systematically.)
For speed, just use:
def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

df['product_name'].apply(str_to_list).to_list()
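For instance, on a small sample frame (the column name is taken from the question; the data is invented for illustration):

```python
import pandas as pd

# Sample frame mirroring the question's layout (invented data).
raw_df = pd.DataFrame({'product_name': ["['C1', 'C2', 'None']",
                                        "['C1', 'None', 'None']"]})

def str_to_list(s):
    # Strip the surrounding brackets, drop the quotes, split on ', '.
    return s[1:-1].replace('\'', '').split(', ')

result = raw_df['product_name'].apply(str_to_list).to_list()
print(result)  # [['C1', 'C2', 'None'], ['C1', 'None', 'None']]
```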
Let's dissect your code:
order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))
What you would really like to do is to have a function, say str_to_list(), which converts your input string to a list.
For some reason, you do it in multiple steps, but this is really not necessary. What you have so far can be rewritten as:
def str_to_list_OP(s):
    return s.replace('[', '').replace(']', '').replace('\'', '').split(', ')
If you can assume that [ and ] are always the first and last characters of your string, you can simplify this to:
def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')
which should also be faster.
Alternative approaches would use regular expressions, e.g.:
import re

def str_to_list_regex(s):
    regex = re.compile(r'[\[\]\']')
    return re.sub(regex, '', s).split(', ')
Note that all approaches so far use split(). This is quite a fast implementation, approaching C speed, and hardly any Python construct would beat it.
All these methods are quite unsafe, as they do not handle quoting and escaping properly; e.g. all of the above would fail for the following valid Python literal:
['ciao', "pippo", 'foo, bar']
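A minimal sketch of the failure, comparing the naive function from above with ast.literal_eval on that literal:

```python
import ast

def str_to_list(s):
    # Naive approach: strip brackets, drop single quotes, split on ', '.
    return s[1:-1].replace('\'', '').split(', ')

s = '[\'ciao\', "pippo", \'foo, bar\']'

naive = str_to_list(s)
safe = ast.literal_eval(s)

print(naive)  # ['ciao', '"pippo"', 'foo', 'bar'] -- double quotes kept, 'foo, bar' wrongly split
print(safe)   # ['ciao', 'pippo', 'foo, bar']     -- parsed correctly
```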
More robust alternatives in this scenario would be:

- ast.literal_eval, which works for any valid Python literal
- json.loads, which requires valid JSON strings, so it is not really an option here

The speed of these solutions is compared here:
As you can see, safety comes at the price of speed.
These graphs were generated using these scripts, with the following:
def gen_input(n):
    return str([str(x) for x in range(n)])

def equal_output(a, b):
    return a == b

input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)
funcs = str_to_list_OP, str_to_list, str_to_list_regex, ast.literal_eval

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)
Now let's concentrate on the looping. What you do is explicit looping, and we know that Python is typically not terribly fast at that.
However, looping in a comprehension can be faster because it can generate more optimized code.
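As a minimal illustration of the two looping styles (using the str_to_list function defined above, on invented data):

```python
def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

values = ["['a', 'b']", "['c', 'd']"]

# Explicit loop: append one result at a time.
out_loop = []
for s in values:
    out_loop.append(str_to_list(s))

# Comprehension: same work, generated as more optimized bytecode.
out_compr = [str_to_list(s) for s in values]

print(out_loop == out_compr)  # True
```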
Another approach would be to use a vectorized expression with Pandas primitives, either using apply() or with .str chaining.
The following timings were obtained, indicating that comprehensions are the fastest for smaller inputs, although the vectorized solution (using apply) catches up and eventually surpasses the comprehension:
The following test functions were used:
import pandas as pd

def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

def func_OP(df):
    order_ls = df['product_name'].tolist()
    cln_order_ls = list()
    for i in order_ls:
        i = i.replace('[', '')
        i = i.replace(']', '')
        i = i.replace('\'', '')
        cln_order_ls.append(i)
    new_cln_order_ls = list()
    for i in cln_order_ls:
        new_cln_order_ls.append(i.split(', '))
    return new_cln_order_ls

def func_QuangHoang(df):
    return df['product_name'].str[1:-1].str.replace('\'', '').str.split(', ').to_list()

def func_apply_df(df):
    return df['product_name'].apply(str_to_list).to_list()

def func_compr(df):
    return [str_to_list(s) for s in df['product_name']]
with the following test code:
def gen_input(n):
    return pd.DataFrame(
        columns=('order_id', 'product_name'),
        data=[[i, "['ciao', 'pippo', 'foo', 'bar', 'baz']"] for i in range(n)])

def equal_output(a, b):
    return a == b

input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)
funcs = func_OP, func_QuangHoang, func_apply_df, func_compr

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)
again using the same base scripts as before.
Upvotes: 4
Reputation: 25259
I was curious about the list comprehension, like anky_91, so I gave it a try. I run the list comprehension directly on the ndarray to save the time spent calling tolist:
n = raw_df['product_name'].values
[x[1:-1].replace('\'', '').split(', ') for x in n]
Sample data:
In [1488]: raw_df.values
Out[1488]:
array([["['C1', 'None', 'None']"],
["['C1', 'C2', 'None']"],
["['C1', 'C1', 'None']"],
["['C1', 'C2', 'C3']"]], dtype=object)
In [1491]: %%timeit
...: n = raw_df['product_name'].values
...: [x[1:-1].replace('\'', '').split(', ') for x in n]
...:
16.2 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [1494]: %timeit my_func_2b(raw_df)
36.1 µs ± 489 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [1493]: %timeit my_func_2(raw_df)
39.1 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [1492]: %timeit raw_df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').tolist()
765 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, the list comprehension with chained replace and split is fastest; it is about twice as fast as the next approach. However, the time saved actually comes from using the ndarray directly without calling tolist. If I add tolist, the difference is not significant.
Upvotes: 0
Reputation: 631
Try this:
import ast
raw_df['product_name'] = raw_df['product_name'].apply(lambda x: ast.literal_eval(x))
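A quick sketch on invented data (note that ast.literal_eval can also be passed to apply directly, without a wrapper lambda):

```python
import ast
import pandas as pd

# Invented sample frame; column name taken from the question.
raw_df = pd.DataFrame({'product_name': ["['C1', 'C2', 'None']"]})

# Parse each string as a Python literal, turning it into a real list.
raw_df['product_name'] = raw_df['product_name'].apply(ast.literal_eval)

print(raw_df['product_name'].iloc[0])  # ['C1', 'C2', 'None']
```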
Upvotes: 1
Reputation: 150775
How about:
(df['product_name']
   .str[1:-1]
   .str.replace('\'', '')
   .str.split(', ')
)
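A runnable sketch of this chain on invented data (regex=False is added to make the literal-quote replacement explicit):

```python
import pandas as pd

# Invented sample frame; column name taken from the question.
df = pd.DataFrame({'product_name': ["['C1', 'C2']", "['C3', 'None']"]})

out = (df['product_name']
       .str[1:-1]                         # drop the surrounding brackets
       .str.replace('\'', '', regex=False)  # drop the quotes
       .str.split(', '))                  # split into real lists

print(out.to_list())  # [['C1', 'C2'], ['C3', 'None']]
```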
Upvotes: 3