Reputation: 265
This question is similar to Split (explode) pandas dataframe string entry to separate rows but includes a question about adding ranges.
I have a DataFrame:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1,2,4-6 | [email protected] |
+------+---------+----------------+
| John | NaN | [email protected] |
+------+---------+----------------+
| Mary | 1,2 | [email protected] |
+------+---------+----------------+
| Jane | 1,3-5 | [email protected] |
+------+---------+----------------+
And I'd like the Options
column to be split by the comma as well as rows added for a range.
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1 | [email protected] |
+------+---------+----------------+
| Bob | 2 | [email protected] |
+------+---------+----------------+
| Bob | 4 | [email protected] |
+------+---------+----------------+
| Bob | 5 | [email protected] |
+------+---------+----------------+
| Bob | 6 | [email protected] |
+------+---------+----------------+
| John | NaN | [email protected] |
+------+---------+----------------+
| Mary | 1 | [email protected] |
+------+---------+----------------+
| Mary | 2 | [email protected] |
+------+---------+----------------+
| Jane | 1 | [email protected] |
+------+---------+----------------+
| Jane | 3 | [email protected] |
+------+---------+----------------+
| Jane | 4 | [email protected] |
+------+---------+----------------+
| Jane | 5 | [email protected] |
+------+---------+----------------+
How can I go beyond using concat
and split
like the reference SO article says to accomplish this? I need a way to add a range.
That article uses the following code to split comma delineated values (1,2,3
):
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Thanks in advance for your suggestions!
Update 2/14 Sample data was updated to match my current case.
Upvotes: 6
Views: 2149
Reputation: 294488
I like using np.r_
and slice
I know it looks like a mess but beauty is in the eye of the beholder.
def parse(o):
mm = lambda i: slice(min(i), max(i) + 1)
return np.r_.__getitem__(tuple(
mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
))
r = df.Options.apply(parse)
new = np.concatenate(r.values)
lens = r.str.len()
df.loc[df.index.repeat(lens)].assign(Options=new)
Name Options Email
0 Bob 1 [email protected]
0 Bob 2 [email protected]
0 Bob 4 [email protected]
0 Bob 5 [email protected]
0 Bob 6 [email protected]
2 Mary 1 [email protected]
2 Mary 2 [email protected]
3 Jane 1 [email protected]
3 Jane 3 [email protected]
3 Jane 4 [email protected]
3 Jane 5 [email protected]
Explanation
np.r_
takes different slicers and indexers and returns an array of the combination.
np.r_[1, 4:7]
array([1, 4, 5, 6])
or
np.r_[slice(1, 2), slice(4, 7)]
array([1, 4, 5, 6])
But if I need to pass an arbitrary bunch of them, I need to pass a tuple
to np.r_
s __getitem__
method.
np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
array([ 1, 4, 5, 6, 10, 11, 12, 13])
So I iterate, parse, make slices and pass to np.r_.__getitem__
Use a combo of loc
, pd.Index.repeat
, and pd.Series.str.len
after applying my cool parser
pd.DataFrame.assign
to overwrite existing column__NOTE__
If you have bad characters in your Options
column, I'd try to filter like this.
df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
.replace(dict(Options={'[^0-9,\-]': ''}), regex=True) \
.query('Options != ""')
Upvotes: 6
Reputation: 164773
Here is one solution. While it's not pretty (minimal use of pandas
), it is fairly efficient.
import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable
def ranger(mystr):
return list(concat([int(i)] if '-' not in i else \
list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
for i in mystr.split(',')))
df = pd.DataFrame([['Bob', '1,2,4-6', '[email protected]'],
['Jane', '1,3-5', '[email protected]']],
columns=['Name', 'Options', 'Email'])
df['Options'] = df['Options'].map(ranger)
lens = list(map(len, df['Options']))
df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
'Email': np.repeat(df['Email'].values, lens),
'Option': np.hstack(df['Options'].values)})
# Email Name Option
# 0 [email protected] Bob 1
# 1 [email protected] Bob 2
# 2 [email protected] Bob 4
# 3 [email protected] Bob 5
# 4 [email protected] Bob 6
# 5 [email protected] Jane 1
# 6 [email protected] Jane 3
# 7 [email protected] Jane 4
# 8 [email protected] Jane 5
Benchmarking of 4 solutions below (for interest only).
As a general rule, the repeat
varieties outperform. In addition, solutions that create new dataframes from scratch (as opposed to apply
) do better. Dropping down to numpy
gives best results.
import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable
def ranger(mystr):
return list(concat([int(i)] if '-' not in i else \
list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
for i in mystr.split(',')))
def replace(x):
i, j = map(int, x.groups())
return ','.join(map(str, range(i, j + 1)))
def yourfunc(s):
ranges = (x.split("-") for x in s.split(","))
return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]
def parse(o):
mm = lambda i: slice(min(i), max(i) + 1)
return np.r_.__getitem__(tuple(mm(list(map(int, s.strip().split('-')))) for s in o.split(',')))
df = pd.DataFrame([['Bob', '1,2,4-6', '[email protected]'],
['Jane', '1,3-5', '[email protected]']],
columns=['Name', 'Options', 'Email'])
df = pd.concat([df]*1000, ignore_index=True)
def explode_jp(df):
df['Options'] = df['Options'].map(ranger)
lens = list(map(len, df['Options']))
df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
'Email': np.repeat(df['Email'].values, lens),
'Option': np.hstack(df['Options'].values)})
return df_out
def explode_cs(df):
c = df.columns
v = df.Options.str.replace('(\d+)-(\d+)', replace).str.split(',')
df_out = pd.DataFrame(df.drop('Options', 1).values.repeat(v.str.len(), axis=0))
df_out.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df_out.columns = c
return df_out
def explode_wen(df):
df.Options=df.Options.apply(yourfunc)
df_out = df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2',1)
return df_out
def explode_pir(df):
r = df.Options.apply(parse)
df_out = df.loc[df.index.repeat(r.str.len())].assign(Options=np.concatenate(r))
return df_out
%timeit explode_jp(df.copy()) # 32.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_cs(df.copy()) # 90.6 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_wen(df.copy()) # 675 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit explode_pir(df.copy()) # 163 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 4
Reputation: 402854
Start with a custom replacement function:
def replace(x):
i, j = map(int, x.groups())
return ','.join(map(str, range(i, j + 1)))
Store the column names somewhere, we'll use them later:
c = df.columns
Next, replace items in df.Options
, then split on a comma:
v = df.Options.str.replace('(\d+)-(\d+)', replace).str.split(',')
Next, reshape your data and finally load into a new dataframe:
df = pd.DataFrame(
df.drop('Options', 1).values.repeat(v.str.len(), axis=0)
)
df.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df.columns = c
df
Name Options Email
0 Bob 1 [email protected]
1 Bob 2 [email protected]
2 Bob 4 [email protected]
3 Bob 5 [email protected]
4 Bob 6 [email protected]
5 Jane 1 [email protected]
6 Jane 3 [email protected]
7 Jane 4 [email protected]
8 Jane 5 [email protected]
Upvotes: 5
Reputation: 323326
If I understand what you need
def yourfunc(s):
ranges = (x.split("-") for x in s.split(","))
return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]
df.Options=df.Options.apply(yourfunc)
df
Out[114]:
Name Options Email
0 Bob [1, 2, 4, 5, 6] [email protected]
1 Jane [1, 3, 4, 5] [email protected]
df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2',1)
Out[116]:
Name Email 0
0 Bob [email protected] 1.0
1 Bob [email protected] 2.0
2 Bob [email protected] 4.0
3 Bob [email protected] 5.0
4 Bob [email protected] 6.0
5 Jane [email protected] 1.0
6 Jane [email protected] 3.0
7 Jane [email protected] 4.0
8 Jane [email protected] 5.0
Upvotes: 6