kabaname

Reputation: 265

Split (explode) range in dataframe into multiple rows

This question is similar to Split (explode) pandas dataframe string entry to separate rows, but adds the question of how to expand ranges.

I have a DataFrame:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1,2,4-6 | [email protected]  |
+------+---------+----------------+
| John |   NaN   | [email protected] |
+------+---------+----------------+
| Mary |   1,2   | [email protected] |
+------+---------+----------------+
| Jane | 1,3-5   | [email protected] |
+------+---------+----------------+

And I'd like the Options column to be split on the commas, with a separate row added for each value in a range:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1       | [email protected]  |
+------+---------+----------------+
| Bob  | 2       | [email protected]  |
+------+---------+----------------+
| Bob  | 4       | [email protected]  |
+------+---------+----------------+
| Bob  | 5       | [email protected]  |
+------+---------+----------------+
| Bob  | 6       | [email protected]  |
+------+---------+----------------+
| John | NaN     | [email protected] |
+------+---------+----------------+
| Mary | 1       | [email protected] |
+------+---------+----------------+
| Mary | 2       | [email protected] |
+------+---------+----------------+
| Jane | 1       | [email protected] |
+------+---------+----------------+
| Jane | 3       | [email protected] |
+------+---------+----------------+
| Jane | 4       | [email protected] |
+------+---------+----------------+
| Jane | 5       | [email protected] |
+------+---------+----------------+

How can I go beyond the concat-and-split approach shown in the referenced SO article to accomplish this? I need a way to expand the ranges as well.

That article uses the following code to split comma-delimited values (1,2,3):

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2
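
For reference, here is a self-contained version of that snippet (the bare Series there is pd.Series; imports added):

import pandas as pd

a = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

# build one Series per row: the split strings become the index, var2 the value
pd.concat([pd.Series(row['var2'], row['var1'].split(','))
           for _, row in a.iterrows()]).reset_index()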

Thanks in advance for your suggestions!

Update 2/14: Sample data was updated to match my current case.

Upvotes: 6

Views: 2149

Answers (4)

piRSquared

Reputation: 294488

I like using np.r_ and slice.
I know it looks like a mess, but beauty is in the eye of the beholder.

def parse(o):
    # turn each piece into a slice: '4-6' -> slice(4, 7), '1' -> slice(1, 2)
    mm = lambda i: slice(min(i), max(i) + 1)
    # hand the tuple of slices to np.r_ to get one flat array of options
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))

r = df.Options.apply(parse)      # one expanded array per row (drop NaN rows first, see note below)
new = np.concatenate(r.values)   # all options flattened into a single array
lens = r.str.len()               # how many options each row produced

# repeat each row by its option count, then overwrite the Options column
df.loc[df.index.repeat(lens)].assign(Options=new)

   Name  Options           Email
0   Bob        1   [email protected]
0   Bob        2   [email protected]
0   Bob        4   [email protected]
0   Bob        5   [email protected]
0   Bob        6   [email protected]
2  Mary        1  [email protected]
2  Mary        2  [email protected]
3  Jane        1  [email protected]
3  Jane        3  [email protected]
3  Jane        4  [email protected]
3  Jane        5  [email protected]

Explanation

  • np.r_ takes different slicers and indexers and returns an array of the combination.

    np.r_[1, 4:7]
    array([1, 4, 5, 6])
    

    or

    np.r_[slice(1, 2), slice(4, 7)]
    array([1, 4, 5, 6])
    

    But if I need to pass an arbitrary bunch of them, I need to pass a tuple to np.r_'s __getitem__ method.

    np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
    array([ 1,  4,  5,  6, 10, 11, 12, 13])
    

    So I iterate, parse, make slices, and pass them to np.r_.__getitem__.

  • Use a combo of loc, pd.Index.repeat, and pd.Series.str.len after applying my cool parser (see the standalone sketch after this list)

  • Use pd.DataFrame.assign to overwrite existing column
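
As a minimal, self-contained sketch of that repeat-and-assign pattern (toy data of my own, not from the question):

import pandas as pd
import numpy as np

toy = pd.DataFrame({'Name': ['Bob', 'Mary'], 'Options': [[1, 2], [7]]})

lens = toy.Options.str.len()               # [2, 1]
flat = np.concatenate(toy.Options.values)  # [1 2 7]

# repeat each row by the length of its list, then overwrite Options
toy.loc[toy.index.repeat(lens)].assign(Options=flat)

#    Name  Options
# 0   Bob        1
# 0   Bob        2
# 1  Mary        7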

NOTE: If you have bad characters in your Options column, I'd try to filter like this.

df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')

Upvotes: 6

jpp

Reputation: 164773

Here is one solution. While it's not pretty (minimal use of pandas), it is fairly efficient.

import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable

def ranger(mystr):
    # expand '1,2,4-6' -> [1, 2, 4, 5, 6]
    return list(concat([int(i)] if '-' not in i else \
                list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
                for i in mystr.split(',')))

df = pd.DataFrame([['Bob', '1,2,4-6', '[email protected]'],
                   ['Jane', '1,3-5', '[email protected]']],
                  columns=['Name', 'Options', 'Email'])

df['Options'] = df['Options'].map(ranger)

lens = list(map(len, df['Options']))   # number of options per row

# repeat the scalar columns to match, then flatten the option lists
df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
                       'Email': np.repeat(df['Email'].values, lens),
                       'Option': np.hstack(df['Options'].values)})

#             Email  Name  Option
# 0   [email protected]   Bob       1
# 1   [email protected]   Bob       2
# 2   [email protected]   Bob       4
# 3   [email protected]   Bob       5
# 4   [email protected]   Bob       6
# 5  [email protected]  Jane       1
# 6  [email protected]  Jane       3
# 7  [email protected]  Jane       4
# 8  [email protected]  Jane       5

Benchmarking of 4 solutions below (for interest only).

As a general rule, the repeat varieties outperform. In addition, solutions that create new dataframes from scratch (as opposed to apply) do better. Dropping down to numpy gives best results.

import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable

def ranger(mystr):
    return list(concat([int(i)] if '-' not in i else \
                list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
                for i in mystr.split(',')))

def replace(x):
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))

def yourfunc(s):
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(mm(list(map(int, s.strip().split('-')))) for s in o.split(',')))

df = pd.DataFrame([['Bob', '1,2,4-6', '[email protected]'],
                   ['Jane', '1,3-5', '[email protected]']],
                  columns=['Name', 'Options', 'Email'])

df = pd.concat([df]*1000, ignore_index=True)

def explode_jp(df):
    df['Options'] = df['Options'].map(ranger)
    lens = list(map(len, df['Options']))
    df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
                           'Email': np.repeat(df['Email'].values, lens),
                           'Option': np.hstack(df['Options'].values)})
    return df_out

def explode_cs(df):
    c = df.columns
    v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')
    df_out = pd.DataFrame(df.drop('Options', axis=1).values.repeat(v.str.len(), axis=0))
    df_out.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
    df_out.columns = c
    return df_out

def explode_wen(df):
    df.Options=df.Options.apply(yourfunc)
    df_out = df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
    return df_out

def explode_pir(df):
    r = df.Options.apply(parse)
    df_out = df.loc[df.index.repeat(r.str.len())].assign(Options=np.concatenate(r))
    return df_out

%timeit explode_jp(df.copy())   # 32.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_cs(df.copy())   # 90.6 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_wen(df.copy())  # 675 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit explode_pir(df.copy())  # 163 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 4

cs95

Reputation: 402854

Start with a custom replacement function:

def replace(x):
    # x is a regex match for something like '4-6'; expand it to '4,5,6'
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))
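
To see what the callable does on its own, here's a quick check with re.sub on the question's sample string (illustration only, not part of the pipeline below):

import re

re.sub(r'(\d+)-(\d+)', replace, '1,2,4-6')
# '1,2,4,5,6'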

Store the column names somewhere; we'll use them later:

c = df.columns

Next, replace items in df.Options, then split on a comma:

v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')

Next, reshape your data and finally load into a new dataframe:

df = pd.DataFrame(
       df.drop('Options', axis=1).values.repeat(v.str.len(), axis=0)
)
df.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df.columns = c

df

   Name Options           Email
0   Bob       1   [email protected]
1   Bob       2   [email protected]
2   Bob       4   [email protected]
3   Bob       5   [email protected]
4   Bob       6   [email protected]
5  Jane       1  [email protected]
6  Jane       3  [email protected]
7  Jane       4  [email protected]
8  Jane       5  [email protected]

Upvotes: 5

BENY

Reputation: 323326

If I understand what you need:

def yourfunc(s):
    # split on commas, then expand each piece: '3-5' -> 3, 4, 5 and '1' -> 1
    ranges = (x.split("-") for x in s.split(","))

    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]


df.Options=df.Options.apply(yourfunc)

df
Out[114]: 
   Name          Options           Email
0   Bob  [1, 2, 4, 5, 6]   [email protected]
1  Jane     [1, 3, 4, 5]  [email protected]


df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
Out[116]: 
   Name           Email    0
0   Bob   [email protected]  1.0
1   Bob   [email protected]  2.0
2   Bob   [email protected]  4.0
3   Bob   [email protected]  5.0
4   Bob   [email protected]  6.0
5  Jane  [email protected]  1.0
6  Jane  [email protected]  3.0
7  Jane  [email protected]  4.0
8  Jane  [email protected]  5.0
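
Note that stack() leaves the values as floats in a column named 0; if you want the original column name and integer dtype back, one possible follow-up (a sketch, not part of the original answer) is:

(df.set_index(['Name', 'Email']).Options
   .apply(pd.Series).stack()
   .reset_index().drop('level_2', axis=1)
   .rename(columns={0: 'Options'})   # restore the column name
   .astype({'Options': int}))        # back to integers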

Upvotes: 6
