itzy
itzy

Reputation: 11765

Expand pandas dataframe based on range in a column

I have a pandas dataframe like this:

Name   SICs
Agric  0100-0199
Agric  0910-0919
Agric  2048-2048
Food   2000-2009
Food   2010-2019
Soda   2097-2097

The SICs column gives a range of integer values that match the Name given in the first column (although they're stored as a string).

I need to expand this DataFrame so that it has one row for each integer in the range:

Agric  100
Agric  101
Agric  102
...
Agric  199
Agric  910
Agric  911
...
Agric  919
Agric 2048
Food  2000
...

Is there a particularly good way to do this? I was going to do something like this

ranges = {i:r.split('-') for i, r in enumerate(inds['SICs'])}
ranges_expanded = {}
for r in ranges:
    ranges_expanded[r] = range(int(ranges[r][0]),int(ranges[r][1])+1)

but I wonder if there's a better way or perhaps a pandas feature to do this. (Also, I'm not sure this will work, as I don't yet see how to read the ranges_expanded dictionary into a DataFrame.)

Upvotes: 3

Views: 3068

Answers (4)

Morgan Harris
Morgan Harris

Reputation: 2619

This is the top Google search result for "pandas expand ranges to rows", and it doesn't have a very satisfying answer. So here's one that is fast and uses Pandas and NumPy functionality. We start by using str.extract as the other solutions do, but then we're going to use numpy.repeat and DataFrame.groupby to do our expansion.

data = pd.DataFrame.from_records([
('Agric',  '0100-0199'),
('Agric',  '0910-0919'),
('Agric',  '2048-2048'),
('Food',   '2000-2009'),
('Food',   '2010-2019'),
('Soda',   '2097-2097'),
], columns=('Name', 'SICs'))

# Extract the SICs minimum and maximum
data = pd.concat((data, data['SICs'].str.extract("(?P<SICs_min>\d+)-(?P<SICs_max>\d+)")),axis=1)

# Set their types to be integers, as str.extract comes up with objects
data = data.astype({'SICs_min':np.int32, 'SICs_max':np.int32})

# Add a SICs_len column for the size of our range. We're also adding an original_index column we'll need later.
data = data.assign(SICs_len=data.SICs_max-data.SICs_min+1, original_index=data.index)

# Create a new dataframe by repeating the values from the original SICs_len times. So the rows from the
# first range will be repeated 100 times, the second range will be repeated 10 times, etc.
new_data = pd.DataFrame(np.repeat(data.values, data.SICs_len, axis=0))
new_data.columns = data.columns
new_data = new_data.astype(data.dtypes)


# At this point, we've expanded our ranges, but we don't have a 
# column that represents the number that the range has been expanded into.

# To get that, we're going to use the SICs_min column, plus a counter that resets
# every time we get to a new SIC range. 

# The counter can be made by grouping by the original SICs range and then doing
# a cumulative sum of a column of just ones.

cumsum = new_data.assign(ones=np.ones_like(new_data.index)).groupby('original_index').ones.cumsum()

# This looks like 1-100, followed by 1-10, etc. We need it to be 0-99, 0-9 etc. so we subtract one
# before we add it to the SICs_min to get the actual SIC number.

final_data = new_data.assign(SIC=new_data.SICs_min + cumsum - 1)
print(final_data)

Our final data frame looks like this:

      Name       SICs SICs_min SICs_max SICs_len original_index   SIC
0    Agric  0100-0199      100      199      100              0   100
1    Agric  0100-0199      100      199      100              0   101
2    Agric  0100-0199      100      199      100              0   102
3    Agric  0100-0199      100      199      100              0   103
4    Agric  0100-0199      100      199      100              0   104
..     ...        ...      ...      ...      ...            ...   ...
127   Food  2010-2019     2010     2019       10              4  2016
128   Food  2010-2019     2010     2019       10              4  2017
129   Food  2010-2019     2010     2019       10              4  2018
130   Food  2010-2019     2010     2019       10              4  2019
131   Soda  2097-2097     2097     2097        1              5  2097

[132 rows x 7 columns]

You might want to tidy it up by removing some columns:

final_data = final_data.drop(columns=['SICs_min', 'SICs_max', 'SICs_len', 'original_index'])

You might ask, why bother with the original_index column? Well, this allows us to have multiple overlapping ranges, even ranges that start with the same number. Try duplicating the last 'Food' range and give it a different Name - you'll see that both ranges end up in the final table. In fact, even if you give it the same name, you still get both ranges! Without the original_index column, there's no other way to do this.

Upvotes: 0

joshlk
joshlk

Reputation: 1651

The neatest way I found (building on from Andy Hayden's answer):

# Extract date min and max
df = df.set_index("Name")
df = df['SICs'].str.extract("(\d+)-(\d+)")
df.columns = ['min', 'max']
df = df.astype('int')

# Enumerate dates into wide table
enumerated_dates = [np.arange(row['min'], row['max']+1) for _, row in df.iterrows()]
df = pd.DataFrame.from_records(data=enumerated_dates, index=df.index)

# Convert from wide to long table
df = df.stack().reset_index(1, drop=True)

It is however slow due to the for loop. A vectorised solution would be amazing but I cant find one.

Upvotes: 2

ryanmc
ryanmc

Reputation: 1881

Quick and dirty but I think this gets you to what you need:

from io import StringIO
import pandas as pd

players=StringIO(u"""Name,SICs
Agric,0100-0199
Agric,0210-0211
Food,2048-2048
Soda,1198-1200""")

df = pd.DataFrame.from_csv(players, sep=",", parse_dates=False).reset_index()


df2 = pd.DataFrame(columns=('Name', 'SIC'))

count = 0
for idx,r in df.iterrows():
    data = r['SICs'].split("-")
    for i in range(int(data[0]), int(data[1])+1):
                   df2.loc[count] = (r['Name'], i)
                   count += 1

Upvotes: 3

Andy Hayden
Andy Hayden

Reputation: 375685

You can use str.extract to get strings from a regular expression:

In [11]: df
Out[11]:
   Name       SICs
0  Agri  0100-0199
1  Agri  0910-0919
2  Food  2000-2009

First take out the name as that's the thing we want to keep:

In [12]: df1 = df.set_index("Name")

In [13]: df1
Out[13]:
           SICs
Name
Agri  0100-0199
Agri  0910-0919
Food  2000-2009

In [14]: df1['SICs'].str.extract("(\d+)-(\d+)")
Out[14]:
         0     1
Name
Agri  0100  0199
Agri  0910  0919
Food  2000  2009

Then flatten this with stack (which adds a MultiIndex):

In [15]: df1['SICs'].str.extract("(\d+)-(\d+)").stack()
Out[15]:
Name
Agri  0    0100
      1    0199
      0    0910
      1    0919
Food  0    2000
      1    2009
dtype: object

If you must you can remove the 0-1 level of the MultiIndex:

In [16]: df1['SICs'].str.extract("(\d+)-(\d+)").stack().reset_index(1, drop=True)
Out[16]:
Name
Agri    0100
Agri    0199
Agri    0910
Agri    0919
Food    2000
Food    2009
dtype: object

Upvotes: 1

Related Questions