Reputation: 11765
I have a pandas dataframe like this:
Name SICs
Agric 0100-0199
Agric 0910-0919
Agric 2048-2048
Food 2000-2009
Food 2010-2019
Soda 2097-2097
The SICs column gives a range of integer values that match the Name given in the first column (although they're stored as a string). I need to expand this DataFrame so that it has one row for each integer in the range:
Agric 100
Agric 101
Agric 102
...
Agric 199
Agric 910
Agric 911
...
Agric 919
Agric 2048
Food 2000
...
Is there a particularly good way to do this? I was going to do something like this:
ranges = {i: r.split('-') for i, r in enumerate(inds['SICs'])}
ranges_expanded = {}
for r in ranges:
    ranges_expanded[r] = range(int(ranges[r][0]), int(ranges[r][1]) + 1)
but I wonder if there's a better way or perhaps a pandas feature to do this. (Also, I'm not sure this will work, as I don't yet see how to read the ranges_expanded dictionary into a DataFrame.)
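For that last step, here is a minimal sketch of one way the ranges_expanded dictionary could be turned into a long DataFrame (it assumes inds is the DataFrame shown above, including its Name column, and that ranges_expanded was built as in the snippet; the rows and expanded names are made up for illustration):
import pandas as pd

# Sketch only: pair each expanded range with its Name and build one long frame.
rows = [(name, sic)
        for name, sics in zip(inds['Name'], ranges_expanded.values())
        for sic in sics]
expanded = pd.DataFrame(rows, columns=['Name', 'SIC'])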
Upvotes: 3
Views: 3068
Reputation: 2619
This is the top Google search result for "pandas expand ranges to rows", and it doesn't have a very satisfying answer. So here's one that is fast and uses Pandas and NumPy functionality. We start by using str.extract as the other solutions do, but then we're going to use numpy.repeat and DataFrame.groupby to do our expansion.
import numpy as np
import pandas as pd

data = pd.DataFrame.from_records([
('Agric', '0100-0199'),
('Agric', '0910-0919'),
('Agric', '2048-2048'),
('Food', '2000-2009'),
('Food', '2010-2019'),
('Soda', '2097-2097'),
], columns=('Name', 'SICs'))
# Extract the SICs minimum and maximum
data = pd.concat((data, data['SICs'].str.extract(r"(?P<SICs_min>\d+)-(?P<SICs_max>\d+)")), axis=1)
# Set their types to be integers, as str.extract comes up with objects
data = data.astype({'SICs_min':np.int32, 'SICs_max':np.int32})
# Add a SICs_len column for the size of our range. We're also adding an original_index column we'll need later.
data = data.assign(SICs_len=data.SICs_max-data.SICs_min+1, original_index=data.index)
# Create a new dataframe by repeating the values from the original SICs_len times. So the rows from the
# first range will be repeated 100 times, the second range will be repeated 10 times, etc.
new_data = pd.DataFrame(np.repeat(data.values, data.SICs_len, axis=0))
new_data.columns = data.columns
new_data = new_data.astype(data.dtypes)
# At this point, we've expanded our ranges, but we don't have a
# column that represents the number that the range has been expanded into.
# To get that, we're going to use the SICs_min column, plus a counter that resets
# every time we get to a new SIC range.
# The counter can be made by grouping by the original SICs range and then doing
# a cumulative sum of a column of just ones.
cumsum = new_data.assign(ones=np.ones_like(new_data.index)).groupby('original_index').ones.cumsum()
# This looks like 1-100, followed by 1-10, etc. We need it to be 0-99, 0-9 etc. so we subtract one
# before we add it to the SICs_min to get the actual SIC number.
final_data = new_data.assign(SIC=new_data.SICs_min + cumsum - 1)
print(final_data)
Our final data frame looks like this:
Name SICs SICs_min SICs_max SICs_len original_index SIC
0 Agric 0100-0199 100 199 100 0 100
1 Agric 0100-0199 100 199 100 0 101
2 Agric 0100-0199 100 199 100 0 102
3 Agric 0100-0199 100 199 100 0 103
4 Agric 0100-0199 100 199 100 0 104
.. ... ... ... ... ... ... ...
127 Food 2010-2019 2010 2019 10 4 2016
128 Food 2010-2019 2010 2019 10 4 2017
129 Food 2010-2019 2010 2019 10 4 2018
130 Food 2010-2019 2010 2019 10 4 2019
131 Soda 2097-2097 2097 2097 1 5 2097
[132 rows x 7 columns]
You might want to tidy it up by removing some columns:
final_data = final_data.drop(columns=['SICs_min', 'SICs_max', 'SICs_len', 'original_index'])
You might ask, why bother with the original_index column? Well, it allows us to have multiple overlapping ranges, even ranges that start with the same number. Try duplicating the last 'Food' range and giving it a different Name - you'll see that both ranges end up in the final table. In fact, even if you give it the same name, you still get both ranges! Without the original_index column, there would be no way to keep such ranges apart.
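Here is a small self-contained sketch of that point, with made-up names and two ranges that share the same start; it only reruns the repeat-and-cumsum step to show how grouping by the range start would merge the counters while original_index keeps them separate:
import numpy as np
import pandas as pd

# Two made-up ranges that both start at 2010.
demo = pd.DataFrame({
    'Name': ['Food', 'Drink'],
    'SICs_min': [2010, 2010],
    'SICs_len': [10, 6],
    'original_index': [0, 1],
})
repeated = pd.DataFrame(np.repeat(demo.values, demo.SICs_len, axis=0),
                        columns=demo.columns).astype(demo.dtypes)
# Grouping by the range start lumps both ranges into one running counter...
merged = repeated.assign(ones=1).groupby('SICs_min').ones.cumsum()
# ...while grouping by original_index restarts the counter for each input row.
split = repeated.assign(ones=1).groupby('original_index').ones.cumsum()
print((repeated.SICs_min + merged - 1).max())  # 2025 - runs past the real ranges
print((repeated.SICs_min + split - 1).max())   # 2019 - correct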
Upvotes: 0
Reputation: 1651
The neatest way I found (building on Andy Hayden's answer):
import numpy as np
import pandas as pd

# Extract the SIC min and max
df = df.set_index("Name")
df = df['SICs'].str.extract(r"(\d+)-(\d+)")
df.columns = ['min', 'max']
df = df.astype('int')
# Enumerate the SICs into a wide table (one column per position in the range)
enumerated_sics = [np.arange(row['min'], row['max'] + 1) for _, row in df.iterrows()]
df = pd.DataFrame.from_records(data=enumerated_sics, index=df.index)
# Convert from wide to long table
df = df.stack().reset_index(1, drop=True)
It is, however, slow due to the for loop. A vectorised solution would be amazing, but I can't find one.
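For reference, here is a sketch of a loop-free variant (not part of the original answer): it works on df as it stands right after the astype('int') step above, i.e. indexed by Name with integer 'min'/'max' columns, and the lengths/names/starts/offsets/long_df names are made up here:
import numpy as np
import pandas as pd

lengths = (df['max'] - df['min'] + 1).to_numpy()
names = np.repeat(df.index.to_numpy(), lengths)    # one Name per output row
starts = np.repeat(df['min'].to_numpy(), lengths)  # each range start, repeated
# A counter that restarts at 0 at the beginning of every range:
offsets = np.arange(lengths.sum()) - np.repeat(lengths.cumsum() - lengths, lengths)
long_df = pd.DataFrame({'Name': names, 'SIC': starts + offsets})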
Upvotes: 2
Reputation: 1881
Quick and dirty but I think this gets you to what you need:
from io import StringIO
import pandas as pd
players=StringIO(u"""Name,SICs
Agric,0100-0199
Agric,0210-0211
Food,2048-2048
Soda,1198-1200""")
df = pd.read_csv(players, sep=",")
df2 = pd.DataFrame(columns=('Name', 'SIC'))
count = 0
for idx, r in df.iterrows():
    data = r['SICs'].split("-")
    for i in range(int(data[0]), int(data[1]) + 1):
        df2.loc[count] = (r['Name'], i)
        count += 1
Upvotes: 3
Reputation: 375685
You can use str.extract to pull the two endpoints out with a regular expression:
In [11]: df
Out[11]:
Name SICs
0 Agri 0100-0199
1 Agri 0910-0919
2 Food 2000-2009
First take out the name as that's the thing we want to keep:
In [12]: df1 = df.set_index("Name")
In [13]: df1
Out[13]:
SICs
Name
Agri 0100-0199
Agri 0910-0919
Food 2000-2009
In [14]: df1['SICs'].str.extract(r"(\d+)-(\d+)")
Out[14]:
0 1
Name
Agri 0100 0199
Agri 0910 0919
Food 2000 2009
Then flatten this with stack (which adds a MultiIndex):
In [15]: df1['SICs'].str.extract(r"(\d+)-(\d+)").stack()
Out[15]:
Name
Agri 0 0100
1 0199
0 0910
1 0919
Food 0 2000
1 2009
dtype: object
If you must, you can remove the 0-1 level of the MultiIndex:
In [16]: df1['SICs'].str.extract(r"(\d+)-(\d+)").stack().reset_index(1, drop=True)
Out[16]:
Name
Agri 0100
Agri 0199
Agri 0910
Agri 0919
Food 2000
Food 2009
dtype: object
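This gives the two endpoints per Name rather than one row per integer. A possible final step (a sketch, not part of this answer: it continues the session above, assumes pandas is imported as pd and is recent enough to have Series.explode, and the bounds/sics/expanded names and lo/hi group names are made up):
bounds = df1['SICs'].str.extract(r"(?P<lo>\d+)-(?P<hi>\d+)").astype(int)
# One list of integers per input row, then one output row per integer.
sics = pd.Series(
    [list(range(lo, hi + 1)) for lo, hi in zip(bounds['lo'], bounds['hi'])],
    index=bounds.index,
)
expanded = sics.explode().rename('SIC').reset_index()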
Upvotes: 1