Reputation: 10150
I have a test dataframe that looks something like this:
data = pd.DataFrame([[0,0,0,3,6,5,6,1],[1,1,1,3,4,5,2,0],[2,1,0,3,6,5,6,1],[3,0,0,2,9,4,2,1]], columns=["id", "sex", "split", "group0Low", "group0High", "group1Low", "group1High", "trim"])
grouped = data.groupby(['sex','split']).mean()
stacked = grouped.stack().reset_index(level=2)
stacked.columns = ['group_level', 'mean']
Next, I want to separate out group_level and stack those 2 new factors:
stacked['group'] = stacked.group_level.str[:6]
stacked['level'] = stacked.group_level.str[6:]
This all works fine. My question is this:
This works if my column names ("group0Low", "group0High", "group1Low", "group1High") have something in common with each other.
What if instead my column names were more like "routeLow", "routeHigh", "landmarkLow", "landmarkHigh"? How would I use str to split group_level in this case?
This question is similar to this one posted here: Slice/split string Series at various positions
The difference is all of my column subnames are different and have no commonality (whereas in the other post everything had group or class in the name). Is there a regex string, or some other method, I can use to do this stacking?
Upvotes: 3
Views: 8274
Reputation: 8906
I suppose it depends how general the strings you're working are. Assuming the only levels are always delimited by a capital letter you can do
In [30]:
s = pd.Series(['routeHigh', 'routeLow', 'landmarkHigh',
'landmarkLow', 'routeMid', 'group0Level'])
s.str.extract('([\d\w]*)([A-Z][\w\d]*)')
Out[30]:
0 1
0 route High
1 route Low
2 landmark High
3 landmark Low
4 route Mid
5 group0 Level
You can even name the columns of the result in the same line by doing
s.str.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
So in your use case you can do
group_level_df = stacked.group_level.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
stacked = pd.concat([stacked, group_level_df])
Here's another approach which assumes only knowledge of the level names in advance. Suppose you have three levels:
lower = stacked.group_level.str.lower()
for level in ['low', 'mid', 'high']:
rows_in = lower.str.contains(level)
stacked.loc[rows_in, 'level'] = level.capitalize()
stacked.loc[rows_in, 'group'] = stacked.group_level[rows_in].str.replace(level, '')
Which should work as long as the level doesn't appear in the group name as well, e.g. 'highballHigh'. In cases where group_level
didn't contain any of these levels you would end up with null values in the corresponding rows
Upvotes: 2
Reputation: 24742
Here is another way. It assumes that low/high group ends with the words Low
and High
respectively, so that we can use .str.endswith()
to identify which rows are Low/High.
Here is the sample data
df = pd.DataFrame('group0Low group0High group1Low group1High routeLow routeHigh landmarkLow landmarkHigh'.split(), columns=['group_level'])
df
group_level
0 group0Low
1 group0High
2 group1Low
3 group1High
4 routeLow
5 routeHigh
6 landmarkLow
7 landmarkHigh
Use np.where
, we can do the following
df['level'] = np.where(df['group_level'].str.endswith('Low'), 'Low', 'High')
df['group'] = np.where(df['group_level'].str.endswith('Low'), df['group_level'].str[:-3], df['group_level'].str[:-4])
df
group_level level group
0 group0Low Low group0
1 group0High High group0
2 group1Low Low group1
3 group1High High group1
4 routeLow Low route
5 routeHigh High route
6 landmarkLow Low landmark
7 landmarkHigh High landmark
Upvotes: 3