Simon
Simon

Reputation: 10150

Pandas split column name

I have a test dataframe that looks something like this:

data = pd.DataFrame([[0,0,0,3,6,5,6,1],[1,1,1,3,4,5,2,0],[2,1,0,3,6,5,6,1],[3,0,0,2,9,4,2,1]], columns=["id", "sex", "split", "group0Low", "group0High", "group1Low", "group1High", "trim"])

grouped = data.groupby(['sex','split']).mean()

stacked = grouped.stack().reset_index(level=2)
stacked.columns = ['group_level', 'mean']

Next, I want to separate out group_level and stack those 2 new factors:

stacked['group'] = stacked.group_level.str[:6]
stacked['level'] = stacked.group_level.str[6:]

This all works fine. My question is this:

This works if my column names ("group0Low", "group0High", "group1Low", "group1High") have something in common with each other.

What if instead my column names were more like "routeLow", "routeHigh", "landmarkLow", "landmarkHigh"? How would I use str to split group_level in this case?

This question is similar to this one posted here: Slice/split string Series at various positions

The difference is all of my column subnames are different and have no commonality (whereas in the other post everything had group or class in the name). Is there a regex string, or some other method, I can use to do this stacking?

Upvotes: 3

Views: 8274

Answers (2)

JoeCondron
JoeCondron

Reputation: 8906

I suppose it depends how general the strings you're working are. Assuming the only levels are always delimited by a capital letter you can do

In [30]:    
s = pd.Series(['routeHigh', 'routeLow', 'landmarkHigh', 
               'landmarkLow', 'routeMid', 'group0Level'])
s.str.extract('([\d\w]*)([A-Z][\w\d]*)')

Out[30]:
    0       1
0   route   High
1   route   Low
2   landmark    High
3   landmark    Low
4   route   Mid
5   group0  Level

You can even name the columns of the result in the same line by doing

s.str.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')

So in your use case you can do

group_level_df = stacked.group_level.extract('(?P<group>[\d\w]*)(?P<Level>[A-Z][\w\d]*)')
stacked = pd.concat([stacked, group_level_df])

Here's another approach which assumes only knowledge of the level names in advance. Suppose you have three levels:

lower = stacked.group_level.str.lower()
for level in ['low', 'mid', 'high']:

    rows_in = lower.str.contains(level)
    stacked.loc[rows_in, 'level'] = level.capitalize()  
    stacked.loc[rows_in, 'group'] = stacked.group_level[rows_in].str.replace(level, '')

Which should work as long as the level doesn't appear in the group name as well, e.g. 'highballHigh'. In cases where group_level didn't contain any of these levels you would end up with null values in the corresponding rows

Upvotes: 2

Jianxun Li
Jianxun Li

Reputation: 24742

Here is another way. It assumes that low/high group ends with the words Low and High respectively, so that we can use .str.endswith() to identify which rows are Low/High.

Here is the sample data

df = pd.DataFrame('group0Low group0High group1Low group1High routeLow routeHigh landmarkLow landmarkHigh'.split(), columns=['group_level'])
df

    group_level
0     group0Low
1    group0High
2     group1Low
3    group1High
4      routeLow
5     routeHigh
6   landmarkLow
7  landmarkHigh

Use np.where, we can do the following

df['level'] = np.where(df['group_level'].str.endswith('Low'), 'Low', 'High')
df['group'] = np.where(df['group_level'].str.endswith('Low'), df['group_level'].str[:-3], df['group_level'].str[:-4])

df

    group_level level     group
0     group0Low   Low    group0
1    group0High  High    group0
2     group1Low   Low    group1
3    group1High  High    group1
4      routeLow   Low     route
5     routeHigh  High     route
6   landmarkLow   Low  landmark
7  landmarkHigh  High  landmark

Upvotes: 3

Related Questions