codingknob
codingknob

Reputation: 11680

split string column by "_", drop the preceding text, recombine str by "_" in pandas

>e = {0: pd.Series(['NHL_toronto_maple-leafs_Canada', 'NHL_boston_bruins_US', 'NHL_detroit_red-wings', 'NHL_montreal'])}

>df = pd.DataFrame(e)

>df

    0
0   NHL_toronto_maple-leafs_Canada
1   NHL_boston_bruins_US
2   NHL_detroit_red-wings
3   NHL_montreal

I want to:

1) split the above dataframe (Series) by '_'

2) drop the 'NHL' string

3) recombine the remaining text by '_'

4) attach the result in #3 to the original dataframe as the second column

To do this I tried the following:

>df2 = df.icol(0).str.split('_').apply(pd.Series).iloc[:,1:]

>df2

    1   2   3
0   toronto maple-leafs Canada
1   boston  bruins  US
2   detroit red-wings   NaN
3   montreal    NaN NaN

I tried to follow the suggestion in combine columns in Pandas by doing something like:

>df2['4'] = df2.iloc[:,0] + "_" + df2.iloc[:,1] + "_" + df2.iloc[:,2]

>df2

    1   2   3   4
0   toronto maple-leafs Canada  toronto_maple-leafs_Canada
1   boston  bruins  US  boston_bruins_US
2   detroit red-wings   NaN NaN
3   montreal    NaN NaN NaN

However, you can see that in situations where a combine involves a cell that is NaN the end result is NaN as well. This is not what I want.

Column 4 should look like:

toronto_maple-leafs_Canada
boston_bruins_US
detroit_red-wings_US
montreal

Also is there an efficient way to do this type of operation as my real data set is quite large.

Upvotes: 0

Views: 1004

Answers (2)

fixxxer
fixxxer

Reputation: 16154

You could use apply like this :

In [1782]: df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
Out[1782]: 
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object

In [1783]: df[0] = df[0].apply(lambda v: '_'.join(v.split('_')[1:]))

Surprisingly, applying str seem to be taking longer than apply :

In [1811]: %timeit df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
10000 loops, best of 3: 127 µs per loop

In [1810]: %timeit df[0].str[4:]
1000 loops, best of 3: 179 µs per loop

In [1812]: %timeit df[0].str.split('_').str[1:].str.join('_')
1000 loops, best of 3: 553 µs per loop

In [1813]: %timeit df[0].str.split("_", 1).str[1]
1000 loops, best of 3: 374 µs per loop

Upvotes: 1

Zero
Zero

Reputation: 76987

If you're just looking to remove starting 'NHL_' substring, you could just

In [84]: df[0].str[4:]
Out[84]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object

However, if you need to split and join, you could use a string method like-

In [85]: df[0].str.split('_').str[1:].str.join('_')
Out[85]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object

Alternatively, you could also use apply

In [86]: df[0].apply(lambda x: '_'.join(x.split('_')[1:])) # Also, x.split('_', 1)[1]
Out[86]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object

And, as @DSM pointed out - "split accepts an argument for the maximum number of splits"

In [87]: df[0].str.split("_", 1).str[1]
Out[87]:
0    toronto_maple-leafs_Canada
1              boston_bruins_US
2             detroit_red-wings
3                      montreal
Name: 0, dtype: object

Depending on the size of your data, you could benchmark these methods and use appropriate one.

Upvotes: 2

Related Questions