Reputation: 11680
>e = {0: pd.Series(['NHL_toronto_maple-leafs_Canada', 'NHL_boston_bruins_US', 'NHL_detroit_red-wings', 'NHL_montreal'])}
>df = pd.DataFrame(e)
>df
0
0 NHL_toronto_maple-leafs_Canada
1 NHL_boston_bruins_US
2 NHL_detroit_red-wings
3 NHL_montreal
I want to:
1) split the above dataframe (Series) by '_'
2) drop the 'NHL' string
3) recombine the remaining text by '_'
4) attach the result in #3 to the original dataframe as the second column
To do this I tried the following:
>df2 = df.icol(0).str.split('_').apply(pd.Series).iloc[:,1:]
>df2
1 2 3
0 toronto maple-leafs Canada
1 boston bruins US
2 detroit red-wings NaN
3 montreal NaN NaN
I tried to follow the suggestion in combine columns in Pandas by doing something like:
>df2['4'] = df2.iloc[:,0] + "_" + df2.iloc[:,1] + "_" + df2.iloc[:,2]
>df2
1 2 3 4
0 toronto maple-leafs Canada toronto_maple-leafs_Canada
1 boston bruins US boston_bruins_US
2 detroit red-wings NaN NaN
3 montreal NaN NaN NaN
However, you can see that in situations where a combine involves a cell that is NaN the end result is NaN as well. This is not what I want.
Column 4 should look like:
toronto_maple-leafs_Canada
boston_bruins_US
detroit_red-wings_US
montreal
Also is there an efficient way to do this type of operation as my real data set is quite large.
Upvotes: 0
Views: 1004
Reputation: 16154
You could use apply
like this :
In [1782]: df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
Out[1782]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
In [1783]: df[0] = df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
Surprisingly, applying str
seem to be taking longer than apply
:
In [1811]: %timeit df[0].apply(lambda v: '_'.join(v.split('_')[1:]))
10000 loops, best of 3: 127 µs per loop
In [1810]: %timeit df[0].str[4:]
1000 loops, best of 3: 179 µs per loop
In [1812]: %timeit df[0].str.split('_').str[1:].str.join('_')
1000 loops, best of 3: 553 µs per loop
In [1813]: %timeit df[0].str.split("_", 1).str[1]
1000 loops, best of 3: 374 µs per loop
Upvotes: 1
Reputation: 76987
If you're just looking to remove starting 'NHL_'
substring, you could just
In [84]: df[0].str[4:]
Out[84]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
However, if you need to split and join, you could use a string method
like-
In [85]: df[0].str.split('_').str[1:].str.join('_')
Out[85]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
Alternatively, you could also use apply
In [86]: df[0].apply(lambda x: '_'.join(x.split('_')[1:])) # Also, x.split('_', 1)[1]
Out[86]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
And, as @DSM pointed out - "split
accepts an argument for the maximum number of splits"
In [87]: df[0].str.split("_", 1).str[1]
Out[87]:
0 toronto_maple-leafs_Canada
1 boston_bruins_US
2 detroit_red-wings
3 montreal
Name: 0, dtype: object
Depending on the size of your data, you could benchmark these methods and use appropriate one.
Upvotes: 2