Reputation: 3432
hello I have a df such as
COL1
NW_011625257.1_0
NW_011623521.1_1
NW_011623521.3_1
NW_011623521.4_1
NW_011623521.1
JZSA01007324.1_2
scaffold_1463_2
scaffold_1463
and I would like to cut by the last '_'
and get
COL1 COL2
NW_011625257.1 0
NW_011623521.1 1
NW_011623521.3 1
NW_011623521.4 1
NW_011623521.1 NaN
JZSA01007324.1 2
scaffold_1463 2
scaffold_1463 NaN
So far i tried :
df[['COL1','COL2']] = df.COL1.str.split(r'_(?!.*_)', expand=True)
instead I get this kind of output:
COL1 COL2
NW_011625257.1 0
NW_011623521.1 1
NW_011623521.3 1
NW_011623521.4 1
NW 011623521.1
JZSA01007324.1 2
scaffold_1463 2
scaffold 1463
here is an exemple of what I want to select
Upvotes: 2
Views: 64
Reputation: 627536
You can use
df[['COL1','COL2']] = df.COL1.str.split(r"(?<=\d)_(?=\d+$)", expand=True)
See the regex demo
Pattern details:
(?<=\d)
- there must a digit right before the current location_
- an underscore(?=\d+$)
- there must be 1+ digits and end of string immediately to the right of the current location.Pandas test:
df = pd.DataFrame({'COL1': ['NW_011625257.1_0','NW_011623521.1_1','NW_011623521.3_1','NW_011623521.4_1','NW_011623521.1','JZSA01007324.1_2','scaffold_1463_2','scaffold_1463']})
>>> df[['COL2','COL3']] = df.COL1.str.split(r"(?<=\d)_(?=\d+$)", expand=True)
>>> df
COL1 COL2 COL3
0 NW_011625257.1_0 NW_011625257.1 0
1 NW_011623521.1_1 NW_011623521.1 1
2 NW_011623521.3_1 NW_011623521.3 1
3 NW_011623521.4_1 NW_011623521.4 1
4 NW_011623521.1 NW_011623521.1 None
5 JZSA01007324.1_2 JZSA01007324.1 2
6 scaffold_1463_2 scaffold_1463 2
7 scaffold_1463 scaffold_1463 None
Upvotes: 2