chippycentra
chippycentra

Reputation: 3432

Regex split by in pandas

hello I have a df such as

COL1 
NW_011625257.1_0
NW_011623521.1_1
NW_011623521.3_1
NW_011623521.4_1
NW_011623521.1
JZSA01007324.1_2
scaffold_1463_2
scaffold_1463

and I would like to cut by the last '_' and get

COL1              COL2
NW_011625257.1    0
NW_011623521.1    1
NW_011623521.3    1
NW_011623521.4    1
NW_011623521.1    NaN 
JZSA01007324.1    2
scaffold_1463     2
scaffold_1463     NaN

So far i tried :

df[['COL1','COL2']] = df.COL1.str.split(r'_(?!.*_)', expand=True)

instead I get this kind of output:

COL1             COL2
NW_011625257.1   0
NW_011623521.1   1
NW_011623521.3   1
NW_011623521.4   1
NW               011623521.1
JZSA01007324.1   2
scaffold_1463    2
scaffold         1463

here is an exemple of what I want to select

enter image description here

Upvotes: 2

Views: 64

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627536

You can use

df[['COL1','COL2']] = df.COL1.str.split(r"(?<=\d)_(?=\d+$)", expand=True)

See the regex demo

Pattern details:

  • (?<=\d) - there must a digit right before the current location
  • _ - an underscore
  • (?=\d+$) - there must be 1+ digits and end of string immediately to the right of the current location.

Pandas test:

df = pd.DataFrame({'COL1': ['NW_011625257.1_0','NW_011623521.1_1','NW_011623521.3_1','NW_011623521.4_1','NW_011623521.1','JZSA01007324.1_2','scaffold_1463_2','scaffold_1463']})
>>> df[['COL2','COL3']] = df.COL1.str.split(r"(?<=\d)_(?=\d+$)", expand=True)
>>> df
               COL1            COL2  COL3
0  NW_011625257.1_0  NW_011625257.1     0
1  NW_011623521.1_1  NW_011623521.1     1
2  NW_011623521.3_1  NW_011623521.3     1
3  NW_011623521.4_1  NW_011623521.4     1
4    NW_011623521.1  NW_011623521.1  None
5  JZSA01007324.1_2  JZSA01007324.1     2
6   scaffold_1463_2   scaffold_1463     2
7     scaffold_1463   scaffold_1463  None

Upvotes: 2

Related Questions