Reputation: 10011
Given a dataset as follows:
   id            vector_name
0   1            01,02,03,04
1   2            001,002,003
2   3               01,02,03
3   4                A, B, C
4   5          s01, s02, s02
5   6  E2702-2703,E2702-2703
6   7               03,05,06
7   8  05-08,09,10-12, 05-08
How could I write a regex to flag the rows in column vector_name that are not composed of two-digit values? The correct format is 01, 02, 03, etc. Otherwise, return invalid vector name in the check column.
The expected result will be like this:
   id            vector_name
0   1            01,02,03,04
1   2    invalid vector name
2   3               01,02,03
3   4    invalid vector name
4   5    invalid vector name
5   6    invalid vector name
6   7               03,05,06
7   8  05-08,09,10-12, 05-08
The pattern I used, (\d+)(,\s*\d+)*, considers 001,002,003 valid.
How could I do that? Thanks.
Upvotes: 0
Views: 312
Reputation: 626748
You can use
^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z
See the regex demo. Details:
^ - start of string
\d{2} - two digits
(?:-\d{2})? - an optional sequence of - and two digits
(?:,\s*\d{2}(?:-\d{2})?)* - zero or more repetitions of
    , - a comma
    \s* - 0 or more whitespaces
    \d{2}(?:-\d{2})? - two digits and an optional sequence of - and two digits
\Z - the very end of the string.
Python Pandas test:
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'vector_name': [
        '01,02,03,04',
        '001,002,003',
        '01,02,03',
        'A, B, C',
        's01, s02, s02',
        'E2702-2703,E2702-2703',
        '03,05,06',
        '05-08,09,10-12, 05-08'
    ]
})

# Two-digit values or two-digit ranges, separated by commas and optional whitespace
pattern = r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z'

# Flag every row whose vector_name does not match the pattern
df.loc[~df['vector_name'].str.contains(pattern), "check"] = "invalid vector name"

>>> df
   id            vector_name                check
0   1            01,02,03,04                  NaN
1   2            001,002,003  invalid vector name
2   3               01,02,03                  NaN
3   4                A, B, C  invalid vector name
4   5          s01, s02, s02  invalid vector name
5   6  E2702-2703,E2702-2703  invalid vector name
6   7               03,05,06                  NaN
7   8  05-08,09,10-12, 05-08                  NaN
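If you want to sanity-check the pattern without pandas, here is a minimal sketch using only the standard re module. The names STRICT, LOOSE, samples and check are made up for this illustration, and re.fullmatch is used in place of the ^ and \Z anchors, which should be equivalent here. It also compares against the pattern from the question: \d+ accepts any number of digits, which is probably why 001,002,003 slipped through.

```python
import re

# Pattern from the answer, without ^ and \Z because fullmatch anchors both ends.
STRICT = re.compile(r'\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*')
# Pattern from the question, kept only for comparison.
LOOSE = re.compile(r'(\d+)(,\s*\d+)*')

# Values of the vector_name column from the question.
samples = [
    '01,02,03,04',
    '001,002,003',
    '01,02,03',
    'A, B, C',
    's01, s02, s02',
    'E2702-2703,E2702-2703',
    '03,05,06',
    '05-08,09,10-12, 05-08',
]


def check(value, pattern=STRICT):
    """Return None for a valid value, otherwise the 'invalid vector name' label."""
    return None if pattern.fullmatch(value) else 'invalid vector name'


# Map each sample to its label and print the result row by row.
results = {value: check(value) for value in samples}
for value, label in results.items():
    print(f'{value!r:28} {label}')
```

With this, check('001,002,003') returns the invalid label, while check('001,002,003', LOOSE) returns None because \d+ happily consumes three digits.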
Upvotes: 1