Filter strings composed of two-digit values in Python

Question

Given a dataset as follows:

   id            vector_name
0   1            01,02,03,04
1   2            001,002,003
2   3               01,02,03
3   4                A, B, C
4   5          s01, s02, s02
5   6  E2702-2703,E2702-2703
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

How can I write a regex to flag the rows in column vector_name that are not composed of two-digit values? The correct format is 01, 02, 03, etc. For any other format, the check column should return invalid vector name.

The expected result will be like this:

   id            vector_name
0   1            01,02,03,04
1   2    invalid vector name
2   3               01,02,03
3   4    invalid vector name
4   5    invalid vector name
5   6    invalid vector name
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

The pattern I used, (\d+)(,\s*\d+)*, considers 001,002,003 valid.
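
A minimal sketch of why that pattern is too permissive, using Python's re module (the variable names here are illustrative, not from the question): \d+ accepts any number of digits, and without anchors a search only needs to match somewhere inside the string.

```python
import re

# The pattern from the question, compiled as-is.
loose = re.compile(r'(\d+)(,\s*\d+)*')

# \d+ matches one or more digits, so three-digit values are accepted whole.
matched_text = loose.search('001,002,003').group(0)

# Without ^ and \Z, a search also succeeds on a fragment of an invalid value.
partial = loose.search('s01, s02, s02').group(0)

print(repr(matched_text))  # '001,002,003'
print(repr(partial))       # '01'
```

Both problems go away once the digit count is fixed with \d{2} and the pattern is anchored to the whole string.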

How could I do that? Thanks.

Wiktor Stribiżew · Accepted Answer

You can use

^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z

See the regex demo. Details:

^ - start of string
\d{2} - two digits
(?:-\d{2})? - an optional sequence of - and two digits
(?:,\s*\d{2}(?:-\d{2})?)* - zero or more repetitions of
- , - a comma
- \s* - 0 or more whitespaces
- \d{2}(?:-\d{2})? - two digits and an optional sequence of - and two digits
\Z - the very end of the string.
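
As a quick check of the breakdown above, here is a small sketch that runs the pattern against sample values from the question with the standard re module. The names samples and results are only for illustration.

```python
import re

pattern = re.compile(r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z')

samples = [
    '01,02,03,04',
    '001,002,003',
    'A, B, C',
    's01, s02, s02',
    'E2702-2703,E2702-2703',
    '05-08,09,10-12, 05-08',
]

# True when the whole string is a comma-separated list of two-digit
# values or two-digit ranges such as 05-08.
results = {s: bool(pattern.search(s)) for s in samples}

for s, ok in results.items():
    print(f'{s!r}: {"valid" if ok else "invalid"}')
```

Only 01,02,03,04 and 05-08,09,10-12, 05-08 are reported as valid; every other sample fails either the two-digit limit or the anchors.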

Python Pandas test:

import pandas as pd
df = pd.DataFrame({
  'id':[1,2,3,4,5,6,7,8],
  'vector_name':
    [
      '01,02,03,04',
      '1002003',
      '01,02,03',
      'A, B, C',
      's01, s02, s02',
      'E2702-2703,E2702-2703',
      '03,05,06',
      '05-08,09,10-12, 05-08'
    ]
})
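# Anchored pattern: the whole string must be two-digit values or ranges separated by commas
pattern = r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z'
# Rows that do not match get a message in a new "check" column; matching rows stay NaN
df.loc[~df['vector_name'].str.contains(pattern), "check"] = "invalid vector name"
>>> df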
   id            vector_name                check
0   1            01,02,03,04                  NaN
1   2                1002003  invalid vector name
2   3               01,02,03                  NaN
3   4                A, B, C  invalid vector name
4   5          s01, s02, s02  invalid vector name
5   6  E2702-2703,E2702-2703  invalid vector name
6   7               03,05,06                  NaN
7   8  05-08,09,10-12, 05-08                  NaN
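
The expected result in the question overwrites vector_name itself rather than adding a check column. One possible way to get that shape, sketched here under the assumption that Series.str.fullmatch is available (pandas 1.1 or later) and using a shortened sample frame, is to combine the match mask with Series.where:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'vector_name': [
        '01,02,03,04',
        '001,002,003',
        'A, B, C',
        '05-08,09,10-12, 05-08',
    ],
})

# fullmatch requires the entire string to match, so ^ and \Z are not needed here.
pattern = r'\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*'
valid = df['vector_name'].str.fullmatch(pattern)

# Keep valid names as they are; replace everything else with the message.
df['vector_name'] = df['vector_name'].where(valid, 'invalid vector name')
print(df)
```

Series.where keeps the values where the mask is True and substitutes the second argument elsewhere, which avoids creating a separate column.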
