ah bon

Reputation: 10011

Filter strings composed of two-digit values in Python

Given a dataset as follows:

   id            vector_name
0   1            01,02,03,04
1   2            001,002,003
2   3               01,02,03
3   4                A, B, C
4   5          s01, s02, s02
5   6  E2702-2703,E2702-2703
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

How can I write a regex to filter out the rows in column vector_name whose strings are not composed of two-digit values? The correct format is 01, 02, 03, etc. Otherwise, return invalid vector name for the check column.

The expected result will be like this:

   id            vector_name
0   1            01,02,03,04
1   2    invalid vector name
2   3               01,02,03
3   4    invalid vector name
4   5    invalid vector name
5   6    invalid vector name
6   7               03,05,06
7   8  05-08,09,10-12, 05-08

The pattern I used, (\d+)(,\s*\d+)*, considers 001,002,003 valid.

How could I do that? Thanks.

Upvotes: 0

Views: 312

Answers (1)

Wiktor Stribiżew

Reputation: 626748

You can use

^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z

Details:

  • ^ - start of string
  • \d{2} - two digits
  • (?:-\d{2})? - an optional sequence of - and two digits
  • (?:,\s*\d{2}(?:-\d{2})?)* - zero or more repetitions of
    • , - a comma
    • \s* - zero or more whitespace characters
    • \d{2}(?:-\d{2})? - two digits and an optional sequence of - and two digits
  • \Z - the very end of the string.
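The breakdown above can be checked with the standard re module alone. This is a minimal sketch: the helper name is_valid_vector_name and the samples list are illustrative assumptions, with the sample strings taken from the question's dataset.

```python
import re

# Pattern from the answer: two-digit values (or two-digit ranges such as 05-08),
# separated by commas with optional whitespace after each comma.
PATTERN = re.compile(r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z')


def is_valid_vector_name(value: str) -> bool:
    """Hypothetical helper: True only if the whole string matches PATTERN."""
    return PATTERN.search(value) is not None


# Sample strings copied from the question's dataset.
samples = [
    '01,02,03,04',
    '001,002,003',
    '01,02,03',
    'A, B, C',
    's01, s02, s02',
    'E2702-2703,E2702-2703',
    '03,05,06',
    '05-08,09,10-12, 05-08',
]

for s in samples:
    print(f'{s!r}: {is_valid_vector_name(s)}')
```

Because the pattern ends with \Z rather than $, a string with a trailing newline such as '01,02\n' is rejected as well.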

Python Pandas test:

import pandas as pd
df = pd.DataFrame({
  'id':[1,2,3,4,5,6,7,8],
  'vector_name':
    [
      '01,02,03,04',
      '001,002,003',
      '01,02,03',
      'A, B, C',
      's01, s02, s02',
      'E2702-2703,E2702-2703',
      '03,05,06',
      '05-08,09,10-12, 05-08'
    ]
})
# Whole string must be comma-separated two-digit values or two-digit ranges
pattern = r'^\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*\Z'
# Flag rows that do not match; matching rows keep NaN in "check"
df.loc[~df['vector_name'].str.contains(pattern), "check"] = "invalid vector name"
>>> df
   id            vector_name                check
0   1            01,02,03,04                  NaN
1   2            001,002,003  invalid vector name
2   3               01,02,03                  NaN
3   4                A, B, C  invalid vector name
4   5          s01, s02, s02  invalid vector name
5   6  E2702-2703,E2702-2703  invalid vector name
6   7               03,05,06                  NaN
7   8  05-08,09,10-12, 05-08                  NaN
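The expected result in the question overwrites vector_name itself instead of adding a check column. A possible variant, sketched here under the assumption that pandas 1.1 or later is available (for Series.str.fullmatch), uses Series.where to keep valid values and replace the rest. The variable name is_valid is illustrative, and the anchors are dropped because fullmatch already requires the whole string to match.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'vector_name': [
        '01,02,03,04',
        '001,002,003',
        '01,02,03',
        'A, B, C',
        's01, s02, s02',
        'E2702-2703,E2702-2703',
        '03,05,06',
        '05-08,09,10-12, 05-08',
    ],
})

# Same pattern as above, without ^ and \Z: fullmatch anchors both ends itself.
pattern = r'\d{2}(?:-\d{2})?(?:,\s*\d{2}(?:-\d{2})?)*'

# Boolean Series: True where the entire string matches the pattern.
is_valid = df['vector_name'].str.fullmatch(pattern)

# Keep valid values, replace everything else with the marker text.
df['vector_name'] = df['vector_name'].where(is_valid, 'invalid vector name')

print(df)
```

Whether to overwrite the column or add a separate check column depends on whether the original values are still needed for later inspection.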

Upvotes: 1
