user308827

Reputation: 21961

Extracting year from string in python

How can I parse the following string in Python to extract the year?

'years since 1250-01-01 0:0:0'

The answer should be 1250

Upvotes: 7

Views: 22283

Answers (3)

idjaw

Reputation: 26580

You can use a regex with a capture group around the four digits, and require the rest of the date pattern after it so that you don't match some unrelated four-digit number. I would look for:

  • four digits, captured: (\d{4})

  • a hyphen: -

  • two digits: \d{2}

  • a hyphen: -

  • two digits: \d{2}

Giving: (\d{4})-\d{2}-\d{2}

Demo (note the raw string, so the backslashes reach the regex engine unchanged):

>>> import re
>>> d = re.findall(r'(\d{4})-\d{2}-\d{2}', 'years since 1250-01-01 0:0:0')
>>> d
['1250']
>>> d[0]
'1250'

If you need it as an int, just convert it:

>>> int(d[0])
1250
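
The demo above assumes the string always contains a date. As a minimal sketch of how the same pattern could be wrapped so that a missing date does not raise, here is a small helper. The names `DATE_RE` and `extract_year` are hypothetical and chosen only for illustration:

```python
import re

# Same pattern as above: capture four digits, then require -MM-DD after them.
DATE_RE = re.compile(r'(\d{4})-\d{2}-\d{2}')

def extract_year(text):
    """Return the first year found in text as an int, or None if no date matches."""
    match = DATE_RE.search(text)
    return int(match.group(1)) if match else None

print(extract_year('years since 1250-01-01 0:0:0'))
print(extract_year('no date in this string'))
```

Returning None rather than indexing into an empty list is one possible design choice; raising a ValueError would be equally reasonable if a missing date should be treated as an error.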

Upvotes: 5

alecxe

Reputation: 473803

There are several ways to do it. Here are a few options:

  • the dateutil parser in "fuzzy" mode:

    In [1]: s = 'years since 1250-01-01 0:0:0'
    
    In [2]: from dateutil.parser import parse
    
    In [3]: parse(s, fuzzy=True).year  # resulting year would be an integer
    Out[3]: 1250
    
  • regular expressions with a capturing group:

    In [2]: import re
    
    In [3]: re.search(r"years since (\d{4})", s).group(1)
    Out[3]: '1250'
    
  • splitting by "since" and then by a dash:

    In [2]: s.split("since", 1)[1].split("-", 1)[0].strip()
    Out[2]: '1250'
    
  • or maybe even splitting by the first dash and slicing the first substring:

    In [2]: s.split("-", 1)[0][-4:]
    Out[2]: '1250'
    

The last two involve more "moving parts" and may not be applicable if the input string can vary, for example if "since" is missing or the year does not have exactly four digits.
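
To make that caveat concrete, here is a rough sketch that wraps the two split-based approaches in functions and feeds them slightly different inputs. The function names and the varied input strings are made up for illustration and are not from the question:

```python
def year_by_since(s):
    # Split on "since", then take everything before the first dash.
    return s.split("since", 1)[1].split("-", 1)[0].strip()

def year_by_slice(s):
    # Take the last four characters before the first dash.
    return s.split("-", 1)[0][-4:]

original = 'years since 1250-01-01 0:0:0'
print(year_by_since(original), year_by_slice(original))

# A three-digit year: the slice picks up the preceding space as well.
print(repr(year_by_slice('years since 850-01-01 0:0:0')))

# No "since" in the string: the first split has no second element.
try:
    year_by_since('years after 1250-01-01 0:0:0')
except IndexError as exc:
    print('IndexError:', exc)
```

Both functions agree on the original string, but each fails on a different variation, which is the trade-off to weigh against the regex and dateutil options.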

Upvotes: 24

Tim Biegeleisen

Reputation: 520948

The following regex should make the four-digit year available as the first capture group:

^.*(\d{4})-\d{2}-\d{2}.*$
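
A minimal sketch of how a pattern of this shape might be applied in Python, assuming the year group is written as (\d{4}) and the input is the string from the question:

```python
import re

s = 'years since 1250-01-01 0:0:0'

# The pattern is anchored at both ends, so it must match the whole string.
m = re.match(r'^.*(\d{4})-\d{2}-\d{2}.*$', s)

# group(1) is the captured year; fall back to None if nothing matched.
year = int(m.group(1)) if m else None
print(year)
```

Because the pattern already spans the whole string, re.match and re.search behave the same way here.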

Upvotes: 2
