urawesome
urawesome

Reputation: 198

regex for python (version number - date format)

I have a file like this. The format of the version is <version> space(s) dash space(s) date. I want to create a dictionary with 4.11.1 - 2020-02-25 as key and everything after that before 3.25.0 - 2019-01-01 as value and so on till the end of the file.

##################
Some texts

4.11.1 - 2020-02-25
-------------------

*some text

** Some more text

3.25.0 - 2019-01-01
-------------------

*some text

** Some more text

This is what I tried:

result ={}
matches = re.findall(r'([\d.]+[^\n]+)\s*(.*?)(?=\s*[\d.]+[^\n]+|$)', Text, re.S)
for match in matches:
    result[match[0]] = match[1]
 print(result)

It works for most of the cases. But it also prints these as keys :

.com/sth/sth/sth/6)

1.8.2 (https://github.com/sth/sth/sth/5)

1.8.1.

20160918 (see commands under 'some text')

. text text tex

Upvotes: 0

Views: 105

Answers (1)

The fourth bird
The fourth bird

Reputation: 163277

You could use 2 capturing groups, and instead of using re.S use re.M

The pattern will capture in group 1 a version and space(s) dash space(s) using \d+(?:.\d+)+ +- + followed by a date like pattern \d{4}-\d{2}-\d{2}

Note that is does not validate a date itself. This page shows how you can make that date pattern more specific.

The capture group 2 matches all lines that do not start with 1+ digits, a dot and a digit. You can make that part more specific if you want.

^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*)

Regex demo

import re

result ={}
Text = ("##################\n"
    "Some texts\n\n"
    "4.11.1 - 2020-02-25\n"
    "-------------------\n\n"
    "*some text\n\n"
    "** Some more text\n\n"
    "3.25.0 - 2019-01-01\n"
    "-------------------\n\n"
    "*some text\n\n"
    "** Some more text")
matches = re.findall(r'^(\d+(?:\.\d+)+ +- +\d{4}-\d{2}-\d{2})\r?\n((?:(?!\d+\.\d).*(?:\r?\n|$))*)', Text, re.M)

for match in matches:
    result[match[0]] = match[1]
print(result)

Output

{'4.11.1 - 2020-02-25': '-------------------\n\n*some text\n\n** Some more text\n\n', '3.25.0 - 2019-01-01': '-------------------\n\n*some text\n\n** Some more text'}

Upvotes: 3

Related Questions