cnuvadga

Reputation: 109

How to split a string on selected whitespace in Python

I am trying to split the following string in Python. Is it possible to produce the output below from the corresponding input?

Input

Platforms: Linux Applies to versions: 10.0 Upgrades to: 10.0 Severity: 10 - High Impact/High Probability of Occurrence \Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability Abstract: SqlGuard Patch 10.0p4052 Sniffer Update

Output

Platforms: Linux
Applies to versions: 10.0
Upgrades to: 10.0
Severity: 10 - High Impact/High Probability of Occurrence 
Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability 
Abstract: SqlGuard Patch 10.0p4052 Sniffer Update

Upvotes: 1

Views: 111

Answers (2)

cdlane

Reputation: 41925

Since the other answers rely on a known list of fields, let's try a solution that doesn't know the fields a priori:

import re

string = r"Platforms: Linux Applies to versions: 10.0 Upgrades to: 10.0 Severity: 10 - High Impact/High Probability of Occurrence \Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability Abstract: SqlGuard Patch 10.0p4052 Sniffer Update"

# Field names look like "Applies to versions:"; [1:] drops the leading empty string.
iterable = iter(re.split(r"([A-Z][a-z ]+:)", string)[1:])

for field in iterable:
    print(field, next(iterable), sep='')

OUTPUT

> python3 test.py
Platforms: Linux 
Applies to versions: 10.0 
Upgrades to: 10.0 
Severity: 10 - High Impact/High Probability of Occurrence \
Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability 
Abstract: SqlGuard Patch 10.0p4052 Sniffer Update
>

Can you please explain the logic behind the regex?

We're calling re.split() with a capturing group (the parentheses), so the text matched by the split pattern is kept in the result instead of being discarded. Every field name has the same shape, e.g. "Applies to versions:"

(  # retain split pattern match
[A-Z]  # starts with a capital letter
[a-z ]+  # continues with lower case letters and spaces
:  # a colon marks the end of the field name
)
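
The annotated pattern above can be compiled almost as written by using re.VERBOSE, which ignores whitespace and comments outside character classes. This is only a small sketch; the name FIELD_NAME and the shortened input string are assumptions made for illustration.

```python
import re

# Hypothetical name; the pattern is the same one used in the answer,
# written out with re.VERBOSE so the comments live inside the regex.
FIELD_NAME = re.compile(
    r"""
    (           # capturing group: keep the matched field name in the result
    [A-Z]       # starts with a capital letter
    [a-z ]+     # continues with lower case letters and spaces
    :           # a colon marks the end of the field name
    )
    """,
    re.VERBOSE,
)

# Shortened input, for illustration only.
parts = FIELD_NAME.split("Platforms: Linux Applies to versions: 10.0")
print(parts)
```

The space inside [a-z ] is still significant, because re.VERBOSE keeps whitespace that appears inside a character class.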

Because the string begins with a pattern match, re.split() returns an empty string ahead of the first item, hence the re.split(...)[1:] to discard it. That leaves a list that alternates between field names and field bodies, which we walk in pairs using an iterator: the for loop takes a field name, and next(iterable) takes the body that follows it.
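
As a rough sketch of the pairing step, the alternating list can also be sliced into names and bodies and collected into a dict. The helper name parse_fields, the shortened sample string, and the choice to strip surrounding spaces and the stray backslash are all assumptions for illustration, not part of the answer above.

```python
import re


def parse_fields(text):
    """Hypothetical helper: split 'Name: value Name: value ...' into a dict."""
    # Same pattern as above; [1:] drops the empty string before the first match.
    parts = re.split(r"([A-Z][a-z ]+:)", text)[1:]
    names = parts[0::2]   # even positions hold the field names
    bodies = parts[1::2]  # odd positions hold the field bodies
    # Assumption: surrounding spaces and the stray backslash are not wanted.
    return {
        name.rstrip(":"): body.strip(" \\")
        for name, body in zip(names, bodies)
    }


# Shortened sample, for illustration only.
sample = (
    r"Platforms: Linux Applies to versions: 10.0 "
    r"Severity: 10 - High Impact \Categories: Data, Usability"
)
result = parse_fields(sample)
for name, body in result.items():
    print(f"{name}: {body}")
```

Like the original pattern, this depends on no field body containing text that itself looks like a capitalized name followed by a colon.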

Upvotes: 2

wim

Reputation: 363486

Since the fields are fixed, split on the field names instead of on whitespace (s holds the input string):

>>> fields = [
...     "Platforms: ",
...     "Applies to versions: ",
...     "Upgrades to: ",
...     "Severity: ",
...     "Categories: ",
...     "Abstract: ",
... ]
>>> import re
>>> for k,v in zip(fields, re.split("|".join(fields), s)[1:]):
...     print(k + v)
...
Platforms: Linux
Applies to versions: 10.0
Upgrades to: 10.0
Severity: 10 - High Impact/High Probability of Occurrence
Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability
Abstract: SqlGuard Patch 10.0p4052 Sniffer Update
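
A possible variation on the same idea, sketched under a few assumptions: the field names are passed through re.escape in case one ever contains a regex metacharacter, a capturing group is used so the names come from the text itself rather than from the order of the list, and trailing spaces and the stray backslash are trimmed. The function name split_on_fields and the shortened input are hypothetical.

```python
import re

fields = [
    "Platforms: ",
    "Applies to versions: ",
    "Upgrades to: ",
    "Severity: ",
    "Categories: ",
    "Abstract: ",
]


def split_on_fields(s, fields):
    """Hypothetical helper: return one 'Name: value' line per known field found in s."""
    # Escape each name so it is matched literally, and capture it so
    # re.split() keeps the name alongside the value that follows it.
    pattern = "(" + "|".join(re.escape(f) for f in fields) + ")"
    parts = re.split(pattern, s)[1:]  # drop the empty string before the first name
    # Assumption: trailing spaces and a trailing backslash are not wanted.
    return [
        name + value.rstrip(" \\")
        for name, value in zip(parts[0::2], parts[1::2])
    ]


# Shortened input, for illustration only.
s = (
    r"Platforms: Linux Applies to versions: 10.0 Upgrades to: 10.0 "
    r"Severity: 10 - High \Categories: Data, Usability Abstract: Sniffer Update"
)
lines = split_on_fields(s, fields)
print("\n".join(lines))
```

Because the names are captured from the text, the result does not depend on the fields appearing in the same order as the list.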

Upvotes: 2
