Reputation: 109
I am trying to split the following string in Python. Is it possible to produce the output below from the given input?
Input
Platforms: Linux Applies to versions: 10.0 Upgrades to: 10.0 Severity: 10 - High Impact/High Probability of Occurrence \Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability Abstract: SqlGuard Patch 10.0p4052 Sniffer Update
Output
Platforms: Linux
Applies to versions: 10.0
Upgrades to: 10.0
Severity: 10 - High Impact/High Probability of Occurrence
Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability
Abstract: SqlGuard Patch 10.0p4052 Sniffer Update
Upvotes: 1
Views: 111
Reputation: 41925
Since the other answers rely on a known list of fields, let's try a solution that doesn't know the fields a priori:
import re
string = r"Platforms: Linux Applies to versions: 10.0 Upgrades to: 10.0 Severity: 10 - High Impact/High Probability of Occurrence \Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability Abstract: SqlGuard Patch 10.0p4052 Sniffer Update"
# Split on field names such as "Applies to versions:"; the capturing group keeps them in the result.
iterable = iter(re.split(r"([A-Z][a-z ]+:)", string)[1:])
for field in iterable:
    print(field, next(iterable), sep='')
OUTPUT
> python3 test.py
Platforms: Linux
Applies to versions: 10.0
Upgrades to: 10.0
Severity: 10 - High Impact/High Probability of Occurrence \
Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability
Abstract: SqlGuard Patch 10.0p4052 Sniffer Update
>
Can you please explain the logic behind the regex?
We're doing an re.split(), but with a capturing group (the parentheses) so that whatever pattern we split on is kept in the result as well. Every field name has the same shape, e.g. "Applies to versions:", so one pattern covers them all:
( # retain split pattern match
[A-Z] # starts with a capital letter
[a-z ]+ # continues with lower case letters and spaces
: # a colon marks the end of the field name
)
The string begins with a pattern match, which causes re.split() to return an empty string ahead of the first item; the re.split(...)[1:] slice tosses that empty item. We now have a flat list that alternates field names and field bodies, which we walk in pairs using an iterator: the for loop takes a field name and next(iterable) takes the body that follows it.
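As a rough sketch of the same idea, the split-and-pair step can be wrapped in a function that returns a dict. The names FIELD and split_fields are made up for this illustration, the sample string is a shortened version of the input, and stripping the stray backslash from the field bodies is an assumption about what the asker wants:

```python
import re

# Same field-name pattern as above, spelled out with re.VERBOSE.
FIELD = re.compile(r"""
    (           # capturing group: the split keeps whatever matches here
    [A-Z]       # starts with a capital letter
    [a-z ]+     # continues with lower-case letters and spaces
    :           # a colon marks the end of the field name
    )
""", re.VERBOSE)

def split_fields(text):
    """Hypothetical helper: map each field name to its body."""
    parts = FIELD.split(text)
    # parts[0] is the (empty) text before the first field name;
    # after that the list alternates name, body, name, body, ...
    names = parts[1::2]
    bodies = parts[2::2]
    # Assumption: surrounding spaces and the stray backslash are unwanted.
    return {name.rstrip(":"): body.strip(" \\") for name, body in zip(names, bodies)}

sample = r"Platforms: Linux Applies to versions: 10.0 Severity: 10 - High \Categories: Data, Usability"
record = split_fields(sample)
for name, body in record.items():
    print(f"{name}: {body}")
```

Slicing with [1::2] and [2::2] does the same pairing as the iterator trick, just without consuming an iterator inside the loop.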
Upvotes: 2
Reputation: 363486
Since the fields are fixed, split on the fields instead of whitespace:
>>> fields = [
...     "Platforms: ",
...     "Applies to versions: ",
...     "Upgrades to: ",
...     "Severity: ",
...     "Categories: ",
...     "Abstract: ",
... ]
>>> import re
>>> # s is the input string from the question
>>> for k, v in zip(fields, re.split("|".join(fields), s)[1:]):
...     print(k + v)
...
Platforms: Linux
Applies to versions: 10.0
Upgrades to: 10.0
Severity: 10 - High Impact/High Probability of Occurrence
Categories: Availability, Compatibility, Data, Function, Performance, Security Vulnerability (Sec/Int), Serviceability, Usability
Abstract: SqlGuard Patch 10.0p4052 Sniffer Update
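The zip() above relies on every field appearing exactly once and in the listed order. A possible variation, sketched here with the made-up names FIELDS and split_known_fields, captures the matched field name so that missing or reordered fields still pair up correctly, and escapes the names in case one ever contains a regex metacharacter. Stripping the trailing backslash is an assumption about the desired output:

```python
import re

# Known field names, without the trailing ": " (hypothetical constant for this sketch).
FIELDS = [
    "Platforms",
    "Applies to versions",
    "Upgrades to",
    "Severity",
    "Categories",
    "Abstract",
]

def split_known_fields(text, fields=FIELDS):
    """Hypothetical helper: return (name, body) pairs in the order they occur."""
    # re.escape makes each name match literally; the group keeps the name in the result.
    pattern = "(" + "|".join(re.escape(f) for f in fields) + "): "
    parts = re.split(pattern, text)
    # parts[0] is the text before the first field; then name, body, name, body, ...
    return [(name, body.strip(" \\")) for name, body in zip(parts[1::2], parts[2::2])]

sample = r"Platforms: Linux Severity: 10 - High \Categories: Data Abstract: Patch 10.0p4052"
for name, body in split_known_fields(sample):
    print(f"{name}: {body}")
```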
Upvotes: 2