RJ_Singh

Reputation: 115

Split multiple lines by regex

I'm trying to split the multiple lines of a segment from a TTL (Turtle) document. Here's the relevant code:

entry_obj = str(Entry(*re.findall(r'([;\s]+[^\s+|\s+$])', ''.join(buf))))
yield process_entry_obj(entry_obj)

The code raises an error because it fails to split the string correctly: the number of matched arguments is different for every entry, so the code doesn't run.

Below is my file format:

 File input

 ##  http://www.example.com/abc#AAA
                pms:ecCreatedBy rms:type ;
                rmfs:lag "Ersteller"@newyork ,
                "AAA"@wdc .

There are multiple entries like above in the file.

Upvotes: 0

Views: 491

Answers (2)

Wiktor Stribiżew

Reputation: 626690

You may use

import re

s = "" # File contents
with open(filepath, 'r') as fr:
    s = fr.read()
s = re.sub(r'(?m)(rmfs:label\s*)("[^"]*"@(?!en)\w*)(\s*,\s*)("[^"]*"@en) \.$', r'\1\4\3\2 .', s)
s = re.sub(r'(?m)^(\s*###\s*http.*/v\d+#)\w*((?:\n(?!\n).*)*rmfs:label\s*")([^"]*)("@en)', r'\1\3\2\3\4', s)
# Write to file:
with open(filepath, 'w') as fw:
    fw.write(s)

See the Python demo.

Here are the Regex 1 and Regex 2 demos.

Regex 1 details

  • (?m) - multiline mode, $ will match end of a line
  • (rmfs:label\s*) - Group 1 (\1): rmfs:label and then 0+ whitespaces
  • ("[^"]*"@(?!en)\w*) - Group 2 (\2): ", 0+ non-" chars, "@, a negative lookahead ensuring there is no en immediately to the right of the current position, and then 0+ word chars
  • (\s*,\s*) - Group 3 (\3): a , enclosed with 0+ whitespaces
  • ("[^"]*"@en) - Group 4 (\4): ", 0+ chars other than ", " and @en
  •  \.$ - a space, a literal ., and the end of a line.
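
As a quick check of Regex 1, here is a minimal sketch that runs it on a made-up two-label snippet. The sample string and its @de language tag are assumptions for illustration and are not taken from the asker's file.

```python
import re

# Hypothetical input: the non-English label comes first, the @en label second.
sample = 'rmfs:label "Ersteller"@de ,\n    "Creator"@en .'

# Regex 1 from the answer: swap Group 2 (non-en label) with Group 4 (en label).
regex1 = r'(?m)(rmfs:label\s*)("[^"]*"@(?!en)\w*)(\s*,\s*)("[^"]*"@en) \.$'
swapped = re.sub(regex1, r'\1\4\3\2 .', sample)

print(swapped)
```

With this input, the @en label moves to the front and the separator (comma plus line break) stays where it was.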

Regex 2 details

  • (?m) - multiline mode, ^ matches the start of a line
  • ^ - start of a line
  • (\s*###\s*http.*/v\d+#) - Group 1: 0+ whitespaces, ###, 0+ whitespaces, http, any 0+ chars, /v, 1+ digits and #
  • \w* - 0+ word chars
  • ((?:\n(?!\n).*)*rmfs:label\s*") - Group 2: any number of lines before a double line break ((?:\n(?!\n).*)*) and then rmfs:label, 0+ whitespaces and "
  • ([^"]*) - Group 3: any 0+ chars other than "
  • ("@en) - Group 4: "@en substring.
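
Regex 2 can be tried the same way. The sketch below assumes a hypothetical entry whose header starts with ### and whose URL contains a /v1# version part, since that is what the pattern expects; the entry text is invented for illustration.

```python
import re

# Hypothetical entry: header URL ends in /v1#AAA, @en label is "Creator".
entry = (
    '###  http://www.example.com/v1#AAA\n'
    '    pms:ecCreatedBy rms:type ;\n'
    '    rmfs:label "Creator"@en ,\n'
    '    "Ersteller"@de .'
)

# Regex 2 from the answer: replace the word after # in the header
# with the text of the @en label (Group 3), keeping everything else.
regex2 = r'(?m)^(\s*###\s*http.*/v\d+#)\w*((?:\n(?!\n).*)*rmfs:label\s*")([^"]*)("@en)'
renamed = re.sub(regex2, r'\1\3\2\3\4', entry)

print(renamed)
```

Only the header line changes here: AAA after the # is replaced by the @en label text, and the remaining lines of the entry are left untouched.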

Upvotes: 1

Michał Turczyn

Reputation: 37337

From what I understand you need \s*;\s*

Explanation:

\s* - match whitespace character zero or more times

; - match ; literally

Demo
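
A minimal sketch of how this pattern could be used with re.split, assuming the segment body from the question has already been read into a string (the variable names are illustrative):

```python
import re

# Body of the entry from the question, without the "##" header line.
segment = (
    'pms:ecCreatedBy rms:type ;\n'
    '    rmfs:lag "Ersteller"@newyork ,\n'
    '    "AAA"@wdc .'
)

# Split on each semicolon together with any whitespace around it.
parts = re.split(r'\s*;\s*', segment)

for part in parts:
    print(repr(part))
```

Note that this splits on every semicolon, so it would also break up a quoted literal that happens to contain one.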

Upvotes: 1
