Reputation: 333
I have a string like the one below:
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)
I want to extract the value of "X" and the associated text from each line and put each pair into a list. Please see the expected output below:
Expected Output:
['X=250.44','DECEMBER 31,']
['X=307.5','respectively. The net decrease in the revenue']
['X=49.5','(US$ in millions)']
How can we approach this in Python?
My approach:
mylist = []
for line in data.split("\n"):
    if line.strip():
        x_coord = re.findall('^(X=.*)\,$', line)
        text = re.findall('^(]\w +)', line)
        mylist.append([x_coord, text])
My approach does not identify any value for x_coord and text.
Upvotes: 3
Views: 74
Reputation: 3550
Try this:
(X=[^,]*)(?:.*])(.*)
import re
source = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)""".split('\n')
pattern = r"(X=[^,]*)(?:.*])(.*)"
for line in source:
    print(re.search(pattern, line).groups())
Output:
('X=250.44', 'DECEMBER 31,')
('X=307.5', 'respectively. The net decrease in the revenue')
('X=49.5', '(US$ in millions)')
Your expected output has X= in front of every value, so I kept it inside the capture group. Feel free to move it into a non-capturing group if you only want the number.
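If the X= prefix is not wanted in the result, the idea above could be adapted roughly as follows. This is only a sketch: `extract_pairs` and `PATTERN` are hypothetical names, not part of the answer, and it assumes one record per line with the text following the last `]` on that line. It also skips lines that do not match, since `re.search` returns `None` for those.

```python
import re

# "X=" sits outside the first group, so only the number is captured.
# The greedy ".*\]" runs to the LAST "]" on the line (an assumption about the data).
PATTERN = re.compile(r"X=([^,]*).*\](.*)")

def extract_pairs(data):
    """Return a list of [x_value, text] lists, one per matching line."""
    pairs = []
    for line in data.split("\n"):
        m = PATTERN.search(line)
        if m is None:
            continue  # blank or unrelated line: nothing to extract
        pairs.append([m.group(1), m.group(2)])
    return pairs

sample = (
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,\n"
    "\n"
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)"
)
print(extract_pairs(sample))
```

Because the match before the text is greedy, a text that itself contains `]` would be cut at its last `]`; a non-greedy `.*?\]` would be the alternative in that case.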
Upvotes: 2
Reputation: 1691
re solution:
import re
input = [
"[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
"[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue",
"[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]
def extract(s):
    # Raw string so that \d, \. and \] reach the regex engine unchanged
    match = re.search(r"(X=\d+(?:\.\d*)?).*?\](.*?)$", s)
    return match.groups()
output = [extract(item) for item in input]
print(output)
Output:
[
('X=250.44', 'DECEMBER 31,'),
('X=307.5', 'respectively. The net decrease in the revenue'),
('X=49.5', '(US$ in millions)'),
]
Explanation:
\d ... digit
\d+ ... one or more digits
(?:...) ... non-capturing ("normal") parentheses
\.\d* ... dot followed by zero or more digits
(?:\.\d*)? ... optional (zero or one) "decimal part"
(X=\d+(?:\.\d*)?) ... first group, X=number
.*? ... zero or more of any character (non-greedy)
\] ... ] symbol
$ ... end of string
\](.*?)$ ... second group, anything between ] and end of string
Upvotes: 2
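The piece-by-piece explanation of the re solution can also live inside the pattern itself. The sketch below rewrites the same regex with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments; `PATTERN` and `extract` are just illustrative names.

```python
import re

# Same regex as in the answer, laid out one piece per line.
PATTERN = re.compile(r"""
    (X=\d+(?:\.\d*)?)   # group 1: "X=" plus digits and an optional decimal part
    .*?                 # as few characters as possible ...
    \]                  # ... up to the next closing bracket
    (.*?)$              # group 2: everything from there to the end of the string
""", re.VERBOSE)

def extract(s):
    match = PATTERN.search(s)
    # Return None instead of raising when a line has no X= coordinate
    return match.groups() if match else None

line = (
    "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] "
    "[(X=307.5,Y=240.48499) height=3.876 width=2.9970093]"
    "respectively. The net decrease in the revenue"
)
print(extract(line))
```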
Reputation: 47870
Using regex with named groups to capture the relevant bits:
>>> import re
>>> line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
>>> m = re.search(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$', line)
>>> m.groups()
('250.44', 'DECEMBER 31,')
>>> m['x_coord']
'250.44'
>>> m['text']
'DECEMBER 31,'
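To apply the named-group pattern to every line rather than a single one, something like the following might do. It is a sketch under the assumption that the input is already split into lines; `PATTERN`, `lines` and `rows` are illustrative names, and `groupdict()` is used so each result carries its field names.

```python
import re

# Named groups as in the session above; the redundant non-capturing
# groups are dropped, which does not change what is matched.
PATTERN = re.compile(r"\(X=(?P<x_coord>.*?),.*\](?P<text>.*)$")

lines = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

rows = []
for line in lines:
    m = PATTERN.search(line)
    if m:  # skip lines without a coordinate
        rows.append(m.groupdict())

for row in rows:
    print(row["x_coord"], "->", row["text"])
```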
Upvotes: 2