Reputation: 61
I am looking to extract a list of tuples from the following string:
text='''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''
I am using 'import re' for the process.
The output should be something like: [('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]
I want to use re.findall to produce the above output. So far I have this:
re.findall(r"(:\Z)\s+(%\Z+)(\Ain )", text)
Where I am identifying the characters prior to ':', then the characters prior to '%' and then the characters after 'in'.
I'm really just clueless on how to continue. Any help would be appreciated. Thanks!
Upvotes: 4
Views: 1136
Reputation: 754
Regex is not a good way to approach this. It gets hard to read and maintain very fast. It can be done much more cleanly using Python's string methods:
list_of_lines = [
    line.strip()  # remove trailing and leading whitespace
    for line in text.split("\n")  # split up the text into lines
    if line  # filter out the empty lines
]
list_of_lines
is now:
['Consumer Price Index:', '+0.2% in Sep 2020', 'Unemployment Rate:', '+7.9% in Sep 2020', 'Producer Price Index:', '+0.4% in Sep 2020', 'Employment Cost Index:', '+0.5% in 2nd Qtr of 2020', 'Productivity:', '+10.1% in 2nd Qtr of 2020', 'Import Price Index:', '+0.3% in Sep 2020', 'Export Price Index:', '+0.6% in Sep 2020']
Now all we have to do is build tuples from pairs of elements of this list.
def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)
(from here)
Now we can get our desired output:
print(list(pairwise(list_of_lines)))  # zip() is lazy in Python 3, so materialize it first
[('Consumer Price Index:', '+0.2% in Sep 2020'), ('Unemployment Rate:', '+7.9% in Sep 2020'), ('Producer Price Index:', '+0.4% in Sep 2020'), ('Employment Cost Index:', '+0.5% in 2nd Qtr of 2020'), ('Productivity:', '+10.1% in 2nd Qtr of 2020'), ('Import Price Index:', '+0.3% in Sep 2020'), ('Export Price Index:', '+0.6% in Sep 2020')]
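The pairs above still carry the trailing colon and keep the value and the period in one string, whereas the question asks for three-element tuples. A minimal sketch of one way to finish the job with string methods only, assuming every value line has the form "<value> in <period>" (the names lines and rows are just illustrative):

```python
text = '''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''


def pairwise(iterable):
    "s -> (s0, s1), (s2, s3), (s4, s5), ..."
    a = iter(iterable)
    return zip(a, a)


# Strip each line and drop the ones that are empty after stripping.
lines = [line.strip() for line in text.split("\n") if line.strip()]

rows = []
for name, value in pairwise(lines):
    # Assumption: the first " in " separates the percentage from the period.
    percent, _, period = value.partition(" in ")
    rows.append((name.rstrip(":"), percent, period))

print(rows)
```

str.partition splits only at the first " in ", so a period such as "2nd Qtr of 2020" stays in one piece.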
Upvotes: 1
Reputation: 626896
You can use
re.findall(r'(\S.*):\n\s*(\+?\d[\d.]*%)\s+in\s+(.*)', text)
# => [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020'), ('Producer Price Index', '+0.4%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020'), ('Productivity', '+10.1%', '2nd Qtr of 2020'), ('Import Price Index', '+0.3%', 'Sep 2020'), ('Export Price Index', '+0.6%', 'Sep 2020')]
See the regex demo and the Python demo.
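For anyone who wants to run it locally, here is a self-contained sketch of the same pattern written with re.VERBOSE so each piece can carry a comment. The variable names pattern and rows are illustrative, and the pattern assumes values are unsigned or prefixed with "+" (a negative value would need something like [+-]? instead of \+?):

```python
import re

text = '''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''

pattern = re.compile(r"""
    (\S.*):          # group 1: label text, up to the colon ending the line
    \n\s*            # the line break plus any leading whitespace
    (\+?\d[\d.]*%)   # group 2: optional plus sign, digits/dots, percent sign
    \s+in\s+         # the word "in" surrounded by whitespace
    (.*)             # group 3: the rest of the line (the period)
""", re.VERBOSE)

# With three capturing groups, findall returns a list of 3-tuples.
rows = pattern.findall(text)
print(rows)
```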
Details
(\S.*) - Group 1: a non-whitespace char followed with any zero or more chars other than line break chars as many as possible
: - a colon
\n - a newline
\s* - 0 or more whitespaces
(\+?\d[\d.]*%) - Group 2: optional +, a digit, zero or more digits/dots, and a %
\s+in\s+ - in enclosed with 1+ whitespaces
(.*) - Group 3: any zero or more chars other than line break chars as many as possible
Upvotes: 5