Economist_Ayahuasca

Reputation: 1642

Regular expressions from a list previously specified

I am trying to do the following: for each article, print only the month, which appears on either the 4th or the 5th line. My attempt so far:

m = 'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'

for i in range(len(sections)):
    date = re.search(r"[m]", sections[i][1:5])
    print(date)

First problem: I do not know how to build a regular expression that searches for any of the strings in my list "m". Second problem: I want to restrict the search to lines 0-5 of each article.

Upvotes: 0

Views: 51

Answers (2)

dawg

Reputation: 104102

Given:

>>> txt='''\
... Line 1
... Line 2
... Line 3
... Line 4
... Line 5 April'''

You can get lines i through j-1 (zero-based) with .splitlines()[i:j]:

>>> txt.splitlines()[0:3]
['Line 1', 'Line 2', 'Line 3']

Now just construct a pattern that finds the months. Be sure to use \b to find whole word matches:

>>> months=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
>>> pat=re.compile("|".join([r"\b{}\b".format(m) for m in months]), re.M)
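To illustrate why the \b anchors matter, here is a minimal sketch comparing a pattern without them to one with them. The names `loose` and `strict` and the sample sentence are illustrative assumptions, not part of the answer above; the grouped form `\b(?:...)\b` is just a more compact way to put a word boundary around every month.

```python
import re

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Without word boundaries, 'May' also matches inside 'Mayor'.
loose = re.compile("|".join(months))

# One non-capturing group lets a single pair of \b cover every month name.
strict = re.compile(r"\b(?:{})\b".format("|".join(months)))

text = "The Mayor spoke in March"  # hypothetical sample sentence
loose_hit = loose.search(text).group()    # 'May' (a false positive)
strict_hit = strict.search(text).group()  # 'March'
```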

Then search with your pattern in the slice of target lines:

>>> pat.search("\n".join(txt.splitlines()[0:5]))
<_sre.SRE_Match object at 0x107a2a9f0>

If you want to capture the line it appears on, you might do something like THIS
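The linked example is not reproduced here. One possible approach, as a minimal sketch, is to search line by line and return the index of the first line that contains a month. The helper name `find_month` and its `max_lines` parameter are hypothetical, chosen only for this illustration.

```python
import re

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
pat = re.compile(r"\b(?:{})\b".format("|".join(months)))

def find_month(text, max_lines=5):
    """Return (line_index, month) for the first month found in the
    first max_lines lines of text, or None if there is no match."""
    for idx, line in enumerate(text.splitlines()[:max_lines]):
        match = pat.search(line)
        if match:
            return idx, match.group()
    return None

txt = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5 April"
result = find_month(txt)  # (4, 'April'): zero-based index of the 5th line
```

Searching each line separately avoids having to map a match offset in the joined text back to a line number.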

Upvotes: 2

midori

Reputation: 4837

It depends on what sections is; I assume it's a multiline string:

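import re

# Month names from the question (missing comma and spelling fixed)
m = ('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')
sections = 'some sections here'
# Close the final \b, and slice [0:5] so the 5th line is included
dates = re.findall('\\b' + '\\b|\\b'.join(m) + '\\b', ' '.join(sections.splitlines()[0:5]))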
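If sections is instead a list of article strings, as the loop in the question suggests, the same idea can be applied to each article in turn. This is only a sketch under that assumption; the sample articles below are made up for illustration.

```python
import re

m = ('January', 'February', 'March', 'April', 'May', 'June', 'July',
     'August', 'September', 'October', 'November', 'December')
pattern = re.compile(r"\b(?:" + "|".join(m) + r")\b")

# Hypothetical sample data: each article is one multiline string.
sections = [
    "Title A\nByline\nSource\n12 March 2015\nBody text",
    "Title B\nByline\nSource\nSection\nJuly 4, 2014\nBody mentions May later",
]

dates = []
for article in sections:
    # Only the first five lines of each article are searched.
    head = " ".join(article.splitlines()[0:5])
    dates.append(pattern.findall(head))

# dates is now [['March'], ['July']]; 'May' on line 6 is ignored.
```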

Upvotes: 1
