lamwaiman1988

Reputation: 3742

How to match start and end of line across multiple lines

I scraped some text from a website with BeautifulSoup. Initially it looks like this:

9-Day Weather Forecast

General Situation: An anticyclone aloft over the northern part of the South China Sea will bring mainly fine and hot weather to the south China coast in the next few days. Under the influence of a trough of low pressure, there will be showers over southern China midweek next week.

Date/Month 18/5 (Friday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent

Date/Month 19/5(Saturday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent

I want to extract each block that starts with "Date/Month" and ends with "Per Cent"; each block spans several lines. I got a NavigableString by looking up a large string inside an HTML tag. I could not get re to search the NavigableString, so I tried to turn it into a unicode string with:

daily_forecast_text = str(daily_forecast_text.encode('utf-8'))

The result looks like this:

b'\r\n9-Day Weather Forecast\n\nGeneral Situation:\nAn anticyclone aloft over the northern part of the South\nChina Sea will bring mainly fine and very hot weather to the\nsouth China coast in the next few days. Under the influence\nof a trough of low pressure, there will be showers over\nsouthern China midweek next week.\n\nDate/Month 18/5 (Friday)\nWind: South force 2 to 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 27 - 33 C\nR.H. Range: 60 - 85 Per Cent\n\nDate/Month 19/5(Saturday)\nWind: South force 2 to 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 27 - 33 C\nR.H. Range: 60 - 85 Per Cent\n\nDate/Month 20/5(Sunday)\nWind: South force 2 to 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 28 - 33 C\nR.H. Range: 65 - 85 Per Cent\n\nDate/Month 21/5(Monday)\nWind: Southwest force 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 28 - 33 C\nR.H. Range: 65 - 85 Per Cent\n\nDate/Month 22/5(Tuesday)\nWind: Southwest force 2 to 3.\nWeather: Mainly fine and very hot. Isolated showers later.\nTemp Range: 28 - 33 C\nR.H. Range: 70 - 90 Per Cent\n\nDate/Month 23/5(Wednesday)\nWind: Light winds force 2.\nWeather: Sunny intervals and a few showers.\nTemp Range: 27 - 31 C\nR.H. Range: 70 - 95 Per Cent\n\nDate/Month 24/5(Thursday)\nWind: South force 2 to 3.\nWeather: Hot with sunny periods and a few showers.\nTemp Range: 27 - 32 C\nR.H. Range: 70 - 90 Per Cent\n\nDate/Month 25/5(Friday)\nWind: South force 3.\nWeather: Hot with sunny periods and one or two showers.\nTemp Range: 27 - 32 C\nR.H. Range: 70 - 90 Per Cent\n\nDate/Month 26/5(Saturday)\nWind: South force 3 to 4.\nWeather: Hot with sunny periods and one or two showers.\nTemp Range: 27 - 32 C\nR.H. 
Range: 70 - 90 Per Cent\n\nSea surface temperature at 2 p.m.17/5/2018 at North Point\nwas 27 degrees C.\n\nSoil temperatures at 7 a.m.17/5/2018 at the Hong Kong\nObservatory:\n0.5 M below surface was 27.7 degrees C.\n1.0 M below surface was 26.6 degrees C.\n\nWeather Cartoons for 9-day weather forecast\nDay 1 cartoon no. 90 - Hot\nDay 2 cartoon no. 90 - Hot\nDay 3 cartoon no. 90 - Hot\nDay 4 cartoon no. 90 - Hot\nDay 5 cartoon no. 90 - Hot\nDay 6 cartoon no. 54 - Sunny Intervals with Showers\nDay 7 cartoon no. 53 - Sunny Periods with A Few Showers\nDay 8 cartoon no. 53 - Sunny Periods with A Few Showers\nDay 9 cartoon no. 53 - Sunny Periods with A Few Showers\n'

The following code returned nothing:

result = re.findall(
    "^Date.+Cent$", daily_forecast_text, flags=re.MULTILINE | re.DOTALL)

The following code matched the text, but it returned one large string that starts at the first "Date/Month" and ends at the last "Per Cent":

result = re.findall(
    "Date.+Cent", daily_forecast_text, flags=re.MULTILINE | re.DOTALL)

Upvotes: 1

Views: 56

Answers (2)

wwii

Reputation: 23743

An HTML document containing your text:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p id="weather">9-Day Weather Forecast

General Situation: An anticyclone aloft over the northern part of the South China Sea will bring mainly fine and hot weather to the south China coast in the next few days. Under the influence of a trough of low pressure, there will be showers over southern China midweek next week.

Date/Month 18/5 (Friday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent

Date/Month 19/5(Saturday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

Get the tag:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find(id='weather')

Even though tag.string is a bs4 NavigableString, it is also a Python str:

>>> 
>>> type(tag.string)
<class 'bs4.element.NavigableString'>
>>> isinstance(tag.string, str)
True
>>> 'South force 3' in tag.string
True
>>> 

So there is no need to convert it before searching with a regular expression. A non-greedy .*? together with re.DOTALL returns each block separately:

import re

# Non-greedy .*? stops at the first "Per Cent"; DOTALL lets . match newlines.
pattern = r'Date/Month.*?Per Cent'
rex = re.compile(pattern, flags=re.DOTALL)
for match in rex.findall(tag.string):
    print(match)
    print('**************')

>>>
Date/Month 18/5 (Friday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent
**************
Date/Month 19/5(Saturday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent
**************
>>> 
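
If the ^ and $ anchors from the question are still wanted, something like the following sketch may work. It assumes the text is a real str with actual newline characters (as tag.string is); the sample text and the variable names are made up for illustration. It also shows a likely reason the converted string in the question matched nothing: str() on a bytes object returns its repr, where each newline becomes a literal backslash followed by n, so there are no line boundaries for ^ and $ to anchor to.

```python
import re

# Hypothetical sample standing in for tag.string (a real str with real newlines).
sample = (
    "9-Day Weather Forecast\n\n"
    "Date/Month 18/5 (Friday)\n"
    "Wind: South force 3.\n"
    "R.H. Range: 65 - 85 Per Cent\n\n"
    "Date/Month 19/5(Saturday)\n"
    "Wind: South force 3.\n"
    "R.H. Range: 65 - 85 Per Cent\n"
)

# MULTILINE lets ^ and $ match at every line boundary, DOTALL lets . cross
# newlines, and the non-greedy .*? stops at the first line ending in "Per Cent".
anchored = re.compile(r"^Date/Month.*?Per Cent$", flags=re.MULTILINE | re.DOTALL)
blocks = anchored.findall(sample)
print(len(blocks))  # 2

# str() on bytes gives the repr: "b'...'" with backslash-n as two literal characters.
as_repr = str(sample.encode("utf-8"))
print("\n" in as_repr)            # False: no real newlines are left
print(anchored.findall(as_repr))  # []
```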

Upvotes: 2

Taku

Reputation: 33714

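.+ is greedy, so it runs on to the last "Cent" in the text. Make it non-greedy by adding a ? after it, so that each match stops at the first "Cent":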

result = re.findall(
    "Date.+?Cent", daily_forecast_text, flags=re.DOTALL)
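
To see what the ? changes, here is a small sketch on a shortened, made-up sample (the text variable is not the asker's data). The greedy pattern should return a single match spanning both blocks, while the non-greedy one returns one match per block.

```python
import re

# Shortened, hypothetical sample with two forecast blocks.
text = (
    "Date/Month 18/5 (Friday)\nR.H. Range: 65 - 85 Per Cent\n\n"
    "Date/Month 19/5(Saturday)\nR.H. Range: 65 - 85 Per Cent\n"
)

greedy = re.findall(r"Date.+Cent", text, flags=re.DOTALL)
lazy = re.findall(r"Date.+?Cent", text, flags=re.DOTALL)

print(len(greedy))  # 1: one match from the first "Date" to the last "Cent"
print(len(lazy))    # 2: one match per block
```

Note that if daily_forecast_text is still the result of str(...encode('utf-8')) from the question, each match will contain the two-character sequence backslash-n instead of real line breaks; running the pattern on the original string avoids that.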

Upvotes: 1
