NDTB

Reputation: 97

Using regex in Python to obtain multiple repeating lines

I'm very new to RegEx and have a very large text file, a small portion of which is shown below:

<div class="hbk-preamble " id="preamble-APG5180">
<div class="hbk-preamble-entry">
<div class="hbk-preamble-icon hbk-preamble-icon_mode"></div>
<p class="hbk-preamble-heading">Offered</p>
<p><a href="index-bylocation-city-melbourne.html">City (Melbourne)</a></p><ul class="hbk-preamble-list__offerings"><li>Summer semester A 2019 (Flexible)</li></ul><p><a href="index-bylocation-clayton.html">Clayton</a></p><ul class="hbk-preamble-list__offerings"><li>First semester 2019 (On-campus)</li></ul>
</div>
</div>
<div class="notes">
<p class="hbk-heading hdg_6">Notes</p>
<p></p><ul>
<li>The unit may be offered as part of the <a class="hbk-screen-url" href="http://www.monash.edu/students/courses/arts/summer-program.html">Summer Arts Program</a><span class="hbk-print-url">Summer Arts Program (<a href="http://www.monash.edu/students/courses/arts/summer-program.html">http://www.monash.edu/students/courses/arts/summer-program.html</a>)</span>.</li>
<li>For more information please visit the <a class="hbk-screen-url" href="https://www.anzsog.edu.au/">ANZSOG webpage</a><span class="hbk-print-url">ANZSOG webpage (<a href="https://www.anzsog.edu.au/">https://www.anzsog.edu.au/</a>)</span>.</li>
</ul>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>The media is one of the most important components of any political society. In a liberal democracy like Australia, its role and function have profound implications for the conduct of politics, the nature of democracy and public policy outcomes. In this unit, the relationship between the media, politics and public policy is studied from three broad perspectives. First, the politics of the media is investigated from the perspective of liberal democratic theory in order to understand the role of news media on the policy debate. Second, the political economy of the media is investigated. Particular emphasis is on the structure and operation of media organisations and journalists and how political news is covered. Third, the unit undertakes a study of the relationship between the media and political actors. Particular emphasis is on the use of public relations and 'spin doctors' in managing the media as well as the utilisation of political advertising and strategic political communication by governments and political agents.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
<ol princestart="0" start="1" type="1">

I would like to use RegEx to get only the 'Synopsis' text out of it:

The media is one of the most important components of any political society. In a liberal democracy like Australia, its role and function have profound implications for the conduct of politics, the nature of democracy and public policy outcomes. In this unit, the relationship between the media, politics and public policy is studied from three broad perspectives. First, the politics of the media is investigated from the perspective of liberal democratic theory in order to understand the role of news media on the policy debate. Second, the political economy of the media is investigated. Particular emphasis is on the structure and operation of media organisations and journalists and how political news is covered. Third, the unit undertakes a study of the relationship between the media and political actors. Particular emphasis is on the use of public relations and 'spin doctors' in managing the media as well as the utilisation of political advertising and strategic political communication by governments and political agents.

I need the synopsis text for every section in the text file. What should I do?

So far, I've read in my text file using read and readlines, but I can't establish a pattern to get started.

Upvotes: 0

Views: 43

Answers (2)

AnnaB

Reputation: 121

I would recommend the BeautifulSoup package for this. You could try something like this:

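import requests
from bs4 import BeautifulSoup

# Fetch the page; replace the placeholder with the real URL.
data = requests.get('put website address here')
soup = BeautifulSoup(data.text, 'html.parser')

# Each <h2 class="hbk-heading"> is a section title such as "Synopsis" or "Outcomes".
for heading in soup.find_all('h2', {'class': 'hbk-heading'}):
    print(heading.text.strip())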

Upvotes: 1

nowox

Reputation: 29096

I will start by not answering your question directly, because I suspect it is an XY problem. You are dealing with HTML, and there are plenty of powerful tools made specifically for parsing it.

Take a look at BeautifulSoup for Python:

from bs4 import BeautifulSoup
# content holds the HTML text, for example what you read from your text file
soup = BeautifulSoup(content, 'html.parser')

From this soup you can then extract whatever you need.
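As a rough sketch of what that extraction could look like for the 'Synopsis' sections: the snippet below assumes every synopsis is an <h2 class="hbk-heading"> whose text is "Synopsis", followed by a sibling <div> holding the paragraph, as in the excerpt from the question. The sample string and the function name extract_synopses are made up for illustration; with the real data you would pass in the text read from your file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the structure shown in the question.
content = """
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>First unit synopsis.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>Second unit synopsis.</p>
</div>
"""


def extract_synopses(html):
    """Return the text of the <div> that follows each 'Synopsis' heading."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for heading in soup.find_all('h2', class_='hbk-heading'):
        # Skip other section headings such as "Outcomes".
        if heading.get_text(strip=True) != 'Synopsis':
            continue
        # The synopsis body is assumed to be the next sibling <div>.
        body = heading.find_next_sibling('div')
        if body is not None:
            results.append(body.get_text(' ', strip=True))
    return results


for synopsis in extract_synopses(content):
    print(synopsis)
```

If the real file places something other than a <div> right after the heading, the find_next_sibling call is the part that would need adjusting.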

If you still want to use regular expressions, https://regex101.com can help you build and test the pattern:

Demo: https://regex101.com/r/AcozoW/1

<p.*?Notes.*?<li>(.+?)<\/li>
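The pattern above targets the 'Notes' list items. A similar idea aimed at the 'Synopsis' paragraphs might look like the sketch below. It assumes the heading, <div> and <p> tags appear exactly as in the excerpt from the question, which is a fragile assumption with regex on HTML. The sample string, the SYNOPSIS_RE name and the extract_synopses helper are illustrative only. re.DOTALL is used so that . also matches newlines inside a paragraph, and findall returns one captured group per match, which covers every section in the file.

```python
import re

# Hypothetical sample that mirrors the structure shown in the question.
html = """<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>First unit synopsis.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>Second unit synopsis.</p>
</div>
"""

# Non-greedy capture of the first <p> that follows a 'Synopsis' heading.
SYNOPSIS_RE = re.compile(
    r'<h2 class="hbk-heading">Synopsis</h2>\s*<div>\s*<p>(.*?)</p>',
    re.DOTALL,
)


def extract_synopses(text):
    """Return every captured synopsis paragraph, stripped of whitespace."""
    return [match.strip() for match in SYNOPSIS_RE.findall(text)]


for synopsis in extract_synopses(html):
    print(synopsis)
```

To run it on the real data, read the whole file into one string with read() rather than readlines(), since the pattern spans several lines.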

Upvotes: 1
