Reputation: 142
I need to create a Python program that receives an HTML file from standard input and outputs the names of the species displayed under Mammals to standard output, one per line, using regex. Items whose link is "#sequence_only" should not be output.
The file used for standard input is this:
<!DOCTYPE html>
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>
My logic is as follows. I want to parse the value of href. If a line starts with <li> and the value of href starts with "#", it is a species name and I need to extract the text between the > and < characters. If the value of href starts with "https", I want to re.sub it with some other character and keep it out of the final output.
I tried to create a regex for extracting the mammal names.
#!/usr/bin/env python3
import sys
import re
html = sys.stdin.readlines()
for line in html:
    mammal_name = re.search(r'\"\>(.*?)\<', line)
    if mammal_name:
        print(mammal_name.group())
I wanted output like:
Alpaca
Armadillo
Baboon
I got output like:
">Human<
">Alpaca<
">Armadillo<
">Armadillo<
">Baboon<
I do not want Human in the output, because the line it is on does not start with <li>. I also do not want the repeated names in my output, but for that I need to access the value of href, and I am struggling with that part.
UPDATE: My grader shows me a message like this: "If you enclose species name in tags, it will be displayed in italics in many browsers, so the staff who wanted to display scientific names in italics probably used tags. In any case, it is inappropriate as a species name, so please remove it". I guess it is about >(species name)<, so do I need to replace the > and < around the species name with some other characters, probably [], and run my regex after that?
Upvotes: 3
Views: 1138
Reputation: 27723
Here, we just want to add a left boundary (<li><a.+?>) and a right boundary (<\/.+>), then capture the desired output between them in capturing group $1:
<li><a.+?>(.+)?<\/.+>
# -*- coding: UTF-8 -*-
import re
string = """
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>
"""
expression = r'<li><a.+?>(.+)?<\/.+>'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches!')
YAAAY! "Alpaca" is a match 💚💚💚
If this expression is not quite what you need, it can be modified and tested on regex101.com.
jex.im also helps to visualize the expressions.
Edit:
To exclude sequence_only, we can modify our expression to:
<li.+?#[^s].+?>(.+)?<\/.+>
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
test_str = '''
<!DOCTYPE html>
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>
'''
regex = r"<li.+?#[^s].+?>(.+)?<\/.+>"
find_matches = re.findall(regex, test_str)
for matches in find_matches:
    print(matches)
Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
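The question asks for input from standard input, and `[^s]` in the expression above also rejects any other anchor that begins with `s`. A minimal sketch of one way to handle both, assuming each species sits on its own line and `href` is the first attribute of the link. The names `SPECIES` and `species_names` and the `#sheep` sample entry are illustrative and not part of the original page.

```python
import io
import re

# The lookahead (?!sequence_only") rejects only the "#sequence_only" anchor,
# so other anchors starting with "s" (such as the made-up "#sheep") still match.
SPECIES = re.compile(r'<li><a href="#(?!sequence_only")[^"]*">(.+?)</a>')


def species_names(stream):
    """Yield the link text of each <li> entry, skipping #sequence_only."""
    for line in stream:
        match = SPECIES.search(line)
        if match:
            yield match.group(1)


# io.StringIO stands in for sys.stdin here; any iterable of lines works.
sample = io.StringIO(
    '<a href="#human">Human</a></p>\n'
    '<li><a href="#alpaca">Alpaca</a>\n'
    '<li><a href="#sequence_only">Armadillo</a> (sequence only)\n'
    '<li><a href="#sheep">Sheep</a>\n'
)
for name in species_names(sample):
    print(name)
```

With the sample lines this prints Alpaca and Sheep. In the real program, `species_names(sys.stdin)` (after `import sys`) would take the place of the `io.StringIO` stand-in.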
Upvotes: 2
Reputation: 7812
You should add more detail to your regex so that it matches only the right strings. Regex test website.
Input:
string = ''' <!DOCTYPE html>
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>'''
If you want to process all the text with one expression, you should use findall(). Code:
import re
results = re.findall("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", string)
for s in results:
    print(s)
If you want to check it line by line, you can use search(). Code:
strings = string.splitlines()
for s in strings:
    substring = re.search("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", s)
    if substring:
        print(substring.group(1))
Output:
Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
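To see that the two approaches agree, here is a small sketch that compiles the same expression once and runs it both ways on a shortened copy of the input. The names `PATTERN`, `all_at_once` and `per_line` are only illustrative.

```python
import re

# Same expression as above, compiled once so both calls share it.
PATTERN = re.compile(r'<li><a href="(?:(?!#sequence_only).)*">(.*)</a>')

text = '''<a href="#human">Human</a></p>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>'''

# Whole text at once: "." does not cross newlines, so each match stays on one line.
all_at_once = PATTERN.findall(text)

# Line by line: search() returns a match object or None for each line.
per_line = [m.group(1) for m in map(PATTERN.search, text.splitlines()) if m]

print(all_at_once)  # ['Armadillo', 'Baboon']
print(per_line)     # ['Armadillo', 'Baboon']
```

The group `(?:(?!#sequence_only).)*` consumes one character at a time only while the text ahead is not `#sequence_only`, which is why the sequence-only line can never reach the closing `">`.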
Upvotes: 0
Reputation: 645
Use re.findall to get the text of all the tags, like this:
pattern = r'<li><a.*>(.*)</a>'
find = re.findall(pattern, string)
if find:
    print(find)
Output:
['Alpaca', 'Armadillo', 'Armadillo', 'Baboon', 'Bison', 'Bonobo', 'Brown kiwi',
'Bushbaby', 'Bushbaby', 'Cat', 'Chimpanzee', 'Chinese hamster', 'Chinese pangolin',
'Cow', 'Crab-eating_macaque']
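The list above still contains the sequence-only entries (Armadillo and Bushbaby appear twice). One possible way to drop them, sketched here on the assumption that `href` is always the first attribute of the link, is to capture the href as well and filter on it. The names `pairs` and `names` are only for illustration.

```python
import re

string = '''<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)'''

# Capture both the href value and the link text as a (href, name) tuple.
pattern = r'<li><a href="([^"]*)">(.*?)</a>'
pairs = re.findall(pattern, string)

# Keep only the names whose href is not "#sequence_only".
names = [name for href, name in pairs if href != "#sequence_only"]
print(names)  # ['Armadillo', 'Bushbaby']
```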
Upvotes: 0
Reputation: 199
Use BeautifulSoup; it is a powerful package for HTML parsing:
import re
from bs4 import BeautifulSoup

# Change this to your input file
input_html = "D:/input.html"
with open(input_html, "r", encoding="utf-8") as f:
    page = f.read()

# Parse the HTML
page_soup = BeautifulSoup(page, "html.parser")

# Find the <div id="mammals"> section
div_tags = page_soup.find_all("div", {"id": "mammals"})
for div in div_tags:
    # Keep links whose href contains "#" but is not "#sequence_only"
    mammals = div.find_all("a", href=re.compile(r'#(?!sequence_only$)'))
    for tag in mammals:
        print(tag.text)
Output :
Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
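If installing a third-party package is not an option, a similar result may be possible with the standard library's `html.parser` module instead of BeautifulSoup. The sketch below is not the BeautifulSoup approach itself. The class name `MammalLinks` is made up, and it assumes, as in the sample page, that the species list ends where the next `<div>` begins.

```python
from html.parser import HTMLParser


class MammalLinks(HTMLParser):
    """Collect link text inside <div id="mammals">, skipping #sequence_only."""

    def __init__(self):
        super().__init__()
        self.in_mammals = False  # True once <div id="mammals"> has opened
        self.capture = False     # True while inside a wanted <a> element
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Assumption: any later <div> (such as the footer) ends the list.
            self.in_mammals = attrs.get("id") == "mammals"
        elif tag == "a" and self.in_mammals:
            href = attrs.get("href") or ""
            if href.startswith("#") and href != "#sequence_only":
                self.capture = True
                self.names.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.capture = False

    def handle_data(self, data):
        if self.capture:
            self.names[-1] += data


parser = MammalLinks()
parser.feed(
    '<a href="#human">Human</a>'
    '<div class="collapse" id="mammals"><ul>'
    '<li><a href="#alpaca">Alpaca</a>'
    '<li><a href="#sequence_only">Armadillo</a> (sequence only)'
    '<li><a href="#cow">Cow</a>'
    '<div class="gbFooterCopyright">'
    '<a href="https://genome.ucsc.edu/conditions.html">Conditions of Use</a>'
    '</div>'
)
parser.close()
print(parser.names)  # ['Alpaca', 'Cow']
```

Feeding `sys.stdin.read()` to `parser.feed()` would make the same class work on standard input.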
Upvotes: 0