Dari Obukhova

Reputation: 142

RegEx for extracting specific textContent in HTML tags

I need to write a Python program that reads an HTML file from standard input and prints the names of the species listed under Mammals to standard output, one per line, using regex. Items whose link points to "#sequence_only" should not be output.

The file used for standard input is this:

   <!DOCTYPE html>

  <!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>

My logic is as follows. I want to parse the value of href. If a line starts with <li> and the href value starts with "#", it is a species name, and I need to extract the name between the > and < characters. If the href value starts with "https", I want to re.sub it with some other character so that it does not appear in the final output.

I tried to write a regex for extracting the mammal names:

#!/usr/bin/env python3

import sys
import re

html = sys.stdin.readlines()

for line in html:
    mammal_name = re.search(r'\"\>(.*?)\<', line)
    if mammal_name:
        print(mammal_name.group())

I wanted output like:

Alpaca
Armadillo
Baboon

I got output like:

">Human<
">Alpaca<
">Armadillo<
">Armadillo<
">Baboon<

I do not want Human in the output, because its line does not start with <li>. I also do not want repeated names in the output, but for that I need to access the value of href, and I am struggling with that part.

UPDATE: My grader shows a message like this: "If you enclose species name in tags, it will be displayed in italics in many browsers, so the staff who wanted to display scientific names in italics probably used tags. In any case, it is inappropriate as a species name, so please remove it". I guess it is about >(species name)<, so do I need to replace the > and < around the species name with some other characters, probably [], and run my regex after that?

Upvotes: 3

Views: 1138

Answers (4)

Emma

Reputation: 27723

Here, we just add a left boundary (<li><a.+?>) and a right boundary (<\/.+>), then capture the desired output in the $1 capturing group ():

<li><a.+?>(.+)?<\/.+>

Test

# -*- coding: UTF-8 -*-
import re

string = """
<!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>
"""
expression = r'<li><a.+?>(.+)?<\/.+>'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')

Output

YAAAY! "Alpaca" is a match 💚💚💚 

RegEx

If this expression is not what you want, it can be modified and tested on regex101.com.


RegEx Circuit

jex.im also helps to visualize the expressions.



Edit:

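To exclude sequence_only, we can modify our expression as follows. Note that [^s] rejects every href fragment that begins with "s", not just sequence_only; that is sufficient for this input, where no other fragment starts with "s":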

<li.+?#[^s].+?>(.+)?<\/.+>

Demo

Python

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

test_str = '''

<!DOCTYPE html>

  <!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>

'''
regex = r"<li.+?#[^s].+?>(.+)?<\/.+>"
find_matches = re.findall(regex, test_str)
for matches in find_matches:
    print(matches)

Output

Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
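
As a side note on the [^s] expression above, here is a minimal sketch of how it behaves on an entry whose fragment happens to start with "s", compared with a negative lookahead that rejects only the exact fragment. The "#sheep" item is hypothetical and does not appear in the original file; the lookahead pattern is one possible variant, not the only way to write it.

```python
import re

# Hypothetical sample: "#sheep" is NOT in the original file.
sample = '''
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#sheep">Sheep</a>
  <li><a href="#cat">Cat</a>
'''

# Character-class version: rejects every fragment that starts with "s".
char_class = re.findall(r'<li.+?#[^s].+?>(.+)?</.+>', sample)

# Negative-lookahead version: rejects only the exact fragment "sequence_only".
lookahead = re.findall(r'<li><a href="#(?!sequence_only")[^"]*">(.+?)</a>', sample)

print(char_class)  # Sheep is dropped along with the sequence_only entry
print(lookahead)   # only the sequence_only entry is dropped
```

If the real input can contain fragments starting with "s", the lookahead form is probably the safer choice.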

Upvotes: 2

Olvin Roght

Reputation: 7812

You should add more detail to your regex so that it matches only the intended strings. Regex test website.

Input:

string = '''   <!DOCTYPE html>

  <!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>'''

If you want to process all text in one expression you should use findall(). Code:

import re

# The tempered lookahead rejects any href value that contains "#sequence_only"
results = re.findall("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", string)
for s in results:
    print(s)

If you want to check it line by line, you can use search(). Code:

strings = string.splitlines()
for s in strings:
    substring = re.search("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", s)
    if substring:
        print(substring.group(1))

Output:

Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
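
Since the question asks for a program that reads from standard input and prints only the Mammals entries, the pieces above could be combined roughly as in the sketch below. The function names mammal_names and main are made up for illustration, and the section boundaries (start at id="mammals", stop at the next </ul> if there is one) are an assumption about the full page, which is not shown in the question. The "#chicken" item in the demo is hypothetical.

```python
import io
import re
import sys

# One list item whose href is a fragment other than exactly "#sequence_only".
ITEM = re.compile(r'<li><a href="#(?!sequence_only")[^"]*">(.*?)</a>')

def mammal_names(html_text):
    # Assumption: the list begins at id="mammals" and ends at the next </ul>, if any.
    start = html_text.find('id="mammals"')
    if start == -1:
        return []
    end = html_text.find('</ul>', start)
    section = html_text[start:] if end == -1 else html_text[start:end]
    return ITEM.findall(section)

def main(stream=None):
    # Read the whole document at once; default to standard input.
    stream = sys.stdin if stream is None else stream
    for name in mammal_names(stream.read()):
        print(name)

# Demo with an in-memory stream standing in for standard input.
demo = io.StringIO('''<a href="#human">Human</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#sequence_only">Alpaca</a> (sequence only)
<li><a href="#cat">Cat</a>
</ul>
<ul>
<li><a href="#chicken">Chicken</a>
</ul>''')
main(demo)
```

In the actual script you would call main() with no argument so that it reads sys.stdin.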

Upvotes: 0

Alaa Aqeel

Reputation: 645

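Use re.findall to get the text of every matching tag in one call. Note that this pattern does not filter out the #sequence_only entries, so Armadillo and Bushbaby appear twice in the output:

import re

# `string` holds the HTML text from the question
pattern = r'<li><a.*>(.*)</a>'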
find = re.findall(pattern, string)
if find:
    print(find)

Output

['Alpaca', 'Armadillo', 'Armadillo', 'Baboon', 'Bison', 'Bonobo', 'Brown kiwi', 
'Bushbaby', 'Bushbaby', 'Cat', 'Chimpanzee', 'Chinese hamster', 'Chinese pangolin', 
'Cow', 'Crab-eating_macaque']
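
One way to drop the duplicates, sketched under some assumptions, is to capture the href together with the link text and filter in Python. The last step also strips markup nested inside the link text, which may be what the grader message in the question is about; the <i> tag in the sample is hypothetical, since the question does not show which tag the real input uses.

```python
import re

sample = '''
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon"><i>Baboon</i></a>
'''

# Capture (href, link text) pairs instead of the text alone.
pairs = re.findall(r'<li><a href="([^"]*)">(.*?)</a>', sample)

names = []
for href, text in pairs:
    if href == "#sequence_only":
        continue  # skip the duplicate "(sequence only)" entries
    # Drop any markup nested inside the link text, e.g. the hypothetical <i>...</i>.
    names.append(re.sub(r'<[^>]+>', '', text))

print(names)
```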

Upvotes: 0

Kaies LAMIRI

Reputation: 199

Use BeautifulSoup; it is a powerful package for HTML parsing:

import re

from bs4 import BeautifulSoup

# Change this to your input file
input_html = "D:/input.html"

# The with block closes the file automatically
with open(input_html, encoding="utf-8") as f:
    page = f.read()

# Parse the HTML
page_soup = BeautifulSoup(page, "html.parser")

# Find the container that holds the Mammals list
div_tags = page_soup.find_all("div", {"id": "mammals"})

for div in div_tags:
    # Keep links whose href contains "#" and does not end in "#sequence_only"
    mammals = div.find_all("a", href=re.compile(r'#(?!sequence_only$)'))
    for link in mammals:
        print(link.text)

Output:

Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
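
If installing a third-party package is not an option, a similar idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only an illustration: the MammalParser class is made up for this example, and treating </ul> as the end of the Mammals list is an assumption about the full page. Because the text is collected piece by piece, markup nested inside a link (the <i> here is hypothetical) is dropped automatically.

```python
from html.parser import HTMLParser

class MammalParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_mammals = False  # True once the id="mammals" element has been seen
        self.in_link = False     # True while inside a wanted <a> element
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "mammals":
            self.in_mammals = True
        elif tag == "a" and self.in_mammals:
            href = attrs.get("href") or ""
            # Keep fragment links only, except the "#sequence_only" duplicates.
            if href.startswith("#") and href != "#sequence_only":
                self.in_link = True
                self.names.append("")

    def handle_data(self, data):
        if self.in_link:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "ul":
            # Assumption: the Mammals list ends at its closing </ul>.
            self.in_mammals = False

parser = MammalParser()
parser.feed('<a href="#human">Human</a><div id="mammals"><ul>'
            '<li><a href="#cat"><i>Cat</i></a>'
            '<li><a href="#sequence_only">Cat</a> (sequence only)'
            '<li><a href="#cow">Cow</a></ul>'
            '<a href="https://genome.ucsc.edu/conditions.html">Conditions of Use</a>')
print(parser.names)
```

To use it on the real input, feed it sys.stdin.read() and print each entry of parser.names on its own line.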


Upvotes: 0
