Scraping specific text from a webpage

I am currently trying to scrape some graphs from a web page but I am new at this and don't know the best solutions.

<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }&gt;;</font>

What I need is the part g:=Graph<..>. Here is what I have tried so far (based on some similar questions):

tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE']")
graphurls.append(rate[0].text_content())

The problem is that this also scrapes a lot of other things. I think it should be possible to match only the unique pattern g:=Graph<...> so that nothing else gets scraped.

Can you help me?

Upvotes: 2

Answers (3)

furas

Reputation: 142641

First method: you already have a string, so you can use string methods to filter the results, e.g.

if text.strip().startswith('g:=Graph') :

Example:

data = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

import lxml.html as lh

tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE']")

for item in rate:
    text = item.text_content()
    text = text.strip()
    if text.startswith('g:=Graph'):
        print(' OK:', text)
    else:
        print('NOT:', text)

Second method: you can use xpath to filter it

tree.xpath("//font[@color='DarkBLUE' and contains(text(), 'g:=Graph')]")

tree.xpath("//font[@color='DarkBLUE'][contains(text(), 'g:=Graph')]")

Example:

data = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

import lxml.html as lh

tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE' and contains(text(), 'g:=Graph')]")

for item in rate:
    text = item.text_content()
    text = text.strip()
    print(text)

Alternatively, you can use starts-with(), but the text in data begins on a new line, so the string in the XPath needs \n at the start

tree.xpath("//font[@color='DarkBLUE' and starts-with(text(), '\ng:=Graph')]")
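
If hard-coding the leading newline feels fragile, XPath's normalize-space() can trim the whitespace before the comparison. This is a minimal sketch on the same sample data, not something tested against the real page; the variable name matches is only illustrative.

```python
import lxml.html as lh

# Sample markup from the question plus a second, non-matching element.
data = '''<font color="DarkBLUE">
g:=Graph&lt;5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }&gt;;</font>

<font color="DarkBLUE">h:=Other&lt;...&gt;;</font>'''

tree = lh.fromstring(data)

# normalize-space() strips leading/trailing whitespace from the first text node,
# so starts-with() no longer depends on the '\n' at the beginning.
matches = tree.xpath(
    "//font[@color='DarkBLUE' and starts-with(normalize-space(text()), 'g:=Graph')]"
)

for item in matches:
    print(item.text_content().strip())
```

Only the first font element should be selected, because the second one starts with h:=Other.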

BTW: xpath cheatsheet

Upvotes: 1

0buz

Reputation: 3503

One way is via regex:

import re

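graphs = re.findall(r"g:=Graph<.*?>;", rate[0].text_content(), re.DOTALL)

This captures all matches starting with "g:=Graph<" and ending with ">;". The raw HTML shows &gt;; but text_content() decodes that to ">;", so there is only one semicolon to match. re.DOTALL lets . match the line break inside the graph definition, and the non-greedy .*? stops at the first ">;". It looks for such matches in the string rate[0].text_content().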

Note: Apply this to strings i.e. .text_content(), NOT to raw HTML.
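
A minimal, self-contained sketch of the regex approach on the sample from the question. The text variable is an assumption standing in for rate[0].text_content() (entities already decoded, line break preserved). Because the graph definition spans two lines and the decoded text ends in ">;", the sketch uses a non-greedy pattern with re.DOTALL.

```python
import re

# Assumed output of rate[0].text_content() for the sample markup,
# followed by an unrelated statement that should not be captured.
text = """
g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }>;
h:=Other<...>;"""

# Non-greedy match from 'g:=Graph<' to the first '>;'.
# re.DOTALL lets '.' cross the line break inside the definition.
pattern = re.compile(r"g:=Graph<.*?>;", re.DOTALL)

# Collapse the embedded line break so each graph ends up on one line.
graphs = [" ".join(match.split()) for match in pattern.findall(text)]

for graph in graphs:
    print(graph)
```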

Upvotes: 1

iggy12345

Reputation: 1383

I'd try using a regular expression (https://docs.python.org/3/library/re.html). You can use https://regex101.com/ to experiment until you find the right pattern.

Specifically, you can use a repeated capture group such as (\{\d+,\s*\d+\},?\s*)+ to match the repeating sequence of

"{2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5},..."

I re-read your question, and you might already know all of that, but you can also use a regular expression in Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression
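
As a rough sketch of how the pair pattern could turn a scraped graph string into usable data: in Python's re module a repeated group like (...)+ only keeps its last repetition, so this example applies the single-pair pattern with re.findall instead. The graph string and the edges name are assumptions for illustration.

```python
import re

# Assumed: one already-extracted graph definition, including the line break
# that appears in the page source.
graph = ("g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, "
         "{1, 4}, {2, 4}, {3, 5}, {2,\n5}, {3, 4} }>;")

# Each '{a, b}' pair becomes an (a, b) tuple of ints.
# '\s*' also absorbs the line break inside '{2,\n5}'.
# The outer '{ ' is skipped because a digit must follow the brace directly.
edges = [(int(a), int(b)) for a, b in re.findall(r"\{(\d+),\s*(\d+)\}", graph)]

print(len(edges))
print(edges)
```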

Upvotes: 1
