Reputation: 180
I am trying to scrape some graphs from a web page, but I am new to this and don't know the best approach. The page contains markup like this:
<font color="DarkBLUE">
g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }>;</font>
What I need is the part g:=Graph<..>.
Here is what I have tried so far (based on some similar questions):
tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE']")
graphurls.append(rate[0].text_content())
The problem is that this also scrapes a lot of other things. I think it should be possible to match only the unique pattern g:=Graph<...> so that nothing else gets scraped.
Can you help me?
Upvotes: 2
Views: 227
Reputation: 142641
First method: you already have a string, so you can use string methods to filter the results, e.g.
if text.strip().startswith('g:=Graph'):
Example:
data = '''<font color="DarkBLUE">
g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }>;</font>
<font color="DarkBLUE">h:=Other<...>;</font>'''
import lxml.html as lh
tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE']")
for item in rate:
    text = item.text_content()
    text = text.strip()
    if text.startswith('g:=Graph'):
        print(' OK:', text)
    else:
        print('NOT:', text)
Second method: you can use xpath to filter it:
tree.xpath("//font[@color='DarkBLUE' and contains(text(), 'g:=Graph')]")
or
tree.xpath("//font[@color='DarkBLUE'][contains(text(), 'g:=Graph')]")
Example:
data = '''<font color="DarkBLUE">
g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }>;</font>
<font color="DarkBLUE">h:=Other<...>;</font>'''
import lxml.html as lh
tree = lh.fromstring(data)
rate = tree.xpath("//font[@color='DarkBLUE' and contains(text(), 'g:=Graph')]")
for item in rate:
    text = item.text_content()
    text = text.strip()
    print(text)
Alternatively, use starts-with(), but the text in data begins on a new line, so the string in the xpath needs \n at the start:
tree.xpath("//font[@color='DarkBLUE' and starts-with(text(), '\ng:=Graph')]")
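If hard-coding the leading newline feels fragile, one possible variation (my own suggestion, not tested against the real page) is XPath's normalize-space(), which trims leading and trailing whitespace before the comparison. A minimal sketch using the same sample data; the names matches and graphs are only for illustration:

```python
import lxml.html as lh

# Same sample markup as in the examples above.
data = '''<font color="DarkBLUE">
g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }>;</font>
<font color="DarkBLUE">h:=Other<...>;</font>'''

tree = lh.fromstring(data)

# normalize-space() with no argument works on the element's whole text,
# so the leading newline no longer has to appear in the pattern.
matches = tree.xpath(
    "//font[@color='DarkBLUE' and starts-with(normalize-space(), 'g:=Graph')]"
)

graphs = [item.text_content().strip() for item in matches]
for graph in graphs:
    print(graph)
```

Because the check runs on the normalized text, it should behave the same whether or not the page puts a line break after the opening tag.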
BTW: xpath cheatsheet
Upvotes: 1
Reputation: 3503
One way is via regex:
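import re
graphs = re.findall(r"g:=Graph<.*?>;", rate[0].text_content(), re.DOTALL)
This captures every match that starts with "g:=Graph<" and ends with ">;" in the string rate[0].text_content(). The re.DOTALL flag lets . match newlines, which matters because the graph definition in the question spans more than one line, and the non-greedy .*? stops at the first ">;".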
Note: apply this to strings, i.e. the result of .text_content(), NOT to raw HTML.
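As a self-contained sketch of the regex idea, assuming the page text has already been extracted into a string and that each graph ends with ">;" as in the question. The sample text and the names pattern and graphs are made up for illustration:

```python
import re

# Hypothetical text, as text_content() might return it: the graph
# definition is surrounded by unrelated content and spans two lines.
text = """Some unrelated heading
g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }>;
h:=Other<...>;"""

# .*? is non-greedy, so each match stops at the first ">;".
# re.DOTALL lets "." cross the line break inside the definition.
pattern = re.compile(r"g:=Graph<.*?>;", re.DOTALL)
graphs = pattern.findall(text)
print(graphs)
```

Without re.DOTALL the pattern would find nothing here, because the definition is split across two lines.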
Upvotes: 1
Reputation: 1383
I'd try using a regular expression (https://docs.python.org/3/library/re.html); you can use https://regex101.com/ to experiment until you find the right pattern.
Specifically, you can use a repeated capture group such as (\{\d+,\s*\d+\},?\s*)+ to match the repeating sequence of
"{2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5},..."
I re-read your question, and you may already know all of that, but you can also use a regular expression in Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression
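One caveat about the repeated group: with re.findall, a group followed by + only reports its last repetition. To pull out every pair, it may be simpler to match a single {a, b} pair and let findall do the repeating. A minimal sketch, with variable names of my own choosing:

```python
import re

graph = """g:=Graph<5|{ {2, 3}, {4, 5}, {1, 3}, {1, 2}, {1, 5}, {1, 4}, {2, 4}, {3, 5}, {2,
5}, {3, 4} }>;"""

# One pair per match; \s* also absorbs the line break inside "{2,\n5}".
pairs = re.findall(r"\{(\d+),\s*(\d+)\}", graph)

# Convert the captured digit strings into integer edge tuples.
edges = [(int(a), int(b)) for a, b in pairs]
print(len(edges))
print(edges)
```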
Upvotes: 1