Reputation: 367
I am writing a web scraper, and I am having trouble processing pages that are most likely generated, like this one:
<html>
<body>
<div >
<code>
<p class="nt"><my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">></my-component></p>
<p class="c"><!-- Or more succinctly, --></p>
<p class="nt"><my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">></my-component></p>
</code>
</div>
<div>
<code>
<p class="nt"><my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">></my-component></p>
<p class="c"><!-- Or more succinctly, --></p>
<p class="nt"><my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">></my-component></p>
</code>
</div>
</body>
</html>
The most important part is the content between the <code> tags. The plan is to extract the text from the tags nested inside them, remove those nested tags, and keep the rest of the DOM as it is.
So I need output like this:
<html>
<body>
<div >
<code>
text text and more text
</code>
</div>
</body>
</html>
My attempt so far:
from bs4 import BeautifulSoup
bs = BeautifulSoup(payload, 'lxml')
with open('/tmp/out.html', 'w+') as f:
    for t in bs.find_all():
        for q in t.find_all('code'):
            # print(t.text, t.next_sibling)
            f.write(q.text)
But this doesn't give good results. From what I have learned, the main purpose of bs is to extract elements, which is why I tried recreating the DOM in another file.
Thanks!
Upvotes: 0
Views: 241
Reputation: 6483
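You can replace each <code> tag with a new <code> tag that holds only the text of the original, then write the modified soup back out: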
from bs4 import BeautifulSoup
payload='''
<html>
<body>
<div >
<code>
<p class="nt"><my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">></my-component></p>
<p class="c"><!-- Or more succinctly, --></p>
<p class="nt"><my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">></my-component></p>
</code>
</div>
<div>
<code>
<p class="nt"><my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">></my-component></p>
<p class="c"><!-- Or more succinctly, --></p>
<p class="nt"><my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">></my-component></p>
</code>
</div>
</body>
</html>
'''
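soup = BeautifulSoup(payload, 'lxml')
for match in soup.find_all('code'):
    # Build a fresh <code> tag that contains only the text of the original one.
    new_t = soup.new_tag('code')
    new_t.string = match.text
    # Swap the original tag (with its nested tags) for the text-only copy.
    match.replace_with(new_t)
with open(r'prove.html', "w") as file:
    file.write(str(soup))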
Output (prove.html):
<html>
<body>
<div>
<code>
<my-component v-bind:prop1="parentValue"></my-component>
<!-- Or more succinctly, -->
<my-component :prop1="parentValue"></my-component>
</code>
</div>
<div>
<code>
<my-component v-on:myEvent="parentHandler"></my-component>
<!-- Or more succinctly, -->
<my-component @myEvent="parentHandler"></my-component>
</code>
</div>
</body>
</html>
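As a possible variation on the code above, the same flattening can be done in place, without creating a new tag: assigning to a tag's .string replaces its contents with that string. This is only a minimal sketch. The function name flatten_code_blocks and the sample markup are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup


def flatten_code_blocks(html: str) -> str:
    """Replace the children of every <code> tag with their combined text.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    for code in soup.find_all("code"):
        # Setting .string drops the nested tags and keeps only their text.
        code.string = code.get_text()
    return str(soup)


# Illustrative input: highlighted code wrapped in nested tags.
sample = '<div><code><span class="k">let</span> <span class="n">x</span></code></div>'
result = flatten_code_blocks(sample)
print(result)
```

The rest of the document is left untouched, so the returned string can be written to a file the same way as str(soup) above. If the nested tags should be removed one at a time instead, calling unwrap() on each of them is another option.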
Upvotes: 1