I have some <li> tags with nested <a> tags in an HTML file, and there is text in both the <li> and the <a> tag. However, I want to extract them separately: the <li> text should become the value for the key "tag", and the text inside the <a> tag should become the value for the children's tag. (See below for the HTML snippet.)
I ultimately write this to a JSON file, but I am getting unwanted results. The main tag should only contain "abstract visualization", not everything that follows it. The children's tag should only contain "about", without the "/ Emotive and abstract" following it; "Emotive and abstract" already has a place in the title. Every entry in this index sample shows the same pattern. How do I extract the text to the right places? I am a beginner with Beautiful Soup. Thank you.
JSON file:
{
    "tag": "abstract visualization\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nabout / Emotive and abstract\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "about / Emotive and abstract",
            "definition": "",
            "source": [
                {
                    "title": "Emotive and abstract",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch02s03.html"
                }
            ]
        }
    ]
},
{
    "tag": "Adobe After Effects\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\nURL / Other specialist tools\n\n",
    "definition": "",
    "source": [],
    "children": [
        {
            "tag": "URL / Other specialist tools",
            "definition": "",
            "source": [
                {
                    "title": "Other specialist tools",
                    "href": "https://learning.oreilly.com/library/view/data-visualization-a/9781849693462/ch06.html"
                }
            ]
        }
    ]
},
HTML file snippet:
<ul id="letters">
    <li>abstract visualization
        <ul>
            <li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
        </ul>
    </li>
    <li>Adobe After Effects
        <ul>
            <li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
        </ul>
    </li>
    <li>Adobe Flash
        <ul>
            <li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
            <li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
        </ul>
    </li>
    <li>Adobe Illustrator
        <ul>
            <li>about / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
            <li>URL / <a href="ch06.html" title="Other specialist tools" class="link">Other specialist tools</a></li>
        </ul>
    </li>
</ul>
Relevant code:
from bs4 import BeautifulSoup

# convert html to bs4 object
def bs4_convert(file):
    with open(file, encoding='utf8') as fp:
        html = BeautifulSoup(fp, 'html.parser')
    return html

# create a tag
def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': li.text,
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)
    return tags

# loop through all indices
def html_parser(html, link_prefix):
    tags = []
    # extract index
    html.find(id='backindex')
    # iterate over every indented letter in index
    letters = html.find_all(attrs={'id': 'letters'})
    for letter in letters:
        tags += li_parser(letter, link_prefix)
    return tags

tags = []
# parse the html
html = bs4_convert(course['file'])
# create tags
tags = html_parser(html, link_prefix)
# add course name as outermost tag
tags = add_course_tag(course['course'], tags)
Upvotes: 1
Views: 1055
Reputation: 25048
To get the right string for your tag, you can take an approach similar to @diggusbickus's and use stripped_strings, picking the first element and stripping the trailing " /" from it:
'tag': list(li.stripped_strings)[0].strip(' /')
In the context of your li_parser function:
def li_parser(letter, link_prefix):
    tags = []
    for li in letter.find_all('li', recursive=False):
        tag = {
            'tag': list(li.stripped_strings)[0].strip(' /'),
            'definition': '',
            'source': [{'title': link.text, 'href': link_prefix + link['href']} for link in li.find_all('a', recursive=False)]
        }
        if li.find('ul'):
            tag['children'] = li_parser(li.find('ul'), link_prefix)
        tags.append(tag)
    return tags
[{"tag": "abstract visualization", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Emotive and abstract", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch02s03.html"}]}]}, {"tag": "Adobe After Effects", "definition": "", "source": [], "children": [{"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Flash", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Programming environments", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}, {"tag": "Adobe Illustrator", "definition": "", "source": [], "children": [{"tag": "about", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}, {"tag": "URL", "definition": "", "source": [{"title": "Other specialist tools", "href": "https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/ch06.html"}]}]}]
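To see why taking the first element works, it may help to look at what stripped_strings actually yields for an outer and an inner <li>. The following is only a minimal sketch on a shortened copy of the HTML from the question; the variable names (outer, inner, outer_strings, and so on) are made up for illustration, and html.parser is used as in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML (one index entry only).
html = """
<ul id="letters">
<li>abstract visualization
<ul>
<li>about / <a href="ch02s03.html" title="Emotive and abstract" class="link">Emotive and abstract</a></li>
</ul>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="letters").find("li")  # top-level <li>
inner = outer.find("ul").find("li")         # nested <li>

# stripped_strings walks ALL descendant text nodes, stripped of
# surrounding whitespace and with whitespace-only nodes skipped.
outer_strings = list(outer.stripped_strings)
inner_strings = list(inner.stripped_strings)
print(outer_strings)
print(inner_strings)

# The first string is the <li>'s own leading text; strip(' /')
# removes spaces and slashes from both ends of it.
outer_tag = outer_strings[0].strip(" /")
inner_tag = inner_strings[0].strip(" /")
print(outer_tag, "|", inner_tag)
```

One caveat with this approach: if an <li> begins directly with an <a> and has no leading text of its own, the first stripped string would be the link text instead.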
Upvotes: 1
Reputation: 2012
The children of a tag can be accessed through a list called contents. In your case, the text you are looking for is just contents[0], so this is easier than looping through all the children. You just have to remove the unneeded tabs and newlines with strip():
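from bs4 import BeautifulSoup

# data is assumed to hold the HTML from the question as a string
soup = BeautifulSoup(data, 'lxml')
lis = soup.select('#letters > li')
for li in lis:
    print(li.contents[0].strip())
    sub_li = li.select_one('ul li')  # first nested <li> only
    print(sub_li.contents[0].strip()[:-2])  # get rid of the trailing slash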
which outputs:
abstract visualization
about
Adobe After Effects
URL
Adobe Flash
about
Adobe Illustrator
about
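Note that select_one only returns the first nested <li>, which is why the second entry of "Adobe Flash" and "Adobe Illustrator" (URL) does not appear above. A possible variation that visits every nested item is sketched below. It is an illustration rather than a drop-in replacement: the helper leading_text and the result dictionary are hypothetical names, html.parser is used instead of lxml to avoid the extra dependency, and rstrip(' /') stands in for the [:-2] slice.

```python
from bs4 import BeautifulSoup, NavigableString

# One entry from the question's HTML that has two nested items.
html = """<ul id="letters">
<li>Adobe Flash
<ul>
<li>about / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
<li>URL / <a href="ch06.html" title="Programming environments" class="link">Programming environments</a></li>
</ul>
</li>
</ul>"""


def leading_text(li):
    """Return the text node that opens an <li>, or '' if it starts with a tag."""
    first = li.contents[0] if li.contents else None
    if isinstance(first, NavigableString):
        # Drop surrounding whitespace, then any trailing " /" separator.
        return str(first).strip().rstrip(" /")
    return ""


soup = BeautifulSoup(html, "html.parser")
result = {}
for li in soup.find(id="letters").find_all("li", recursive=False):
    sub_ul = li.find("ul", recursive=False)
    # Only direct <li> children of the nested <ul>, all of them.
    subs = sub_ul.find_all("li", recursive=False) if sub_ul else []
    result[leading_text(li)] = [leading_text(sub) for sub in subs]

print(result)
```

Checking that contents[0] is a NavigableString guards against items that begin with a link rather than plain text, in which case calling strip() on it would not do what is intended.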
Upvotes: 1