Reputation: 46371
When working with a new XML structure, it is always helpful to see the big picture first.
When loading it with BeautifulSoup:
import requests, bs4
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml')
print(x)
is there a built-in way to display its tree structure with different depths?
Example for https://www.w3schools.com/xml/cd_catalog.xml: with maxdepth=0, it would be:
CATALOG
with maxdepth=1, it would be:
CATALOG
  CD
  CD
  CD
  ...
and with maxdepth=2, it would be:
CATALOG
  CD
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  CD
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  ...
Upvotes: 3
Views: 3423
Reputation: 1920
I have used xmltodict 0.12.0 (installed via Anaconda), which did the job for XML parsing, though not for depth-wise viewing. The result works much like any other dictionary. From there, a recursion with depth counting should be the way to go.
import requests, xmltodict, json
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = xmltodict.parse(s, process_namespaces=True)
for key in x:
    print(json.dumps(x[key], indent=4, default=str))
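The recursion with depth counting mentioned above could look roughly like the sketch below. It assumes the nested dict/list shape that xmltodict.parse produces (repeated elements grouped into a list under one key, attributes under '@' keys, mixed text under '#text'). The function name tree_lines and the small sample dict are made up for illustration and stand in for the parsed catalog so the snippet runs without a network call.

```python
def tree_lines(node, maxdepth, depth=0, indent=2):
    """Collect indented tag names from an xmltodict-style dict, down to maxdepth."""
    lines = []
    # Stop below the requested depth, or when we reach a text leaf
    if depth > maxdepth or not isinstance(node, dict):
        return lines
    for tag, child in node.items():
        # Skip attribute and text keys (xmltodict's default prefixes)
        if tag.startswith('@') or tag == '#text':
            continue
        # Repeated elements arrive as a list under a single key
        children = child if isinstance(child, list) else [child]
        for item in children:
            lines.append(' ' * (indent * depth) + tag)
            lines.extend(tree_lines(item, maxdepth, depth + 1, indent))
    return lines


# Hypothetical stand-in for x = xmltodict.parse(s)
sample = {'CATALOG': {'CD': [
    {'TITLE': 'Empire Burlesque', 'ARTIST': 'Bob Dylan'},
    {'TITLE': 'Hide your heart', 'ARTIST': 'Bonnie Tyler'},
]}}

print('\n'.join(tree_lines(sample, maxdepth=1)))
```

With the real data, passing the parsed x instead of sample should work the same way. Note that this form groups siblings by tag name, so the original document order is only preserved when same-named elements are adjacent.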
Upvotes: 1
Reputation: 199
Here is one solution without BeautifulSoup.
import requests
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
tab_size = 2  # spaces per nesting level in the source file
target_depth = 2
for element in s.split('\n'):
    # Depth is inferred from the leading whitespace of each raw line
    depth = (len(element) - len(element.lstrip())) // tab_size
    if depth <= target_depth:
        print(' ' * depth + element)
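The loop above depends on the file being consistently indented. If that cannot be assumed, the same idea can be sketched with the standard library's xml.etree.ElementTree, which tracks nesting by parsing rather than by whitespace. This is only a sketch: the helper name outline and the short inline sample are assumptions for illustration, not part of the original approach.

```python
import xml.etree.ElementTree as ET


def outline(element, maxdepth, depth=0, indent=2):
    """Return indented tag names for element and its descendants down to maxdepth."""
    lines = [' ' * (indent * depth) + element.tag]
    if depth < maxdepth:
        # Iterating over an Element yields its direct children in document order
        for child in element:
            lines.extend(outline(child, maxdepth, depth + 1, indent))
    return lines


# Small inline sample shaped like the catalog; the real text would be
# requests.get(...).text as in the snippet above
sample = (
    "<CATALOG>"
    "<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST></CD>"
    "<CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST></CD>"
    "</CATALOG>"
)

root = ET.fromstring(sample)
print('\n'.join(outline(root, maxdepth=1)))
```

Unlike the line-based version, this prints only tag names, without closing tags or text content.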
Upvotes: 0
Reputation:
Here's a quick way to do it: use the prettify() function to structure the document, then capture each line's indentation and opening tag name with a regex (in this case it matches uppercase names inside opening tags). If the indentation from prettify() is within the depth limit, print the tag name with the specified indentation size.
import requests, bs4
import re
maxdepth = 1
indent_size = 2
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml').prettify()
for line in x.split("\n"):
    # prettify() indents one space per nesting level by default,
    # so the length of the leading whitespace equals the depth
    match = re.match(r"(\s*)<([A-Z]+)>", line)
    if match and len(match.group(1)) <= maxdepth:
        print(indent_size*match.group(1) + match.group(2))
Upvotes: 2