Basj

Reputation: 46371

Display XML tree structure with BeautifulSoup

When working with a new XML structure, it is always helpful to see the big picture first.

When loading it with BeautifulSoup:

import requests, bs4
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml')
print(x)

is there a built-in way to display its tree structure with different depths?


For example, for https://www.w3schools.com/xml/cd_catalog.xml with maxdepth=0, the output would be:

CATALOG

with maxdepth=1, it would be:

CATALOG
  CD 
  CD
  CD
  ...

and with maxdepth=2, it would be:

CATALOG
  CD 
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  CD 
    TITLE
    ARTIST
    COUNTRY
    COMPANY
    PRICE
    YEAR
  ...

Upvotes: 3

Views: 3423

Answers (3)

Aramakus

Reputation: 1920

I used xmltodict 0.12.0 (installed via Anaconda), which handles the XML parsing but not the depth-limited display. The parsed result behaves much like any other dictionary, so a recursive traversal with a depth counter should be the way to go from here.

import requests, xmltodict, json

s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = xmltodict.parse(s, process_namespaces=True)

for key in x:
    print(json.dumps(x[key], indent=4, default=str))
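The recursion with depth counting mentioned above could look roughly like the sketch below. It is not part of xmltodict: dict_tree_lines is a hypothetical helper, and the sample dictionary is hand-written placeholder data shaped the way xmltodict returns repeated elements (as a list under a single key), so that the snippet runs without a network call.

```python
def dict_tree_lines(node, maxdepth, depth=0, indent=2):
    """Return indented element names from an xmltodict-style mapping."""
    lines = []
    # Text leaves are plain strings (or None); stop there or past maxdepth.
    if not isinstance(node, dict) or depth > maxdepth:
        return lines
    for name, child in node.items():
        # Skip xmltodict's attribute ('@...') and text ('#text') keys.
        if name.startswith(('@', '#')):
            continue
        # Repeated sibling elements are grouped into a list under one key.
        children = child if isinstance(child, list) else [child]
        for item in children:
            lines.append(' ' * (indent * depth) + name)
            lines.extend(dict_tree_lines(item, maxdepth, depth + 1, indent))
    return lines


# Placeholder data, not the real catalog contents.
sample = {'CATALOG': {'CD': [
    {'TITLE': 't1', 'ARTIST': 'a1'},
    {'TITLE': 't2', 'ARTIST': 'a2'},
]}}

print('\n'.join(dict_tree_lines(sample, maxdepth=1)))
```

With the parsed document from the snippet above, the call would be dict_tree_lines(x, maxdepth=2). Because xmltodict groups same-named siblings together, the original order of interleaved sibling elements is not preserved in this view.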

Upvotes: 1

joc

Reputation: 199

Here is one solution without BeautifulSoup.

import requests
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
array = []

tab_size = 2
target_depth = 2

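for element in s.split('\n'):
    # Depth is inferred from the leading whitespace of each line, so this
    # only works if the source file is indented with tab_size spaces per level.
    depth = (len(element) - len(element.lstrip())) / tab_size
    if depth <= target_depth:
        # Prints the whole line (closing tags and text included), shifted
        # right by one extra space per level.
        print(' ' * int(depth) + element)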
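The loop above relies on how the source file happens to be indented. As an alternative that also avoids BeautifulSoup, here is a minimal sketch using the standard library's xml.etree.ElementTree, which walks the parsed elements instead of the raw text. element_tree_lines is a hypothetical helper name, and the inline sample is placeholder XML rather than the real catalog.

```python
import xml.etree.ElementTree as ET


def element_tree_lines(elem, maxdepth, depth=0, indent=2):
    """Return indented tag names for elem and its descendants up to maxdepth."""
    lines = [' ' * (indent * depth) + elem.tag]
    if depth < maxdepth:
        # Iterating over an Element yields its direct child elements.
        for child in elem:
            lines.extend(element_tree_lines(child, maxdepth, depth + 1, indent))
    return lines


# Placeholder document with no indentation at all, to show that the
# result does not depend on the layout of the source text.
sample = """<CATALOG>
<CD><TITLE>t1</TITLE><ARTIST>a1</ARTIST></CD>
<CD><TITLE>t2</TITLE><ARTIST>a2</ARTIST></CD>
</CATALOG>"""

root = ET.fromstring(sample)
print('\n'.join(element_tree_lines(root, maxdepth=1)))
```

For the real file, root = ET.fromstring(s) with s from the snippet above should work the same way, assuming the response is well-formed XML.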

Upvotes: 0

user6276743

Reputation:

Here's a quick way to do it: use prettify() to lay out the document, then capture the indentation and the opening tag names with a regex (in this case it matches uppercase names inside opening tags). If the indentation produced by prettify() is within the requested depth, print the tag name with the specified indentation size.

import requests, bs4
import re

maxdepth = 1
indent_size = 2
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml').prettify()

for line in x.split("\n"):
    # Raw string so that \s is not treated as an invalid string escape.
    match = re.match(r"(\s*)<([A-Z]+)>", line)
    if match and len(match.group(1)) <= maxdepth:
        print(indent_size*match.group(1) + match.group(2))

Upvotes: 2
