PascaleFM

Reputation: 33

How to extract xml tags with BeautifulSoup?

I am trying to extract the tags from this data:

[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},

But I cannot seem to get the tags; I am trying:

# Import BeautifulSoup
from bs4 import BeautifulSoup as bs
content = []
# Read the XML file
with open("file.xml", "r") as file:
    # Read each line in the file
    content = file.readlines()
    # Combine the lines in the list into a string
    content = "".join(content)
    bs_content = bs(content, "lxml")

result = bs_content.find_all("title")
print(result)

But I only get an empty list, []. I would appreciate any help!

Upvotes: -1

Views: 30

Answers (1)

HedgeHog

Reputation: 25048

This is not XML; it is a JSON-like structure, so BeautifulSoup finds no <title> tags in it. Simply iterate over the list of dicts:

l = [{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"},]

for d in l:
    print(d['title'])

Or, if you have the data as a string, convert it first with json.loads():

import json

l = '[{"title":"Joshua Cohen","nid":"21706","type":"winner","changed":"1651960857","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"640"}]},"field_abbr_citation":{"und":[{"safe_value":"A mordant, linguistically deft historical novel about the ambiguities of the Jewish-American experience, presenting ideas and disputes as volatile as its tightly-wound plot."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Netanyahus: An Account of a Minor and Ultimately Even Negligible Episode in the History of a Very Famous Family"}]},"field_publisher":{"und":[{"safe_value":"New York Review Books"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/joshua-cohen"},{"title":"Louise Erdrich","nid":"21286","type":"winner","changed":"1623362816","field_category":{"und":[{"tid":"219"}]},"field_year":{"und":[{"tid":"632"}]},"field_abbr_citation":{"und":[{"safe_value":"A majestic, polyphonic novel about a community\u2019s efforts to halt the proposed displacement and elimination of several Native American tribes in the 1950s, rendered with dexterity and imagination."}]},"field_location_text":[],"field_publication":{"und":[{"safe_value":"The Night Watchman"}]},"field_publisher":{"und":[{"safe_value":"Harper"}]},"field_teaser_thumbnail":[],"path_alias":"winners\/louise-erdrich"}]'

for d in json.loads(l):
    print(d['title'])

Output:

Joshua Cohen
Louise Erdrich
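
Since the question reads the data from a file, the same idea can probably be applied with json.load() directly on the file handle. The following is only a sketch under a few assumptions: the file holds valid JSON (the sample in the question ends with a trailing comma and no closing bracket, which looks like truncation), the file name winners.json and the shortened sample are made up for illustration, and load_records() and first_safe_value() are hypothetical helpers, not part of any library. It also shows one way to reach the nested "safe_value" entries, given that empty fields appear as [] rather than as a dict in the data.

```python
import json
import os
import tempfile

# Hypothetical, shortened stand-in for the asker's file contents.
sample = (
    '[{"title": "Joshua Cohen", "field_location_text": [], '
    '"field_publisher": {"und": [{"safe_value": "New York Review Books"}]}}, '
    '{"title": "Louise Erdrich", "field_location_text": [], '
    '"field_publisher": {"und": [{"safe_value": "Harper"}]}}]'
)


def load_records(path):
    """Parse a JSON file and return the decoded Python object."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def first_safe_value(record, field):
    """Return record[field]["und"][0]["safe_value"], or None if it is missing."""
    value = record.get(field)
    # Empty fields are stored as [] instead of a dict, so guard against that.
    if not isinstance(value, dict):
        return None
    items = value.get("und") or []
    return items[0].get("safe_value") if items else None


# Write the sample to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "winners.json")  # assumed file name
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    records = load_records(path)

titles = [r["title"] for r in records]
publishers = [first_safe_value(r, "field_publisher") for r in records]

print(titles)
print(publishers)
```

With a real file, only the load_records() call with the actual path should be needed; the temporary file is there just to keep the sketch runnable on its own.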

Upvotes: 1
