Ignore the text paragraphs below.
XML code, a formal recommendation from the World Wide Web Consortium (W3C), is similar to Hypertext Markup Language (HTML). Both XML and HTML contain markup symbols to describe page or file contents. HTML code describes Web page content (mainly text and graphic images) only in terms of how it is to be displayed and interacted with.
XML data is known as self-describing or self-defining, meaning that the structure of the data is embedded with the data, thus when the data arrives there is no need to pre-build the structure to store the data; it is dynamically understood within the XML. The XML format can be used by any individual or group of individuals or companies that want to share information in a consistent way. XML is actually a simpler and easier-to-use subset of the Standard Generalized Markup Language (SGML), which is the standard to create a document structure.
So far, I have used the code below to extract all 5 fields.
import requests
from bs4 import BeautifulSoup
import lxml
# contents holds the XML text to parse; how it is loaded is not shown here
soup = BeautifulSoup(contents, 'lxml')
a=[v.get_text() for v in soup.select('cia')]
v=[v.get_text() for v in soup.select('civ')]
p=[v.get_text() for v in soup.select('cip')]
y=[v.get_text() for v in soup.select('ciy')]
t=[v.get_text() for v in soup.select('cit')]
print (a)
print (v)
print (p)
print (y)
print (t)
Upvotes: 0
Views: 217
Reputation: 195573
You can try something like this: write a generator that yields one dictionary per journal entry in the XML file. Any field missing from an entry is stored as 'Blank' in that dictionary:
from bs4 import BeautifulSoup
data = """<CI_INFO>
<CI_JOURNAL>
<CI_AUTHOR>CAMPBELL D</CI_AUTHOR>
<CI_VOLUME>0079</CI_VOLUME>
<CI_PAGE>00034</CI_PAGE>
<CI_YEAR>2013</CI_YEAR>
<CI_TITLE> <![CDATA[ ALASKA MAGAZINE FEB ]]></CI_TITLE>
</CI_JOURNAL>
<CI_JOURNAL>
<CI_AUTHOR>BURKE CH</CI_AUTHOR>
<CI_YEAR>1961</CI_YEAR>
<CI_TITLE> <![CDATA[ DOCTOR HAP ]]> </CI_TITLE>
</CI_JOURNAL>
<CI_JOURNAL>
<CI_YEAR>1905</CI_YEAR>
<CI_TITLE> <![CDATA[ REPORT GOVERNOR ALAS ]]></CI_TITLE>
</CI_JOURNAL>
</CI_INFO>"""
def parse_data(soup):
    _text = lambda soup, name: soup.find(name).text.strip() if soup.find(name) else 'Blank'

    for j in soup.select('CI_JOURNAL'):
        d = {}
        d['author'] = _text(j, 'CI_AUTHOR')
        d['vol'] = _text(j, 'CI_VOLUME')
        d['page'] = _text(j, 'CI_PAGE')
        d['year'] = _text(j, 'CI_YEAR')
        d['title'] = _text(j, 'CI_TITLE')
        yield d

for info in parse_data(BeautifulSoup(data, 'xml')):
    print(info['author'])
    print(info['vol'])
    print(info['page'])
    print(info['year'])
    print(info['title'])
    print('-' * 80)
This will print:
CAMPBELL D
0079
00034
2013
ALASKA MAGAZINE FEB
--------------------------------------------------------------------------------
BURKE CH
Blank
Blank
1961
DOCTOR HAP
--------------------------------------------------------------------------------
Blank
Blank
Blank
1905
REPORT GOVERNOR ALAS
--------------------------------------------------------------------------------
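As an alternative that needs only the standard library, the same "fill missing fields with a default" idea can be sketched with `xml.etree.ElementTree`, whose `findtext` method accepts a `default` argument. This is just a sketch under assumptions: the names `SAMPLE`, `FIELDS` and `parse_journals` are made up for illustration, and the sample is a shortened copy of the data above.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the XML structure used above.
SAMPLE = """<CI_INFO>
<CI_JOURNAL>
<CI_AUTHOR>CAMPBELL D</CI_AUTHOR>
<CI_VOLUME>0079</CI_VOLUME>
<CI_PAGE>00034</CI_PAGE>
<CI_YEAR>2013</CI_YEAR>
<CI_TITLE> <![CDATA[ ALASKA MAGAZINE FEB ]]></CI_TITLE>
</CI_JOURNAL>
<CI_JOURNAL>
<CI_YEAR>1905</CI_YEAR>
<CI_TITLE> <![CDATA[ REPORT GOVERNOR ALAS ]]></CI_TITLE>
</CI_JOURNAL>
</CI_INFO>"""

# Hypothetical mapping from output key to XML tag name.
FIELDS = {
    'author': 'CI_AUTHOR',
    'vol': 'CI_VOLUME',
    'page': 'CI_PAGE',
    'year': 'CI_YEAR',
    'title': 'CI_TITLE',
}

def parse_journals(xml_text, default='Blank'):
    """Return one dict per CI_JOURNAL, using `default` for missing tags."""
    root = ET.fromstring(xml_text)
    rows = []
    for journal in root.iter('CI_JOURNAL'):
        # findtext returns `default` when the child tag is absent.
        rows.append({
            key: journal.findtext(tag, default=default).strip()
            for key, tag in FIELDS.items()
        })
    return rows

for row in parse_journals(SAMPLE):
    print(row)
```

Note that a tag that is present but empty yields an empty string here, not 'Blank'; only absent tags get the default.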
EDIT:
If you want separated columns, you can do this:
author, vol, page, year, title = [], [], [], [], []
for d in parse_data(BeautifulSoup(data, 'xml')):
    author.append(d['author'])
    vol.append(d['vol'])
    page.append(d['page'])
    year.append(d['year'])
    title.append(d['title'])
print(author)
print(vol)
print(page)
print(year)
print(title)
This prints:
['CAMPBELL D', 'BURKE CH', 'Blank']
['0079', 'Blank', 'Blank']
['00034', 'Blank', 'Blank']
['2013', '1961', '1905']
['ALASKA MAGAZINE FEB', 'DOCTOR HAP', 'REPORT GOVERNOR ALAS']
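If the number of fields grows, keeping five parallel lists by hand gets tedious. A possible variation, sketched below, collects the columns into a single dictionary of lists. The helper name `to_columns` is hypothetical, and `rows` is a hard-coded copy of the dictionaries shown in the output above so that the snippet runs on its own.

```python
# Hard-coded copy of the parsed records, for illustration only.
rows = [
    {'author': 'CAMPBELL D', 'vol': '0079', 'page': '00034',
     'year': '2013', 'title': 'ALASKA MAGAZINE FEB'},
    {'author': 'BURKE CH', 'vol': 'Blank', 'page': 'Blank',
     'year': '1961', 'title': 'DOCTOR HAP'},
    {'author': 'Blank', 'vol': 'Blank', 'page': 'Blank',
     'year': '1905', 'title': 'REPORT GOVERNOR ALAS'},
]

def to_columns(rows):
    """Turn a list of row dicts into a dict mapping key -> list of values."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

columns = to_columns(rows)
for name, values in columns.items():
    print(name, values)
```

Because every row carries the same keys (missing fields were already filled with 'Blank'), all columns end up the same length and stay aligned by index.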
EDIT:
For printing with '\t', you can use this code:
print('>\t' + str(author))
print('\t' + str(vol))
print('\t' + str(page))
print('\t' + str(year))
print('\t' + str(title))
This will print:
>       ['CAMPBELL D', 'BURKE CH', 'Blank']
        ['0079', 'Blank', 'Blank']
        ['00034', 'Blank', 'Blank']
        ['2013', '1961', '1905']
        ['ALASKA MAGAZINE FEB', 'DOCTOR HAP', 'REPORT GOVERNOR ALAS']
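If what you actually need is tab-separated values rather than the Python list representation, one option is to join each list with '\t'. This is only a guess at the intended output; `tab_lines` is a made-up helper name and the lists are copied from the output above so the snippet is self-contained.

```python
# Column lists copied from the output above, for illustration only.
author = ['CAMPBELL D', 'BURKE CH', 'Blank']
vol = ['0079', 'Blank', 'Blank']
page = ['00034', 'Blank', 'Blank']
year = ['2013', '1961', '1905']
title = ['ALASKA MAGAZINE FEB', 'DOCTOR HAP', 'REPORT GOVERNOR ALAS']

def tab_lines(columns):
    """Join the string values of each column with a tab character."""
    return ['\t'.join(values) for values in columns]

for line in tab_lines([author, vol, page, year, title]):
    print(line)
```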
Upvotes: 3