Scraping XML element attributes with BeautifulSoup

Question

I have the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://api.stlouisfed.org/fred/...")
bsObj = BeautifulSoup(html.read(), "lxml")

print(bsObj)

It returns XML made up of <observation> elements, each carrying realtime_start, realtime_end, date and value attributes.

I want to extract only the "date" and the "value", so that I end up with something like this:

1947-04-01 -0.4
1947-07-01 -0.4
1947-10-01 6.4
1948-01-01 6
and so on...

So far I'm using replace to strip out the unwanted text, and the csv module to write the CSV file:

string = str(bsObj)

string = string.replace("realtime_start=","")
string = string.replace("realtime_end=","")
string = string.replace("observation","")
string = string.replace("date=","")
string = string.replace('"2016-06-22"',"")
string = string.replace("value=","")
string = string.replace("<","")
string = string.replace(">","")
string = string.replace("/","")
string = string.replace('"',"")
print(string)

import csv
with open('test.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = string
    a.writerows(data)

This is close to a disaster, though. It pushes the text into the CSV file, but every character ends up on its own row.

I want to know if there is a more elegant way to extract what I need. For example:

for line in f:
   extract "date" and "value"

or similar. And what is the most appropriate way to write it to a .csv file? I'll be rewriting the .csv file every time I call this script. The fields have to be separated by "," and the lines by "\n".

Padraic Cunningham · Accepted Answer

Find all the observation tags and extract just the attributes you want:

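# Sample input rebuilt from the output below; the attribute layout is assumed from the question's code.
x = """
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1947-04-01" value="-0.4"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1947-07-01" value="-0.4"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1947-10-01" value="6.4"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1948-01-01" value="6"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1948-04-01" value="6.7"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1948-07-01" value="2.3"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1948-10-01" value="0.4"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1949-01-01" value="-5.4"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1949-04-01" value="-1.3"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1949-07-01" value="4.5"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1949-10-01" value="-3.5"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1950-01-01" value="16.9"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1950-04-01" value="12.7"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1950-07-01" value="16.3"/>
"""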

from bs4 import BeautifulSoup

soup = BeautifulSoup(x,"lxml")

for ob in soup.find_all("observation"):
    print(ob["date"])
    print(ob["value"])

Which will give you:

1947-04-01
-0.4
1947-07-01
-0.4
1947-10-01
6.4
1948-01-01
6
1948-04-01
6.7
1948-07-01
2.3
1948-10-01
0.4
1949-01-01
-5.4
1949-04-01
-1.3
1949-07-01
4.5
1949-10-01
-3.5
1950-01-01
16.9
1950-04-01
12.7
1950-07-01
16.3
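
If you would rather have each date next to its value, as in the question, a small variation joins the two attributes on one line. This is only a sketch: the two-row sample string and the names sample and lines are made up for illustration (the attribute layout is assumed from the question's code), and it uses the built-in "html.parser" instead of "lxml" so nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample; the attribute layout is assumed from the question's code.
sample = """
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1947-04-01" value="-0.4"/>
<observation realtime_start="2016-06-22" realtime_end="2016-06-22" date="1947-07-01" value="-0.4"/>
"""

soup = BeautifulSoup(sample, "html.parser")

# Build one "date value" string per <observation> tag.
lines = ["{} {}".format(ob["date"], ob["value"])
         for ob in soup.find_all("observation")]

print("\n".join(lines))
```

Indexing with ob["date"] raises KeyError if a tag lacks the attribute; ob.get("date") returns None instead, which may be preferable if the feed is not perfectly regular.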

To write to a csv:

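from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(x, "lxml")
# newline="" stops the csv module from writing blank rows on Windows
with open("out.csv", "w", newline="") as f:
    # one (date, value) row per <observation> tag
    csv.writer(f).writerows((ob["date"], ob["value"])
                            for ob in soup.find_all("observation"))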

Which gives you a csv file with:

1947-04-01,-0.4
1947-07-01,-0.4
1947-10-01,6.4
1948-01-01,6
1948-04-01,6.7
1948-07-01,2.3
1948-10-01,0.4
1949-01-01,-5.4
1949-04-01,-1.3
1949-07-01,4.5
1949-10-01,-3.5
1950-01-01,16.9
1950-04-01,12.7
1950-07-01,16.3
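
For what it's worth, the attempt in the question put every character on its own row because writerows expects an iterable of rows, and iterating over a string yields single characters. The sketch below shows the difference using in-memory buffers: io.StringIO stands in for the real file, the names bad, good and rows are illustrative, and the two rows are copied from the output above. It also passes lineterminator="\n", since the question asks for "\n" line endings and csv.writer defaults to "\r\n".

```python
import csv
import io

# Passing a string: each character is treated as its own one-field row.
bad = io.StringIO()
csv.writer(bad, lineterminator="\n").writerows("ab")
print(repr(bad.getvalue()))

# Passing an iterable of (date, value) tuples: one row per tuple.
rows = [("1947-04-01", "-0.4"), ("1947-07-01", "-0.4")]
good = io.StringIO()
csv.writer(good, lineterminator="\n").writerows(rows)
print(repr(good.getvalue()))
```

When writing to a real file, opening it with newline="" should keep the "\n" terminator from being translated by the platform.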
