Reputation: 123
I am trying to export my data as a .txt file
from bs4 import BeautifulSoup
import requests
import os
os.getcwd()
'/home/folder'
os.mkdir("Probeersel6")
os.chdir("Probeersel6")
os.getcwd()
'/home/Desktop/folder'
os.mkdir("img") #now `folder`
url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("article", {"class": "article"})
with open(""%s".txt", "wb" %(url)) as file:
    for item in data:
        print item.contents[0].find_all("time", {"datetime": "2016-03-16T09:50:30+0100"})[0].text
        print item.contents[0].find_all("a", {"class": "link-grey"})[0].text
        print "\n"
        print item.contents[0].find_all("img", {"class": "media-full"})[0]
        print "\n"
        print item.contents[1].find_all("div", {"class": "article_textwrap"})[0].text
        file.write()
What should I pass to file.write() to make this work?
I am also trying to give the .txt file the same name as the URL. Should I do that with a string, like this?
with open(""%s".txt", "wb" %(url)) as file:
url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
Upvotes: 5
Views: 22434
Reputation:
You should pass your content to file.write(). I would probably do something like:
#!/usr/bin/python3
#
from bs4 import BeautifulSoup
import requests
url = 'http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html'
# Use the last URL path segment, without its extension, as the file name.
file_name = url.rsplit('/', 1)[1].rsplit('.', 1)[0]
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
data = soup.find_all('article', {'class': 'article'})
# Build one text block per article: time, link text, image tag, body text.
content = ''.join(
    '{}\n{}\n\n{}\n{}'.format(
        item.contents[0].find_all('time', {'datetime': '2016-03-16T09:50:30+0100'})[0].text,
        item.contents[0].find_all('a', {'class': 'link-grey'})[0].text,
        item.contents[0].find_all('img', {'class': 'media-full'})[0],
        item.contents[1].find_all('div', {'class': 'article_textwrap'})[0].text,
    )
    for item in data
)
with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
    file.write(content)
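If you want the file-name step on its own, here is a minimal sketch that assumes Python 3 and uses only the standard library. The helper names filename_from_url and save_text, and the "index" fallback for URLs with no path segment, are my own choices for illustration and not part of requests or BeautifulSoup.

```python
import os
from urllib.parse import urlsplit


def filename_from_url(url, extension=".txt"):
    """Return a file name built from the last path segment of url."""
    # urlsplit() separates the path from the scheme, host, query and fragment.
    path = urlsplit(url).path
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    # splitext() removes only the final extension, e.g. ".html".
    stem, _ = os.path.splitext(last_segment)
    # Hypothetical fallback: use "index" when the URL has no path segment.
    return (stem or "index") + extension


def save_text(url, text, directory="."):
    """Write text as UTF-8 to a file in directory named after url."""
    target = os.path.join(directory, filename_from_url(url))
    with open(target, mode="w", encoding="utf-8") as handle:
        handle.write(text)
    return target
```

Calling save_text(url, content) with the url and content from the script above should create 2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.txt in the current directory. Going through urlsplit() also keeps a query string or fragment out of the file name, which plain string splitting would not do.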
Upvotes: 5
Reputation: 1247
I was working on a web scraping project, and this issue gave me tons of problems. I tried almost every solution out there that dealt with Python encoding (converting to UTF-8 with string.encode(), converting to ASCII, converting with the 'unicodedata' module, using .decode() and then .encode(), a blood sacrifice to Tim Peters, etc.).
None of the solutions worked all the time, which struck me as very un-Pythonic.
So what I ended up using was the following:
html = bs.prettify()  # bs is your BeautifulSoup object
with open("out.txt", "w") as out:
    for ch in html:
        try:
            out.write(ch)
        except Exception:
            pass  # skip any character that cannot be written
It's not perfect, but it gave me the best results. When I opened it in a browser, it was able to parse the page properly almost every time.
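For comparison, a possible alternative to writing one character at a time is to let open() deal with the problem characters. This is only a sketch, assuming Python 3, where the built-in open() accepts the encoding and errors arguments. The function name write_text_lossy is made up for this example.

```python
def write_text_lossy(path, text, encoding="utf-8", errors="replace"):
    """Write text to path, substituting characters the encoding cannot represent."""
    # With errors="replace", an unencodable character is written as "?"
    # instead of raising UnicodeEncodeError.
    with open(path, "w", encoding=encoding, errors=errors) as out:
        out.write(text)
```

With the default encoding="utf-8", ordinary page text should come through unchanged, so write_text_lossy("out.txt", bs.prettify()) would normally keep every character. The replacement only matters if you pick a narrower encoding such as "ascii".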
Upvotes: 0