Reputation: 389
I am looking to extract some parts of the data rendered on a web page. I am able to pull the entire data from the page and save it to a text file (raw) using the code below.
curl http://webpage -o "raw.txt"
Just wondering if there are other alternatives, and what advantages they might have.
Upvotes: 9
Views: 74667
Reputation: 131978
I would use a combination of requests and BeautifulSoup.
from bs4 import BeautifulSoup
import requests

# Reuse one session so cookies persist across requests
session = requests.Session()
req = session.get('http://stackoverflow.com/questions/10807081/script-to-extract-data-from-wbpage')

# Passing an explicit parser avoids bs4's "no parser specified" warning
doc = BeautifulSoup(req.content, 'html.parser')

# find_all is the modern name for findAll
print(doc.find_all('a', {'class': 'gp-share'}))
Upvotes: 7
Reputation: 11
Save / Process a Single Web Resource: The above approach works well for a single file/web resource. You can also pipe the output through a regex to chop/skip data based on a preset pattern, e.g. to save all the tag source URLs (a rough sketch follows).
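For instance, a minimal Python sketch of that regex pass; the URL and the src= pattern here are placeholders for illustration, not from the original answer:

import re
import urllib.request

# Fetch the raw page, then grab every src="..." value with a naive regex.
# Good enough for a quick chop/skip pass; use a real parser for anything more.
html = urllib.request.urlopen('http://example.com').read().decode('utf-8', 'replace')
for url in re.findall(r'src="([^"]+)"', html):
    print(url)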
Save / Process an Entire Directory or a Website Recursively: Use a Python or Perl script that can iteratively pull down all the links and resources belonging to a page or a website's DNS name. In Python I would use an HTTP library and parse the tags recursively (make sure to set a depth limit, or with a large website you might end up saving gigabytes of data!). An easy and safe bet is Beautiful Soup, a Python library that can scrape web data and navigate/search the parse tree of a remote web resource. It can also modify the parsed local contents. A sketch of such a crawler follows.
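A minimal sketch of the recursive approach, assuming requests and Beautiful Soup; the start URL, depth limit, and same-domain check are illustrative choices, not part of the original answer:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(url, domain, seen, depth, max_depth=2):
    # The depth limit keeps a large site from turning into gigabytes on disk
    if depth > max_depth or url in seen:
        return
    seen.add(url)
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return
    print(url)  # or save resp.content to disk here
    doc = BeautifulSoup(resp.content, 'html.parser')
    for a in doc.find_all('a', href=True):
        link = urljoin(url, a['href'])
        # Stay within the same DNS name, as suggested above
        if urlparse(link).netloc == domain:
            crawl(link, domain, seen, depth + 1, max_depth)

start = 'http://example.com'
crawl(start, urlparse(start).netloc, set(), 0)

The seen set prevents revisiting pages, and the depth check caps how far the recursion goes.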
Upvotes: 1
Reputation: 185005
cURL is a good start. A better command line would be:
curl -A "Mozilla/5.0" -L -k -b /tmp/c -c /tmp/c -s http://url.tld
because it handles cookies (-b/-c), sets a user-agent (-A), follows redirects (-L), tolerates invalid SSL certificates (-k), runs silently (-s) and other things.
See man curl
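For comparison, a rough requests equivalent of that command line; the flag mapping is my own reading, not from the original answer:

import requests

# Assumed mapping: -A -> User-Agent header, -L -> redirects (on by default
# for GET), -k -> verify=False, -b/-c -> the Session's in-memory cookie jar
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'
resp = session.get('http://url.tld', verify=False)
print(resp.text)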
Upvotes: 1
Reputation: 2061
Your example code will fetch all data from the web page. If you want to parse the web page and extract specific information, I suggest that you use an existing parser.
I usually use BeautifulSoup for extracting data from HTML pages.
Upvotes: 0