Selase

Reputation: 389

Script to extract data from a web page

I am looking to extract some parts of the data rendered on a web page. I am able to pull the entire page and save it as a raw text file using the command below.

curl http://webpage -o "raw.txt"

I'm just wondering whether there are better alternatives, and what advantages they would offer.

Upvotes: 9

Views: 74667

Answers (4)

sberry

Reputation: 131978

I would use a combination of requests and BeautifulSoup.

from bs4 import BeautifulSoup
import requests

# Reuse one session so cookies and connection pooling carry across requests
session = requests.session()
req = session.get('http://stackoverflow.com/questions/10807081/script-to-extract-data-from-wbpage')

# Tell BeautifulSoup which parser to use, then pull out just the elements you need
doc = BeautifulSoup(req.content, 'html.parser')
print(doc.find_all('a', {'class': 'gp-share'}))
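If you only need particular fields from those matches, each tag can be read like a dictionary; a small sketch building on the snippet above (the gp-share class is just the example selector used there):

# Each matched tag exposes its attributes and text
for link in doc.find_all('a', {'class': 'gp-share'}):
    print(link.get('href'), link.get_text(strip=True))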

Upvotes: 7

Syam Sathyan

Reputation: 11

  1. Save / Process a Single Web Resource: the curl approach above works well for a single file or web resource. You can also pipe the output through a regex and keep or drop data matching a preset pattern, e.g. to save all the tag source URLs (see the regex sketch after this list).

  2. Save / Process an Entire Directory or a Website Recursively: use a Python or Perl script that iteratively pulls down all the links and resources belonging to a page or to a site's DNS name. In Python I would use an HTTP library (e.g. urllib) and parse the tags recursively (make sure to set a depth limit, or with a large website you might end up saving gigabytes of data!). An easy and safe bet is Beautiful Soup, a Python library that can scrape web data and navigate and search the parse tree of a remote web resource. It can also modify the parsed content locally, etc. (see the crawler sketch after this list).
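A minimal sketch of the regex idea from point 1, assuming you want every src attribute on the page; http://webpage is just the placeholder URL from the question, and the same pattern could equally be applied with a shell pipe into grep:

import re
import requests

# Chop the raw HTML with a preset pattern: here, every tag source URL (src="...")
html = requests.get('http://webpage').text
for url in re.findall(r'src="([^"]+)"', html):
    print(url)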
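And a minimal sketch of the depth-limited crawl from point 2, assuming requests and Beautiful Soup; MAX_DEPTH, START_URL and the print-instead-of-save step are placeholders to adapt:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 2                   # stop here so a large site doesn't fill your disk
START_URL = 'http://webpage'    # placeholder URL from the question

def crawl(url, depth=0, seen=None):
    if seen is None:
        seen = set()
    if depth > MAX_DEPTH or url in seen:
        return
    seen.add(url)
    resp = requests.get(url)
    print('  ' * depth + url)   # or save resp.content to disk instead
    doc = BeautifulSoup(resp.content, 'html.parser')
    for link in doc.find_all('a', href=True):
        target = urljoin(url, link['href'])
        # stay on the same DNS name, as the answer suggests
        if urlparse(target).netloc == urlparse(START_URL).netloc:
            crawl(target, depth + 1, seen)

crawl(START_URL)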

Upvotes: 1

Gilles Quénot

Reputation: 185005

cURL is a good start. A better command line would be:

curl -A "Mozilla/5.0" -L -k -b /tmp/c -c /tmp/c -s http://url.tld

because it handles cookies (-b/-c), sends a browser user agent (-A), follows redirects (-L), accepts self-signed SSL certificates (-k) and runs silently (-s).

See man curl

Upvotes: 1

HAL

Reputation: 2061

Your example command fetches the entire web page. If you want to parse the page and extract specific information, I suggest you use an existing parser.

I usually use BeautifulSoup for extracting data from HTML pages.

Upvotes: 0
