Reputation: 125912

How can I scrape an HTML table to CSV?

The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?

My First Idea

Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into notepad and save as a CSV. Any better ideas?

The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.

Upvotes: 44

Answers (12)

Kevin E

Reputation: 3326

If data inside the table is "rectangular"—no <caption>s, colspans, or rowspans— you may be able to use command-line tools (xidel, pup, jq) and skills you already know (XPath, CSS selectors).

Xidel is good for stuff like this. Often, as in this case, it's easier to use tab as the delimiter unless you must have CSV.

URL=https://nssdc.gsfc.nasa.gov/planetary/factsheet/
xidel $URL -e $'//table[1]//tr/string-join(td|th,"\t")'

Similarly, for rudimentary tables, you can achieve similar results combining curl, pup, and jq, without too much effort.

URL=https://nssdc.gsfc.nasa.gov/planetary/factsheet/moonfact.html
curl -s $URL \
  | pup 'table:nth-of-type(1) tr json{}' \
  | jq --raw-output 'map(.children | map(.text))[] | @tsv'  # or @csv

However, if a cell's text is so much inside an <a> or a <b> tag, more effort on the jq end would be required. More than I wanted to mess with for this example.

If the web site requires authentication, and that authentication isn't too complicated, then something like

curl -c cookies.txt -F username=you \
     -F password="$(read -sp 'Password> '; echo $REPLY)" \
     $POST_URL_FOR_LOGIN

# then
curl -b cookies.txt $URL_THAT_REQUIRES_AUTHENTICATION

…may allow you to proceed as above.

If the page content you're interested in targeting is generated by JavaScript, then nothing else suggested here will work. Designing your site so that it gracefully degrades without JavaScript seems like it would be an obvious best practice, but it's not widely adhered to.

#!/usr/bin/python

from bs4 import BeautifulSoup
import sys
import re
import csv
import grequests
import time

def cell_text(cell):
    return " ".join(cell.stripped_strings)

def parse_table(body_html):
    soup = BeautifulSoup(body_html)
    for table in soup.find_all('table'):
        for row in table.find_all('tr'):
            col = map(cell_text, row.find_all(re.compile('t[dh]')))
            print(col)

def process_a_page(response, *args, **kwargs): 
    parse_table(response.content)

def download_a_chunk(k):
    chunk_size = 10 #number of html pages
    x = "http://www.blahblah....com/inclusiones.php?p="
    x2 = "&name=..."
    URLS = [x+str(i)+x2 for i in range(k*chunk_size, k*(chunk_size+1)) ]
    reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in URLS]
    resp = grequests.map(reqs, size=10)

# download slowly so the server does not block you
for k in range(0,500):
    print("downloading chunk ",str(k))
    download_a_chunk(k)
    time.sleep(11)

Upvotes: 1

Aviad

Reputation: 349

Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:

from BeautifulSoup import BeautifulSoup

def table2csv(html_txt):
   csvs = []
   soup = BeautifulSoup(html_txt)
   tables = soup.findAll('table')

   for table in tables:
       csv = ''
       rows = table.findAll('tr')
       row_spans = []
       do_ident = False

       for tr in rows:
           cols = tr.findAll(['th','td'])

           for cell in cols:
               colspan = int(cell.get('colspan',1))
               rowspan = int(cell.get('rowspan',1))

               if do_ident:
                   do_ident = False
                   csv += ','*(len(row_spans))

               if rowspan > 1: row_spans.append(rowspan)

               csv += '"{text}"'.format(text=cell.text) + ','*(colspan)

           if row_spans:
               for i in xrange(len(row_spans)-1,-1,-1):
                   row_spans[i] -= 1
                   if row_spans[i] < 1: row_spans.pop()

           do_ident = True if row_spans else False

           csv += '\n'

       csvs.append(csv)
       #print csv

   return '\n\n'.join(csvs)

Upvotes: 2

n8henrie

Reputation: 2965

Two ways come to mind (especially for those of us that don't have Excel):

Google Spreadsheets has an excellent importHTML function:
- =importHTML("http://example.com/page/with/table", "table", index
- Index starts at 1
- I recommend a copy and paste values shortly after import
- File -> Download as -> CSV
Python's superb Pandas library has handy read_html and to_csv functions
- Here's a basic Python3 script that prompts for the URL, which table at that URL, and a filename for the CSV.

Upvotes: 5

Juan A. Navarro

Reputation: 11035

This is my python version using the (currently) latest version of BeautifulSoup which can be obtained using, e.g.,

$ sudo easy_install beautifulsoup4

The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.

#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv

def cell_text(cell):
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read())
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        col = map(cell_text, row.find_all(re.compile('t[dh]')))
        output.writerow(col)
    output.writerow([])

Upvotes: 11

Thorvaldur

Reputation: 551

using python:

for example imagine you want to scrape forex quotes in csv form from some site like:fxquotes

then...

from BeautifulSoup import BeautifulSoup
import urllib,string,csv,sys,os
from string import replace

date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1,cur2 = 'USD','AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 +'&exch2=' + cur1
fx_url = fx_url +'&expr=' + cur2 +  '&expr2=' + cur2 + fx_url_end
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = replace(data,'[<pre>','')
data = replace(data,'</pre>]','')
file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
file = open(file_name,"w")
file.write(data)
file.close()

edit: to get values from a table: example from: palewire

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table", border=1)

for row in table.findAll('tr')[1:]:
    col = row.findAll('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print "|".join(record)

Upvotes: 13

dkretz

Reputation: 37655

Even easier (because it saves it for you for next time) ...

In Excel

Data/Import External Data/New Web Query

will take you to a url prompt. Enter your url, and it will delimit available tables on the page to import. Voila.

Upvotes: 5

Christian Payne

Reputation: 7169

Excel can open a http page.

Eg:

Click File, Open
Under filename, paste the URL ie: How can I scrape an HTML table to CSV?
Click ok

Excel does its best to convert the html to a table.

Its not the most elegant solution, but does work!

Upvotes: 2

andy

Reputation: 11

If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the html along with some scripting to generate a CSV.

Upvotes: 0

Will Rickards

Reputation: 2786

Have you tried opening it with excel? If you save a spreadsheet in excel as html you'll see the format excel uses. From a web app I wrote I spit out this html format so the user can export to excel.

Upvotes: 0

James Van Huis

Reputation: 5571

Quick and dirty:

Copy out of browser into Excel, save as CSV.

Better solution (for long term use):

Write a bit of code in the language of your choice that will pull the html contents down, and scrape out the bits that you want. You could probably throw in all of the data operations (sorting, averaging, etc) on top of the data retrieval. That way, you just have to run your code and you get the actual report that you want.

It all depends on how often you will be performing this particular task.

Upvotes: 2

How can I scrape an HTML table to CSV?

The Problem

My First Idea

The Solution

Answers (12)

See also:

Related Questions