Reputation:
How do I convert the contents of an HTML table (<table>
) to CSV format? Is there a library or linux program that does this? This is similar to copy tables in Internet Explorer, and pasting them into Excel.
Upvotes: 76
Views: 177919
Reputation: 161
Considering that:
We can require a script that accepts a URI and converts each HTML table to CSV. For this purpose I advice to use XPath and whatever programming language of your choice that implements an API for it.
The following is a Bash script that calls a PHP script (note #!/usr/bin/env php
on first line).
#!/usr/bin/env php
<?php
$input_file = $argv[1];
$html_string = file_get_contents($input_file)
or exit("Ops! unable to open file: ($input_file)");
$dom = new DOMDocument();
//@$dom->loadHTMLFile( $input_file );
@$dom->loadHTML($html_string);
$xpath = new DOMXPath($dom);
$base_name = basename($input_file, ".html");
$tables = $dom->getElementsByTagName('table');
for ($i=0, $n = $tables->length; $i < $n; ++$i) {
$file_csv = fopen("{$base_name}_table_".($i+1).".csv", 'w');
$rows = $xpath->query(".//tr", $tables->item($i));
foreach ($rows as $row) {
$arr = array();
$cells = $xpath->query("td|th", $row);
foreach ($cells as $cell) {
$arr[] = trim($cell->textContent);
}
fputcsv($file_csv, $arr);
unset($arr);
} //rows
fclose($file_csv);
} //tables
?>
Save the script as html2csv
and grant execute permissions:
$ chmod +x html2csv
Type ./html2csv URL
or ./html2csv file.html
Upvotes: 0
Reputation: 22570
Here is an example using pQuery and Spreadsheet::WriteExcel:
use strict;
use warnings;
use Spreadsheet::WriteExcel;
use pQuery;
my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
my $sheet = $workbook->add_worksheet;
my $row = 0;
pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
my $col = 0;
pQuery( $_ )->find( 'td' )->each( sub{
$sheet->write( $row, $col++, $_->innerHTML );
});
$row++;
});
$workbook->close;
The example simply extracts all tr tags that it finds into an excel file. You can easily tailor it to pick up specific table or even trigger a new excel file per table tag.
Further things to consider:
To see if rowspan or colspan is being used you can:
pQuery( $data )->find( 'td' )->each( sub{
my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
});
Upvotes: 1
Reputation: 8596
This is based on atomicules' answer but more succinct and also processes th
(header) cells as well as td
cells. I also added the strip
method to get rid of the extra whitespaces.
CSV.open("output.csv", 'w') do |csv|
doc.xpath('//table//tr').each do |row|
csv << row.xpath('th|td').map {|cell| cell.text.strip}
end
end
Wrapping the code inside the CSV block ensures that the file will be closed properly.
If you just want the text and don't need to write it to a file, you can use this:
doc.xpath('//table//tr').inject('') do |result, row|
result << row.xpath('th|td').map {|cell| cell.text.strip}.to_csv
end
Upvotes: 1
Reputation: 2118
Here's the approach I took using only tr
and sed
:
< table.txt tr -d '\n' |
sed -e 's/<tr[^>]*>/\n/g' -e 's/<[^>]*t[dh]>/,/g' -e 's/<[^>]*>//g'
Explanation
tr -d '\n'
delete newlines's/<tr[^>]*>/\n/g'
convert tr tags into newlines to break data into table rows's/<[^>]*t[dh]>/,/g'
convert closing td/th tags into commas's/<[^>]*>//g'
delete all other html tagsSample input
(from an Outlook email that attempted to render an HTML table using MsoNormal):
<table class=3D"MsoNormalTable" border=3D"0" cellspacing=3D"0" cellpadding=3D"0" width=3D"420" style=3D"width:315.0pt;border-collapse:collapse">
<tbody>
<tr style=3D"height:15.0pt">
<td width=3D"107" nowrap=3D"" style=3D"width:80.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"107" nowrap=3D"" valign=3D"bottom" style=3D"width:80.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"64" nowrap=3D"" valign=3D"bottom" style=3D"width:48.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"79" nowrap=3D"" valign=3D"bottom" style=3D"width:59.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
<td width=3D"64" nowrap=3D"" valign=3D"bottom" style=3D"width:48.0pt;padding:0in 0in 0in 0in;height:15.0pt">
</td>
</tr>
<tr style=3D"height:6.75pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:6.75pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><b><span style=3D"color:black">ID</span></b><b><span style=3D"color:black"><o:p></o:p></span></b></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-left:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><b><span style=3D"color:black">Price<o:p></o:p></span></b></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-top:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">064159Q</span><span style=3D"color:black"><o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">121.85<o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
<tr style=3D"height:15.0pt">
<td nowrap=3D"" valign=3D"bottom" style=3D"border:solid windowtext 1.0pt;border-top:none;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">2420128</span><span style=3D"color:black"><o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 0in 0in 0in;height:15.0pt">
<p class=3D"MsoNormal" align=3D"center" style=3D"text-align:center"><span style=3D"color:black">10.00<o:p></o:p></span></p>
</td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
<td nowrap=3D"" valign=3D"bottom" style=3D"padding:0in 0in 0in 0in;height:15.0pt"></td>
</tr>
</tbody>
</table>
Sample output
,,,,,
,,,,,
ID,Price,,,,
064159Q,121.85,,,,
2420128,10.00,,,,
See Non-greedy regex matching in sed for a discussion of the approach.
Upvotes: 4
Reputation: 141
Here's an updated version of Yuvai's answer, which properly handles fields that require quoting (i.e. fields that contain commas in the data, double quotes, or span multiple lines)
#!/usr/bin/env python3
from html.parser import HTMLParser
import sys
import re
class HTMLTableParser(HTMLParser):
def __init__(self, row_delim="\n", cell_delim=","):
HTMLParser.__init__(self)
self.despace_re = re.compile("\s+")
self.data_interrupt = False
self.first_row = True
self.first_cell = True
self.in_cell = False
self.row_delim = row_delim
self.cell_delim = cell_delim
self.quote_buffer = False
self.buffer = None
def handle_starttag(self, tag, attrs):
self.data_interrupt = True
if tag == "table":
self.first_row = True
self.first_cell = True
elif tag == "tr":
if not self.first_row:
sys.stdout.write(self.row_delim)
self.first_row = False
self.first_cell = True
self.data_interrupt = False
elif tag == "td" or tag == "th":
if not self.first_cell:
sys.stdout.write(self.cell_delim)
self.first_cell = False
self.data_interrupt = False
self.in_cell = True
elif tag == "br":
self.quote_buffer = True
self.buffer += self.row_delim
def handle_endtag(self, tag):
self.data_interrupt = True
if tag == "td" or tag == "th":
self.in_cell = False
if self.buffer != None:
# Quote if needed...
if self.quote_buffer or self.cell_delim in self.buffer or "\"" in self.buffer:
# Need to quote! First, replace all double-quotes with quad-quotes
self.buffer = self.buffer.replace("\"", "\"\"")
self.buffer = "\"{0}\"".format(self.buffer)
sys.stdout.write(self.buffer)
self.quote_buffer = False
self.buffer = None
def handle_data(self, data):
if self.in_cell:
#if self.data_interrupt:
# sys.stdout.write(" ")
if self.buffer == None:
self.buffer = ""
self.buffer += self.despace_re.sub(" ", data).strip()
self.data_interrupt = False
parser = HTMLTableParser()
parser.feed(sys.stdin.read())
One enhancement for this script could be to add support for specifying a different line delimiter (or auto-calculate the platform-correct one), and a different column delimiter.
Upvotes: 1
Reputation: 809
This is a very old thread, but may be someone like me will bump into it. I have made some additions for the audiodude's script to read the html from file instead adding it to the code, and another parameter that controls printing of the header lines.
the script should be run like that
ruby <script_name> <file_name> [<print_headers>]
the code is:
require 'nokogiri'
print_header_lines = ARGV[1]
File.open(ARGV[0]) do |f|
table_string=f
doc = Nokogiri::HTML(table_string)
doc.xpath('//table//tr').each do |row|
if print_header_lines
row.xpath('th').each do |cell|
print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
end
end
row.xpath('td').each do |cell|
print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
end
print "\n"
end
end
Upvotes: 2
Reputation: 2213
Based on audiodude's answer, but simplified by using the built-in CSV library
require 'nokogiri'
require 'csv'
doc = Nokogiri::HTML(table_string)
csv = CSV.open("output.csv", 'w')
doc.xpath('//table//tr').each do |row|
tarray = [] #temporary array
row.xpath('td').each do |cell|
tarray << cell.text #Build array of that row of data.
end
csv << tarray #Write that row out to csv file
end
csv.close
I did wonder if there was any way to take the Nokogiri NodeSet (row.xpath('td')
) and write this out as an array to the csv file in one step. But I could only figure out doing it by iterating over each cell and building the temporary array of each cell's content.
Upvotes: 4
Reputation: 3443
Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan
or rowspan
.
Enjoy.
from HTMLParser import HTMLParser
import sys
import re
class HTMLTableParser(HTMLParser):
def __init__(self, row_delim="\n", cell_delim="\t"):
HTMLParser.__init__(self)
self.despace_re = re.compile(r'\s+')
self.data_interrupt = False
self.first_row = True
self.first_cell = True
self.in_cell = False
self.row_delim = row_delim
self.cell_delim = cell_delim
def handle_starttag(self, tag, attrs):
self.data_interrupt = True
if tag == "table":
self.first_row = True
self.first_cell = True
elif tag == "tr":
if not self.first_row:
sys.stdout.write(self.row_delim)
self.first_row = False
self.first_cell = True
self.data_interrupt = False
elif tag == "td" or tag == "th":
if not self.first_cell:
sys.stdout.write(self.cell_delim)
self.first_cell = False
self.data_interrupt = False
self.in_cell = True
def handle_endtag(self, tag):
self.data_interrupt = True
if tag == "td" or tag == "th":
self.in_cell = False
def handle_data(self, data):
if self.in_cell:
#if self.data_interrupt:
# sys.stdout.write(" ")
sys.stdout.write(self.despace_re.sub(' ', data).strip())
self.data_interrupt = False
parser = HTMLTableParser()
parser.feed(sys.stdin.read())
Upvotes: 13
Reputation: 2820
Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/
require 'nokogiri'
doc = Nokogiri::HTML(table_string)
doc.xpath('//table//tr').each do |row|
row.xpath('td').each do |cell|
print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
end
print "\n"
end
Worked for my basic test case.
Upvotes: 26
Reputation: 52506
Here's a method that uses pup and jq.
Assuming that infile.html
contains one <table>
element, we can select its rows using pup, and convert to JSON:
pup 'table tr json{}' --file infile.html
This returns an array of objects, with a children
array for each row. For an example with a header row, two data rows, and three columns:
[
{
"children": [
{ "tag": "th", "text": "ID" },
{ "tag": "th", "text": "First name" },
{ "tag": "th", "text": "Last name" }
],
"tag": "tr"
},
{
"children": [
{ "tag": "td", "text": "123" },
{ "tag": "td", "text": "Anna" },
{ "tag": "td", "text": "Alphabet" }
],
"tag": "tr"
},
{
"children": [
{ "tag": "td", "text": "456" },
{ "tag": "td", "text": "Brandon" },
{ "tag": "td", "text": "Betazoid" }
],
"tag": "tr"
}
]
To convert that to CSV, we can use jq (see snippet):
pup 'table tr json{}' --file infile.html \
| jq --raw-output 'map(.children | map(.text))[] | @csv'
resulting in
"ID","First name","Last name"
"123","Anna","Alphabet"
"456","Brandon","Betazoid"
Upvotes: 5
Reputation: 316
you can convert html to csv using libreoffice or sed
libreoffice:
mkdir in out
cp -v *.html in
rename 's/([^.]+).html/$1.xls/g' in/*.html
## 59 is ;
## 44 is ,
libreoffice --convert-to 'csv:Text - txt - csv (StarCalc):59,,0,3' in/*.xls --outdir out
or sed:
mkdir out
cp -v *.html out
sed -i ':a;N;$!ba
s/<html.\+<table[^>]\+>//Ig
s#\s*</td>\s*</tr>\s*<tr>\s*<td>\s*#\n#Ig
s#\s*</td>\s*<td>\s*#;#Ig
s/<[^>]\+>//g;s/\s\{2,\}//g' out/*.html
rename 's/([^.]+).html/$1.csv/g' out/*.html
An example can be found in the online bash "sandbox": https://onlinegdb.com/1oivp0uGm
Upvotes: 1
Reputation: 1
Depending on what you need you can simply:
var table ='';var selector='#customers';
document.querySelectorAll(`${selector} tr th`).forEach(h=>table+=`${h.innerText.trim()};`);table=table.trim();table+='\r\n';
document.querySelectorAll(`${selector} tr`).forEach(tr=>{tr.querySelectorAll('td').forEach(td=>table+=`${td.innerText.trim()};`);table+='\r\n';});
change "selector" to target your table and after executing "table" will have the contents of your csv
aditionally you can:
var a = document.createElement('a');a.href=`data:text/csv;base64,${btoa(table)}`;a.download="table.csv";a.click();
to download the contents of "table"
Upvotes: 0
Reputation: 47611
CSV
and nokogiri
to Output to .csv
.Based on @audiodude's answer but modified in the following ways:
CSV
's built-in library for converting an Array
into a CSV row..csv
file instead of just printing to STDOUT
.th
) and the table body (td
).# Convert HTML table to CSV format.
require "nokogiri"
html_file_path = ""
html_string = File.read( html_file_path )
doc = Nokogiri::HTML( html_string )
CSV.open( Rails.root.join( Time.zone.now.to_s( :file ) + ".csv" ), "wb" ) do |csv|
doc.xpath( "//table//tr" ).each do |row|
csv << row.xpath( "th|td" ).collect( &:text ).collect( &:strip )
end
end
Upvotes: 0
Reputation: 161
Assuming that you've designed an HTML page containing a table
, I would recommend this solution. Worked like charm for me:
$(document).ready(() => {
$("#buttonExport").click(e => {
// Getting values of current time for generating the file name
const dateTime = new Date();
const day = dateTime.getDate();
const month = dateTime.getMonth() + 1;
const year = dateTime.getFullYear();
const hour = dateTime.getHours();
const minute = dateTime.getMinutes();
const postfix = `${day}.${month}.${year}_${hour}.${minute}`;
// Creating a temporary HTML link element (they support setting file names)
const downloadElement = document.createElement('a');
// Getting data from our `div` that contains the HTML table
const dataType = 'data:application/vnd.ms-excel';
const tableDiv = document.getElementById('divData');
const tableHTML = tableDiv.outerHTML.replace(/ /g, '%20');
// Setting the download source
downloadElement.href = `${dataType},${tableHTML}`;
// Setting the file name
downloadElement.download = `exported_table_${postfix}.xls`;
// Trigger the download
downloadElement.click();
// Just in case, prevent default behaviour
e.preventDefault();
});
});
Courtesy: http://www.kubilayerdogan.net/?p=218
You can edit the file format to .csv
here:
downloadElement.download = `exported_table_${postfix}.csv`;
Upvotes: 6
Reputation: 894
Here a simple solution without any external lib:
https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/
It works for me without any issue
Upvotes: 5
Reputation: 131
Just to add to these answers (as i've recently been attempting a similar thing) - if Google spreadsheets is your spreadsheeting program of choice. Simply do these two things.
1. Strip everything out of your html file around the Table opening/closing tags and resave it as another html file.
2. Import that html file directly into google spreadsheets and you'll have your information beautifully imported (Top tip: if you used inline styles in your table, they will be imported as well!)
Saved me loads of time and figuring out different conversions.
Upvotes: 13
Reputation: 1936
Sorry for resurrecting an ancient thread, but I recently wanted to do this, but I wanted a 100% portable bash script to do it. So here's my solution using only grep and sed.
The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...
curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n' | sed 's/<\/TR[^>]*>/\n/Ig' | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.
Here's the explanation:
Get the Contents of the URL using cURL, dump stderr to null (no progress meter)
curl "http://www.webpagewithtableinit.com/" 2>/dev/null
.
I only want Table elements (return only lines with TABLE,TR,TH,TD tags)
| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'
.
Remove any Whitespace at the beginning of the line.
| sed 's/^[\ \t]*//g'
.
Remove newlines
| tr -d '\n\r'
.
Replace </TR>
with newline
| sed 's/<\/TR[^>]*>/\n/Ig'
.
Remove TABLE and TR tags
| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'
.
Remove ^<TD>
, ^<TH>
, </TD>$
, </TH>$
| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'
.
Replace </TD><TD>
with comma
| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
.
Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.
Hope this helps someone!
Upvotes: 33
Reputation: 41644
With Perl you can use the HTML::TableExtract
module to extract the data from the table and then use Text::CSV_XS
to create a CSV file or Spreadsheet::WriteExcel
to create an Excel file.
Upvotes: 5
Reputation: 11
OpenOffice.org can view HTML tables. Simply use the open command on the HTML file, or select and copy the table in your browser and then Paste Special in OpenOffice.org. It will query you for the file type, one of which should be HTML. Select that and voila!
Upvotes: 1
Reputation: 1853
I'm not sure if there is pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV
and HTML::Parser
.
Upvotes: 6
Reputation: 5176
here's a few options
http://groups.google.com/group/ruby-talk-google/browse_thread/thread/cfae0aa4b14e5560?hl=nn
http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
How can I scrape an HTML table to CSV?
https://addons.mozilla.org/en-US/firefox/addon/1852
Upvotes: 2
Reputation: 15128
This method is not really a library OR a program, but for ad hoc conversions you can
I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.
But you probably would prefer a Perl or Ruby script...
Upvotes: 83