Reputation: 1111
I have an HTML table which I would like to parse with bash (note: I have done this in R, but would like to attempt it in bash to integrate easily with another shell script).
The table can be obtained from the url below: http://faostat.fao.org/site/384/default.aspx
By reviewing the source, the XPath reference for the specific table is:
//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]
How can I parse out this table to a CSV file directly from bash?
I tried the following:
curl "http://faostat.fao.org/site/384/default.aspx" | xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]' > test.txt
This only returns blank text in test.txt.
Could anyone help me parse out the HTML table using XPath in bash and create a CSV file from it?
Any help appreciated.
Upvotes: 1
Views: 948
Reputation: 87984
//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/tr
(that is, with /tr
appended to the XPath expression you had in your question) will grab just each row, and skip the table
wrapper (which you don’t need to do anything with in your output).
Then you also need to pipe that xmllint --xpath
output through sed
or perl
or something:
wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| xmllint --html \
--xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
2>/dev/null \
| perl -pe 's/<tr[^>]+>//' \
| perl -pe 's/<\/tr>//' \
| perl -pe 's/^\s+<t[dh][^>]*>//' \
| perl -pe 's/<\/t[dh]><t[dh][^>]*>/|/g' \
| perl -pe 's/<\/t[dh]>//' \
| grep -v '^\s*$'
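Incidentally, nothing requires five separate perl processes here; the same substitutions can be chained inside a single -pe expression. A minimal equivalent sketch:

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| xmllint --html \
--xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
2>/dev/null \
| perl -pe 's/<tr[^>]+>//; s/<\/tr>//; s/^\s+<t[dh][^>]*>//; s/<\/t[dh]><t[dh][^>]*>/|/g; s/<\/t[dh]>//' \
| grep -v '^\s*$'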
Example: sed version
wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| xmllint --html \
--xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
2>/dev/null \
| sed -E 's/<tr[^>]+>//' \
| sed -E 's/<\/tr>//' \
| sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
| sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
| sed -E 's/<\/t[dh]>//' \
| grep -v '^\s*$'
In both cases, the grep -v '^\s*$' is there just to remove blank lines.
Note that the output isn't strictly CSV: it separates the fields/cells with a | (pipe) character rather than a comma, because some (many) of the fields themselves contain commas and quotation marks. If you really want true CSV, then scroll down to the Example: true CSV output section below.
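One upside of the pipe separator is that standard tools can split on it directly. For example, assuming you have redirected the output of one of the pipelines above into a file named items.psv (a hypothetical filename, just for illustration), awk can pull out a single column:

# print just the fourth field (the Item Name column) of the pipe-separated output
awk -F'|' '{ print $4 }' items.psv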
As an alternative to xmllint --xpath, you can instead use Python and the lxml.html library:
wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| python3 -c "import lxml.html as html; import sys; \
expr = sys.argv[1]; print('\n'.join(html.tostring(el).decode() \
for el in html.parse(sys.stdin.buffer).xpath(expr)))" \
'//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]//tr' \
| sed -E 's/<tr[^>]+>//' \
| sed -E 's/<\/tr>//' \
| sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
| sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
| sed -E 's/<\/t[dh]>//' \
| grep -v '^\s*$'
Example: column and colrm commands to format the output

If you want a pretty-printed/formatted column/table view of the results to read in the console and scroll/page through, pipe the output further into the column and colrm commands, like this:
wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| xmllint --html \
--xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
2>/dev/null \
| sed -E 's/<tr[^>]+>//' \
| sed -E 's/<\/tr>//' \
| sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
| sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
| sed -E 's/<\/t[dh]>//' \
| grep -v '^\s*$' \
| column -t -s '|' \
| colrm 14 21 | colrm 20 28 | colrm 63 95 | colrm 80
That will give you results like the output just below:
Results from formatting with column and colrm
Group Name Item FAO Code Item HS+ Code Item Name Definition
Crops 800 5304_c Agave fib Including int
Crops 221 0802.11_a Almonds, Prunus amygda
Crops 711 0909 Anise, ba Include: anis
Crops 515 0808.10_a Apples Malus pumila;
Crops 526 0809.10_a Apricots Prunus armeni
…
Alternatively, you could use the cut command rather than colrm to get the same formatting.
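For instance, here is a sketch using GNU cut's --complement option (note that --complement is GNU-specific; BSD cut does not have it):

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| xmllint --html \
--xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
2>/dev/null \
| sed -E 's/<tr[^>]+>//' \
| sed -E 's/<\/tr>//' \
| sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
| sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
| sed -E 's/<\/t[dh]>//' \
| grep -v '^\s*$' \
| column -t -s '|' \
| cut --complement -c 14-21 \
| cut --complement -c 20-28 \
| cut --complement -c 63-95 \
| cut -c 1-79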
If, rather than pretty-printed/formatted output like the above, you really do want real CSV, then you must also emit quotes around the fields and CSV-escape any existing quotes inside the fields, like this:
Example: true CSV output

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| xmllint --html \
--xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/tr' - \
| sed -E 's/"/""/g' \
| sed -E 's/<tr[^>]+>//' \
| sed -E 's/<\/tr>//' \
| sed -E 's/^[[:space:]]+<t[dh][^>]*>/"/' \
| sed -E 's/<\/t[dh]><t[dh][^>]*>/","/g' \
| sed -E 's/<\/t[dh]>/"/' \
| grep -v '^\s*$'
Tools that consume CSV expect any quotation mark inside a quoted field to be escaped as two quotation-mark characters together (per RFC 4180); e.g., as the word ""fufu"" is below.
"In West Africa they are consumed mainly as ""fufu"", a stiff glutinous dough."
The sed -E 's/"/""/g' step in the snippet above is what does that.
"Group Name","Item FAO Code","Item HS+ Code","Item Name ","Definition"
"Crops","800","5304_c","Agave fibres nes","Including inter alia: Haiti hemp…"
"Crops","221","0802.11_a","Almonds, with shell","Prunus amygdalus; P. communis…"
"Crops","711","0909","Anise, badian, fennel, coriander","Include: anise…"
(Obligatory disclaimer) All that said, many people will tell you that regexp-based HTML/XML processing is kludgy and error-prone. And it is, so use the above approach with caution (if at all).
If you have time to do it right, what you should be doing instead is: use a good web-scraping library, or use Python+lxml to actually process the results returned from evaluating the XPath expression (instead of stringifying them), or use xsltproc or some other XSLT engine.
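For the record, a minimal sketch of that more robust route, assuming python3 with lxml installed: evaluate the same XPath, then let Python's csv module do all of the quoting instead of regexes:

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
| python3 -c '
import csv, sys
import lxml.html as html
# parse the page and evaluate the same XPath to get the table rows
doc = html.parse(sys.stdin.buffer)
rows = doc.xpath("//*[@id=\"ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting\"]//tr")
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
for row in rows:
    # text_content() flattens each cell, including any nested markup
    writer.writerow([cell.text_content().strip() for cell in row.xpath("./td|./th")])
'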
But if you only need something quick-n-dirty from the command line, the above gets the job done. It's brittle, though, so don't be shocked if some part of the output breaks in some unexpected way. If you want something robust for HTML/XML, don't use a regexp-based approach.
Upvotes: 2