user4687531

Reputation: 1111

xpath parse table in bash

I have an HTML table that I would like to parse with bash. (Note: I have done this with R, but would like to attempt it in bash so it integrates easily with another shell script.)

The table can be obtained from the URL below: http://faostat.fao.org/site/384/default.aspx

From reviewing the page source, the XPath reference for the specific table is:

//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]

How can I parse this table out to a CSV file directly from bash?

I tried the following:

curl "http://faostat.fao.org/site/384/default.aspx" | xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]' > test.txt

This only produces an empty test.txt.

Could anyone please help me parse out the HTML table using XPath in bash and create a CSV file from it?

Any help appreciated.

Upvotes: 1

Views: 948

Answers (1)

sideshowbarker

Reputation: 87984

//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/tr (that is, the XPath expression from your question with /tr appended) will grab just the rows and skip the table wrapper, which you don’t need to do anything with in your output.
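
Before adding any post-processing, it can help to eyeball the raw markup that XPath expression actually matches (a quick check, assuming the xmllint that ships with libxml2 is installed):

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | head -n 20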

You then also need to pipe that xmllint --xpath output through sed or perl to strip the remaining markup:

Example: perl version
wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | perl -pe 's/<tr[^>]+>//' \
   | perl -pe 's/<\/tr>//' \
   | perl -pe 's/^\s+<t[dh][^>]*>//' \
   | perl -pe 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | perl -pe 's/<\/t[dh]>//' \
   | grep -v '^\s*$'
Example: sed version
wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | sed -E 's/<\/t[dh]>//' \
   | grep -v '^\s*$'

In both cases, the grep -v '^\s*$' is there just to remove blank lines.
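
As a side note, the five separate substitution passes in both examples can be collapsed into a single perl invocation that performs the same cleanup with less process overhead:

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | perl -pe 's/<tr[^>]+>//; s/<\/tr>//; s/^\s+<t[dh][^>]*>//; s/<\/t[dh]><t[dh][^>]*>/|/g; s/<\/t[dh]>//' \
   | grep -v '^\s*$'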

It’s not strictly CSV: it separates the fields/cells with a | (pipe) character rather than a comma, because some (many) of the fields themselves contain commas and quotation marks. If you really need true CSV, scroll down and read How to generate true CSV below.
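
A side benefit of the pipe-delimited output is that it’s easy to consume with standard tools. For example, assuming you’ve saved the output to a (hypothetical) items.txt file, you can pull out individual columns with awk:

# Fields: 1=Group Name, 2=Item FAO Code, 3=Item HS+ Code, 4=Item Name, 5=Definition
awk -F'|' '{ print $2 " - " $4 }' items.txt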


Using python and lxml instead

As an alternative to xmllint --xpath, you can use Python and the lxml.html library:

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | python3 -c "import lxml.html as html; import sys; \
       expr = sys.argv[1]; print('\n'.join(html.tostring(el, encoding='unicode') \
       for el in html.parse(sys.stdin).xpath(expr)))" \
       '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]//tr' \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | sed -E 's/<\/t[dh]>//' \
   | grep -v '^\s*$'

Using the column and colrm commands to format the output

If you want a pretty-printed/formatted column/table view of the results to read in the console and scroll/page through, pipe the output further into the column and colrm commands, like this:

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/*' - \
     2>/dev/null \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>//' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/|/g' \
   | sed -E 's/<\/t[dh]>//' \
   | grep -v '^\s*$' \
   | column -t -s '|' \
   | colrm 14 21 | colrm 20 28 | colrm 63 95 | colrm 80

That will give you results like the output just below:

Results from formatting with column and colrm
Group Name         Item FAO Code    Item HS+ Code    Item Name      Definition
Crops              800              5304_c           Agave fib      Including int
Crops              221              0802.11_a        Almonds,       Prunus amygda
Crops              711              0909             Anise, ba      Include: anis
Crops              515              0808.10_a        Apples         Malus pumila;
Crops              526              0809.10_a        Apricots       Prunus armeni
…

Alternatively, you could use the cut command rather than colrm to get the same formatting.
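
For example, colrm and cut -c express the same column ranges: colrm 14 21 corresponds to cut -c 1-13,22-, and colrm 80 (with no second argument) corresponds to cut -c 1-79. A quick self-contained check of that equivalence:

# Both commands drop columns 4 through 6 and print "abcghij"
printf 'abcdefghij\n' | colrm 4 6
printf 'abcdefghij\n' | cut -c 1-3,7-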


How to generate true CSV

If, rather than pretty-printed/formatted output like the above, you really do want real CSV, then you must also emit quotes around the fields and CSV-escape any existing quotes inside the fields, like this:

Example: true CSV output
wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | xmllint --html \
     --xpath '//*[@id="ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting"]/tr' - \
     2>/dev/null \
   | sed -E 's/"/""/g' \
   | sed -E 's/<tr[^>]+>//' \
   | sed -E 's/<\/tr>//' \
   | sed -E 's/^[[:space:]]+<t[dh][^>]*>/"/' \
   | sed -E 's/<\/t[dh]><t[dh][^>]*>/","/g' \
   | sed -E 's/<\/t[dh]>/"/' \
   | grep -v '^\s*$'

Tools that consume CSV expect any quotation-mark character inside a field to be escaped as two quotation-mark characters together (per RFC 4180); e.g., as the word ""fufu"" is below.

  "In West Africa they are consumed mainly as ""fufu"", a stiff glutinous dough."

So the sed -E 's/"/""/g' part of the code snippet above does that.
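
You can sanity-check that escaping step on its own (the sample string here is just for illustration):

printf '%s\n' 'consumed mainly as "fufu", a stiff glutinous dough' \
   | sed -E 's/"/""/g'
# prints: consumed mainly as ""fufu"", a stiff glutinous dough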

CSV output from the above example
"Group Name","Item FAO Code","Item HS+ Code","Item Name ","Definition"
"Crops","800","5304_c","Agave fibres nes","Including inter alia: Haiti hemp…"
"Crops","221","0802.11_a","Almonds, with shell","Prunus amygdalus; P. communis…"
"Crops","711","0909","Anise, badian, fennel, coriander","Include: anise…"

Disclaimer: You should avoid doing regexp-based processing of HTML/XML

(Obligatory disclaimer) All that said, many people will tell you that regexp-based HTML/XML processing is kludgy and error-prone. And it is, so use the above approach with caution (if at all).

If you have time to do it right, use a good Web-scraping library instead, or use Python+lxml to actually process the results returned from evaluating the XPath expression (rather than stringifying them), or use xsltproc or some other XSLT engine.
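
For instance, here’s a minimal sketch of that Python+lxml route, which skips the regex post-processing entirely and writes real CSV via Python’s csv module (it assumes python3 with lxml installed; the items.csv file name is just for illustration):

wget -q -O - "http://faostat.fao.org/site/384/default.aspx" \
   | python3 -c '
import csv
import sys
import lxml.html as html

doc = html.parse(sys.stdin)
rows = doc.xpath(
    "//*[@id=\"ctl03_DesktopThreePanes1_ThreePanes_ctl01_MDlisting\"]//tr")
writer = csv.writer(sys.stdout)
for row in rows:
    # text_content() flattens each cell to plain text, and csv.writer
    # handles all CSV quoting/escaping, so no regexes are needed
    writer.writerow([cell.text_content().strip()
                     for cell in row.xpath("td|th")])
' > items.csv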

But if you only need something quick-n-dirty from the command line, the above regexp-based pipelines get the job done. They’re brittle, though, so don’t be shocked if some part of the output breaks in some unexpected way. If you want something robust for HTML/XML, don’t use a regexp-based approach.

Upvotes: 2
