ChristianD

Reputation: 101

Convert an HTML expandable list into tabular or csv format

I know this topic has been covered a few times, but I could not find a case that applies to mine. I am not an experienced computer user, so please keep that in mind, although I can play with bash and R, and possibly run a Perl script too. FYI: I run Ubuntu on my machine.

What I would like to do is convert the expandable list on the following web page, http://www.genome.jp/kegg-bin/get_htext?br08902.keg (please expand it completely using the "one-click mode"), into a tabular or CSV format, where each level of indentation goes into a separate column.

It would also be fine to have the parent categories repeated for all the elements grouped below them. Something like the table below, which I made manually for the first few lines of the page.

Pathways and Ontologies    Pathways                        br08901  KEGG pathway maps
Pathways and Ontologies    Functional hierarchies          br08902  BRITE functional hierarchies
Genes and Proteins         Orthologs and modules           ko00001  KEGG Orthology (KO)
Genes and Proteins         Orthologs and modules           ko00002  KEGG pathway modules
Genes and Proteins         Orthologs and modules           ko00003  KEGG modules and reaction modules
Genes and Proteins         Protein families: metabolism    ko01000  Enzymes
Genes and Proteins         Protein families: metabolism    ko01001  Protein kinases
Genes and Proteins         Protein families: metabolism    ko01009  Protein phosphatases and associated proteins
Genes and Proteins         Protein families: metabolism    ko01002  Peptidases
Genes and Proteins         Protein families: metabolism    ko01003  Glycosyltransferases
Genes and Proteins         Protein families: metabolism    ko01005  Lipopolysaccharide biosynthesis proteins
Genes and Proteins         Protein families: metabolism    ko01004  Lipid biosynthesis proteins

Thanks in advance!

Upvotes: 1

Views: 681

Answers (1)

clt60

Reputation: 63962

This task needs a few well-separated steps. Breakdown:

Getting the content of the page. You can use, for example, curl, wget, fetch, or a similar program. E.g.

curl http://...

will download the page content.

On your page there is a "download htext" link. If you check where it points, you will discover that you need to download from the link

http://www.kegg.jp/kegg-bin/download_htext?htext=br08902.keg&format=htext&filedir=
                                                 ^^^^^^^^^^^ name of your needed keg

so after running

curl "http://www.kegg.jp/kegg-bin/download_htext?htext=br08902.keg&format=htext&filedir=" > mykeg.txt

you will get a file that looks like this (shortened):

+C      Br number
#<h2><a href="/kegg/kegg2.html"><img src="/Fig/bget/kegg3.gif" align="middle" border=0></a>&nbsp; BRITE Functional Hierarchies</h2>
#<!---
#ENTRY       br08902
#NAME        Brite
#DEFINITION  BRITE functional hierarchies
#--->
!
A<b>Pathways and Ontologies</b>
B  Pathways
C    br08901  KEGG pathway maps
B  Functional hierarchies
C    br08902  BRITE functional hierarchies
#
A<b>Genes and Proteins</b>
B  Orthologs and modules
C    ko00001  KEGG Orthology (KO)
C    ko00002  KEGG pathway modules

It is a nice text file, mostly without HTML markup, and easily parseable with common bash tools.

First some cleaning up:

remove all unwanted lines (those starting with #, ! or +) with the sed command

sed '/^[#!+]/d'

remove the unwanted HTML markup (generally impossible with regexes, but possible in this simple case)

sed 's/<[^>]*>//g'

add a space after the leading character, so that it becomes a separate field

sed 's/^./& /'
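The three expressions above can also be chained into a single sed call. This is a minimal sketch: clean_keg is a made-up helper name used only for illustration, and the inline sample just mimics the first lines of the downloaded file.

```shell
#!/usr/bin/env bash
# clean_keg is a hypothetical helper name; it reads htext on stdin.
# Expression 1: drop lines starting with #, ! or +
# Expression 2: strip the simple HTML tags
# Expression 3: put a space after the leading level letter
clean_keg() {
    sed -e '/^[#!+]/d' -e 's/<[^>]*>//g' -e 's/^./& /'
}

# Small inline sample so the sketch runs without downloading anything.
clean_keg <<'EOF'
+C      Br number
#ENTRY       br08902
!
A<b>Pathways and Ontologies</b>
B  Pathways
C    br08901  KEGG pathway maps
EOF

# With the real file (mykeg.txt is the name used in the curl example):
# clean_keg < mykeg.txt
```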

After the above, you get text like this:

A Pathways and Ontologies
B   Pathways
C     br08901  KEGG pathway maps
B   Functional hierarchies
C     br08902  BRITE functional hierarchies
A Genes and Proteins
B   Orthologs and modules
C     ko00001  KEGG Orthology (KO)
C     ko00002  KEGG pathway modules
C     ko00003  KEGG modules and reaction modules

which has a nice structure that is easy to parse with bash:

while read -r prefix content
do
     echo "do something with a line >>$content<< with a prefix >>$prefix<<"
done

You can test the prefix with, for example, the case command:

case "$prefix" in
    A) a="$content" ;;
    B) b="$content" ;;
    C) c="$content" ;;
esac

There is a nicer alternative using associative arrays, but the above is simple and works.

You now have all the information you need to make a working solution (in 8 lines).
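For the curious, here is a minimal sketch of what the associative-array variant could look like. It assumes bash 4 or newer (for local -A), and keg_to_tsv is just an illustrative name. It expects the cleaned "PREFIX content" lines on stdin.

```shell
#!/usr/bin/env bash
# Requires bash 4+ for associative arrays.
# keg_to_tsv is a hypothetical name, not part of any KEGG tool.
keg_to_tsv() {
    local -A level=()
    local prefix content
    while read -r prefix content; do
        [ -n "$prefix" ] || continue      # skip blank lines
        level[$prefix]=$content           # remember the latest entry at this level
        if [ "$prefix" = C ]; then        # C is the deepest level: emit one row
            printf '%s\t%s\t%s\n' "${level[A]}" "${level[B]}" "${level[C]}"
        fi
    done
}

# Inline sample taken from the cleaned text shown earlier.
keg_to_tsv <<'EOF'
A Pathways and Ontologies
B   Pathways
C     br08901  KEGG pathway maps
B   Functional hierarchies
C     br08902  BRITE functional hierarchies
EOF
```

The array is indexed by the level letter, so a file with deeper levels would only need a different test for the deepest prefix and more fields in the printf.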

The next is up to you... ;)

Edit

Usually I don't do the whole work, because Stack Overflow is not a free programming service, but OK, here is the script:

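kegfile="mykeg.txt"   # the file downloaded with curl above
while read -r prefix content
do
    case "$prefix" in
        A) col1="$content" ;;   # top-level category
        B) col2="$content" ;;   # subcategory
        C) printf '%s\t%s\t%s\n' "$col1" "$col2" "$content" ;;   # one tab-separated row per entry
    esac
done < <(sed '/^[#!+]/d;s/<[^>]*>//g;s/^./& /' < "$kegfile")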
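If CSV is preferred over tab-separated output, the same loop can be adapted. The following is only a sketch: csv_field and keg_to_csv are made-up names, and it assumes that every C-level line has the form "ID  name", so the ID can be split off into its own column.

```shell
#!/usr/bin/env bash
# csv_field and keg_to_csv are hypothetical helper names.

# Quote one CSV field, doubling any embedded double quotes.
csv_field() {
    local s=$1 q='"'
    s=${s//$q/$q$q}
    printf '"%s"' "$s"
}

# Reads cleaned "PREFIX content" lines on stdin, writes 4-column CSV.
keg_to_csv() {
    local prefix content id name col1="" col2=""
    while read -r prefix content; do
        case "$prefix" in
            A) col1=$content ;;
            B) col2=$content ;;
            C) # Assumption: C-level content is "ID  name".
               read -r id name <<< "$content"
               printf '%s,%s,%s,%s\n' "$(csv_field "$col1")" "$(csv_field "$col2")" \
                      "$(csv_field "$id")" "$(csv_field "$name")" ;;
        esac
    done
}

# Inline sample taken from the cleaned text shown earlier.
keg_to_csv <<'EOF'
A Genes and Proteins
B   Orthologs and modules
C     ko00001  KEGG Orthology (KO)
EOF

# With the real file (mykeg.txt is the name used in the curl example):
# sed '/^[#!+]/d;s/<[^>]*>//g;s/^./& /' mykeg.txt | keg_to_csv > mykeg.csv
```

Every field is quoted, so names that contain commas should still load as a single column in R or a spreadsheet.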

Upvotes: 3
