Reputation: 101
I know this topic has been covered a few times, but I could not find a case that applies to mine. I am not an experienced computer user, so please keep that in mind, although I can play with bash and R, and possibly run a Perl script too. FYI: I run Ubuntu on my machine.
What I would like to do is convert the expandable list on the following web page, http://www.genome.jp/kegg-bin/get_htext?br08902.keg (please expand it completely using the "one-click mode"), into a tabular or CSV format, where each level of indentation goes into a separate column.
It would also be good to have the parent categories repeated for all the elements grouped below them, something like the table below, which I made manually for the first few lines of the page.
Pathways and Ontologies Pathways br08901 KEGG pathway maps
Pathways and Ontologies Functional hierarchies br08902 BRITE functional hierarchies
Genes and Proteins Orthologs and modules ko00001 KEGG Orthology (KO)
Genes and Proteins Orthologs and modules ko00002 KEGG pathway modules
Genes and Proteins Orthologs and modules ko00003 KEGG modules and reaction modules
Genes and Proteins Protein families: metabolism ko01000 Enzymes
Genes and Proteins Protein families: metabolism ko01001 Protein kinases
Genes and Proteins Protein families: metabolism ko01009 Protein phosphatases and associated proteins
Genes and Proteins Protein families: metabolism ko01002 Peptidases
Genes and Proteins Protein families: metabolism ko01003 Glycosyltransferases
Genes and Proteins Protein families: metabolism ko01005 Lipopolysaccharide biosynthesis proteins
Genes and Proteins Protein families: metabolism ko01004 Lipid biosynthesis proteins
Thanks in advance!
Upvotes: 1
Views: 681
Reputation: 63962
This task needs a few well-separated steps. Breakdown:
Getting the content of the page. You can use, for example, curl, wget, fetch, or a similar program. E.g.
curl http://...
will download the page content.
On your page there is a link, "download htext". If you check where it points, you will discover that you need to download from the link
http://www.kegg.jp/kegg-bin/download_htext?htext=br08902.keg&format=htext&filedir=
                                                 ^^^^^^^^^^^ name of the keg you need
so running
curl "http://www.kegg.jp/kegg-bin/download_htext?htext=br08902.keg&format=htext&filedir=" > mykeg.txt
will get you a file that looks like this (shortened):
+C Br number
#<h2><a href="/kegg/kegg2.html"><img src="/Fig/bget/kegg3.gif" align="middle" border=0></a> BRITE Functional Hierarchies</h2>
#<!---
#ENTRY br08902
#NAME Brite
#DEFINITION BRITE functional hierarchies
#--->
!
A<b>Pathways and Ontologies</b>
B Pathways
C br08901 KEGG pathway maps
B Functional hierarchies
C br08902 BRITE functional hierarchies
#
A<b>Genes and Proteins</b>
B Orthologs and modules
C ko00001 KEGG Orthology (KO)
C ko00002 KEGG pathway modules
It is a nice text file, mostly without HTML markup. Easily parseable with common bash tools.
First some cleaning up:
remove all unwanted lines with the sed command
sed '/^[#!+]/d'
remove the unwanted HTML markup (generally impossible with regexes, but possible in this case)
sed 's/<[^>]*>//g'
add a space as a delimiter after the leading character
sed 's/^./& /'
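The three cleaning steps can also be chained into a single sed call. The following is only a minimal sketch: clean_keg is a hypothetical helper name, and the sample input is a few lines taken from the shortened download shown earlier.

```bash
#!/usr/bin/env bash
# clean_keg is a hypothetical helper name; it reads htext on stdin.
clean_keg() {
    # 1) drop lines starting with #, ! or +
    # 2) strip anything that looks like an HTML tag
    # 3) put a space after the leading level letter
    sed -e '/^[#!+]/d' -e 's/<[^>]*>//g' -e 's/^./& /'
}

# Sample lines from the shortened file shown earlier.
clean_keg <<'EOF'
+C Br number
#ENTRY br08902
!
A<b>Pathways and Ontologies</b>
B Pathways
C br08901 KEGG pathway maps
EOF
```

With the real download you would run it as clean_keg < mykeg.txt instead of feeding it the sample.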
After the above, you get text like the following:
A Pathways and Ontologies
B Pathways
C br08901 KEGG pathway maps
B Functional hierarchies
C br08902 BRITE functional hierarchies
A Genes and Proteins
B Orthologs and modules
C ko00001 KEGG Orthology (KO)
C ko00002 KEGG pathway modules
C ko00003 KEGG modules and reaction modules
which has a nice structure that is easy to parse with bash:
while read -r prefix content
do
    echo "do something with a line >>$content<< with a prefix >>$prefix<<"
done
You can test the prefix, for example with the case command, like:
case "$prefix" in
    A) a="$content" ;;
    B) b="$content" ;;
    C) c="$content" ;;
esac
A nicer alternative exists using associative arrays, but the above is simple and works...
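For the curious, the associative-array variant might look roughly like this. It is a sketch under assumptions, not a tested solution for every keg file: keg_to_tsv is a hypothetical function name, it needs bash 4 or newer, and it assumes C is the deepest level, as in this file.

```bash
#!/usr/bin/env bash
# Requires bash 4+ for associative arrays.
# keg_to_tsv is a hypothetical name; it expects cleaned
# "PREFIX content" lines on stdin.
keg_to_tsv() {
    local -A level=()      # level[A], level[B], level[C] hold the current entries
    local prefix content
    while read -r prefix content; do
        [[ -n $prefix ]] || continue      # skip blank lines
        level[$prefix]=$content
        # Assumption: C is the deepest level, so print one row per C line.
        if [[ $prefix == C ]]; then
            printf '%s\t%s\t%s\n' "${level[A]}" "${level[B]}" "${level[C]}"
        fi
    done
}

keg_to_tsv <<'EOF'
A Pathways and Ontologies
B Pathways
C br08901 KEGG pathway maps
B Functional hierarchies
C br08902 BRITE functional hierarchies
EOF
```

The array saves you one variable per level, which only starts to pay off when a file has more levels than the three used here.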
You now have all the information you need to make a working solution (in 8 lines).
The rest is up to you... ;)
Usually I don't do the whole work, because Stack Overflow is not a free programming service, but OK, here is the script:
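kegfile="mykeg.txt"   # the file saved by the curl command above
while read -r prefix content
do
    case "$prefix" in
        A) col1="$content" ;;
        B) col2="$content" ;;
        # printf does not interpret backslashes in the data, unlike echo -e
        C) printf '%s\t%s\t%s\n' "$col1" "$col2" "$content" ;;
    esac
done < <(sed '/^[#!+]/d;s/<[^>]*>//g;s/^./& /' < "$kegfile")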
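If you would rather have the identifier (br08901, ko00001, ...) in its own column, a second read can split it off. This is only a sketch: keg_to_columns is a hypothetical name, and it assumes that the first word of every C line is the identifier.

```bash
#!/usr/bin/env bash
# keg_to_columns is a hypothetical name; it reads the raw htext on stdin
# and writes four tab-separated columns: level A, level B, identifier, name.
keg_to_columns() {
    sed '/^[#!+]/d;s/<[^>]*>//g;s/^./& /' |
    while read -r prefix content; do
        case "$prefix" in
            A) col1=$content ;;
            B) col2=$content ;;
            C) # Assumption: the first word of a C line is the identifier.
               read -r id name <<< "$content"
               printf '%s\t%s\t%s\t%s\n' "$col1" "$col2" "$id" "$name" ;;
        esac
    done
}

keg_to_columns <<'EOF'
#ENTRY br08902
!
A<b>Genes and Proteins</b>
B Orthologs and modules
C ko00001 KEGG Orthology (KO)
C ko00002 KEGG pathway modules
EOF
```

With the download saved as mykeg.txt, it could be called as keg_to_columns < mykeg.txt > mykeg.tsv (the output name is just an example).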
Upvotes: 3