Reputation: 47
I'm trying to download DNA sequences from a webpage into a FASTA file. I'm downloading the HTML webpage and am having trouble extracting just the FASTA information without the HTML. For the non-bioinformaticians out there, a FASTA file looks like this:
>DNAsequencename
ACTGCGATGCGATGCAGCTAGCTGACG
(where the ACTG section is the DNA sequence)
I couldn't figure out how to pull out just the lines I wanted, so I tried a workaround: using read.fasta() to read the webpage data as a FASTA file. That works except for the very last line, which always contains a non-DNA sentence no matter what I do. I've tried regex substitutions and grep to keep only what I want or remove what I don't want, and none have worked so far, so I don't know what I'm doing wrong.
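library(stringr)  # str_replace_all()
library(seqinr)   # read.fasta(), write.fasta()

# Save the whole HTML page to disk and read it back line by line
download.file("http://www.ng-mast.net/sql/fasta.asp?allele=POR",
              "webpage.txt", "auto", quiet = FALSE, mode = "w",
              cacheOK = TRUE, headers = NULL)
lines <- readLines(con = "webpage.txt", encoding = "UTF-8")
# Drop any printable text in front of each ">POR" header and rename it ">POR_"
fastadpor <- str_replace_all(lines, "[:print:]*>POR", ">POR_")
writeLines(fastadpor, con = "portemp.fasta")
# Round-trip through seqinr to get a cleaned-up FASTA file
newfasta <- read.fasta(file = "portemp.fasta", as.string = TRUE,
                       forceDNAtolower = FALSE)
write.fasta(sequences = newfasta, names = names(newfasta),
            file.out = "por.fasta")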
The output file contains "global sequence and ST database" at the end of it, and I don't know how to get rid of it.
Upvotes: 1
Views: 51
Reputation: 8275
It's easiest if you scrape only the desired section of the webpage, not the whole thing. This can be done with a package like rvest, which lets you select specific HTML elements.
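library(rvest)

# Select only the <textarea> element and pull out its text,
# rather than saving the whole HTML page
allele <-
  read_html("http://www.ng-mast.net/sql/fasta.asp?allele=POR") %>%
  html_node("textarea") %>%
  html_text()
writeLines(allele, "fasta.txt")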
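If stray non-sequence lines still end up in the text (as with the readLines() approach in the question), one possible fallback is to keep only the lines that look like FASTA. This is a minimal base R sketch, not a tested fix for this particular page: the sample input, the set of allowed nucleotide letters (A, C, G, T, N), and the output file name are all assumptions made for illustration.

```r
# Hypothetical sample standing in for text pulled from the page;
# real input would come from html_text() or readLines().
raw <- c(
  ">POR1",
  "ACTGCGATGCGATGCAGCTAGCTGACG",
  "",
  ">POR2",
  "ACTGCGATGCGATGCAGCTAGCTGACC",
  "global sequence and ST database"
)

# Split on newlines in case the input arrives as one long string,
# then trim surrounding whitespace from every line.
lines <- trimws(unlist(strsplit(raw, "\r?\n")))

# Keep header lines (starting with ">") and lines made up only of
# nucleotide codes. The allowed letters are an assumption about the data.
is_header   <- grepl("^>", lines)
is_sequence <- grepl("^[ACGTNacgtn]+$", lines)
fasta_lines <- lines[is_header | is_sequence]

# Write the filtered records to a temporary file (hypothetical name).
out_file <- file.path(tempdir(), "por_filtered.fasta")
writeLines(fasta_lines, out_file)
```

Filtering on what a valid line looks like tends to be more robust than trying to match and delete each unwanted sentence, but the character class would need widening if the sequences use other IUPAC ambiguity codes.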
Upvotes: 1