Gilthoniel

Reputation: 47

How to retrieve specific information from a file downloaded from a website using R

I'm trying to download DNA sequences from a webpage into a FASTA file. I'm downloading the HTML page and am having trouble keeping just the FASTA content without the HTML. For the non-bioinformaticians out there, a FASTA file looks like this:

>DNAsequencename

ACTGCGATGCGATGCAGCTAGCTGACG

(where the ACTG section is the DNA sequence)

I couldn't figure out how to pull out just the lines I wanted, so I tried a workaround: using read.fasta() to read the webpage data as a FASTA file. That works except for the very last line, which always contains a non-DNA sentence no matter what I do. I've tried regex substitutions and grep to keep only what I want or remove what I don't want, but none have worked so far, and I don't know what I'm doing wrong.

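library(stringr)  # str_replace_all()
library(seqinr)   # read.fasta(), write.fasta()

# Download the raw HTML page to a local text file
download.file("http://www.ng-mast.net/sql/fasta.asp?allele=POR",
              "webpage.txt", "auto", quiet = FALSE, mode = "w",
              cacheOK = TRUE, headers = NULL)
lines <- readLines(con = "webpage.txt", encoding = "UTF-8")
# Turn the HTML-escaped "&gt;POR" (and anything before it on the line) into a FASTA header prefix
fastadpor <- str_replace_all(lines, "[:print:]*&gt;POR",
    ">POR_")
writeLines(fastadpor, con = "portemp.fasta")
# Re-read the cleaned text as FASTA, then write the final file
newfasta <- read.fasta(file = "portemp.fasta", as.string =
    TRUE, forceDNAtolower = FALSE)
write.fasta(sequences = newfasta, names = names(newfasta),
    file.out = "por.fasta")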

The output file contains "global sequence and ST database" at the end of it, and I don't know how to get rid of it.

Upvotes: 1

Views: 51

Answers (1)

Brian

Reputation: 8275

It's easiest if you scrape only the desired section of the webpage, not the whole thing. This can be done with a package like rvest, which lets you select certain HTML elements.

library(rvest)

allele <- 
  read_html("http://www.ng-mast.net/sql/fasta.asp?allele=POR") %>% 
  html_node("textarea") %>% 
  html_text()


writeLines(allele, "fasta.txt")
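
If you would rather stay in base R and filter the lines you already downloaded, something like the following might work. This is only a rough sketch: the `page` vector is a made-up stand-in for the downloaded file (the real page's markup is an assumption), and `extract_fasta()` is a hypothetical helper, not part of any package.

```r
# Hypothetical sample of the downloaded page; the real markup may differ.
page <- c(
  "<html><body>",
  "<textarea>&gt;POR1",
  "ACTGCGATGC",
  "&gt;POR2",
  "GGCTAGCTGA</textarea>",
  "<p>global sequence and ST database</p>",
  "</body></html>"
)

# Hypothetical helper: keep only FASTA header and sequence lines.
extract_fasta <- function(lines) {
  # Strip HTML tags, then decode the escaped '>' that marks FASTA headers.
  text <- gsub("<[^>]+>", "", lines)
  text <- gsub("&gt;", ">", text, fixed = TRUE)
  text <- trimws(text)
  # Keep lines starting with '>' or made up only of A/C/G/T/N (or gap) characters.
  # Caveat: a short stray word built only from those letters would also pass.
  keep <- grepl("^>", text) | grepl("^[ACGTNacgtn-]+$", text)
  text[keep]
}

fasta <- extract_fasta(page)
```

In practice you would pass in `readLines("webpage.txt")` instead of `page` and then call `writeLines(fasta, "por.fasta")`. Selecting the element with rvest is still the more robust route, since it does not depend on guessing which lines look like DNA.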

Upvotes: 1
