bgreen

Reputation: 87

[readtext]: download files from the Internet, remove text via stringi, and read the files into quanteda

My aim is to read multiple text files into quanteda, first removing unwanted text that is enclosed within # marks. stringi code has been provided to perform this task; however, when I tried to read the files into quanteda I ran into a problem along the lines of "argument is not an atomic vector; coercing".

In response to a request to provide a reproducible example, I have posted a data sample here: http://home.brisnet.org.au/~bgreen/Data/

When I tried to read the data via readtext I received this error:

[CODE]
> txtdat = readtext("http://home.brisnet.org.au/~bgreen/Data/")
Error in download_remote(file, ignore_missing, cache, verbosity) :
  Remote URL does not end in known extension. Please download the file manually.
[CODE]
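
As a sketch of why the call fails: readtext() can fetch a remote file directly only when the URL ends in a recognised extension, so pointing it at a single .txt file works where the bare directory URL does not. The sub-folder and file names below are hypothetical, for illustration only:

[CODE]
library(readtext)

# works because the URL ends in a known extension (.txt);
# "sub1/example.txt" is a hypothetical path on the site
txtdat <- readtext("http://home.brisnet.org.au/~bgreen/Data/sub1/example.txt")
[CODE]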

Below I have posted code that downloads the 20 files from the site, in the event a reader can't access them via readtext. By default they are saved to a temporary directory under your home folder (~/Temp).

[CODE]
suppressPackageStartupMessages({
  library(rvest)
})

# destination directory, change this at will
dest_dir <- "~/Temp"

# first get the two subfolders from the Data webpage
link <- "http://home.brisnet.org.au/~bgreen/Data/"
page <- read_html(link)
page %>%
  html_elements("a") %>%
  html_text() %>%
  grep("/$", ., value = TRUE) -> sub_folder

# create the relevant disk sub-directories,
# if they do not exist yet
for (subf in sub_folder) {
  d <- file.path(dest_dir, subf)
  if (!dir.exists(d)) {
    success <- dir.create(d)
    msg <- paste("created directory", d, "-", success)
    message(msg)
  }
}

# prepare to download the files
dest_dir <- file.path(dest_dir, sub_folder)
source_url <- paste0(link, sub_folder)

success <- mapply(\(src, dest) {
  # read each Data subfolder and get the file names
  # therein, then lapply 'download.file' to each file name
  pg <- read_html(src)
  pg %>%
    html_elements("a") %>%
    html_text() %>%
    grep("\\.txt$", ., value = TRUE) %>%
    lapply(\(x) {
      s <- paste0(src, x)
      d <- file.path(dest, x)
      tryCatch(
        download.file(url = s, destfile = d),
        warning = function(w) w,
        error = function(e) e
      )
    })
}, source_url, dest_dir)

lengths(success)
[CODE]
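
Once the files are on disk, readtext can read them with local glob patterns, which avoids the remote-URL restriction entirely. A minimal sketch, assuming dest_dir is the vector of local sub-directory paths created above:

[CODE]
library(readtext)
library(quanteda)

# read every .txt file from each local sub-directory in turn
# (readtext accepts a glob pattern), then stack the results;
# readtext objects are data frames, so rbind() combines them
txtdat <- do.call(rbind, lapply(file.path(dest_dir, "*.txt"), readtext))

# build a quanteda corpus from the combined readtext object
corp <- corpus(txtdat)
[CODE]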

I then want to run this code to remove all text between # marks but retain the remaining text:

[CODE]
library("stringi")
library("quanteda")  # needed for tokens()

# x is the character vector of document texts
toks <- stringi::stri_replace_all_regex(x, "#.*#\n{2}", "") |>
  tokens()
[CODE]
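
One likely reason some text survives the replacement: in the ICU regular expressions that stringi uses, . does not match newlines by default, and the greedy .* runs from the first # to the last # it can reach. A hedged alternative, using a non-greedy match with the dot-all flag so each delimited block is removed separately, even across lines:

[CODE]
library("stringi")

# (?s) makes . match newlines; .*? is non-greedy, so each
# #...# block is removed on its own rather than everything
# between the first and last # in the text
cleaned <- stri_replace_all_regex(x, "(?s)#.*?#", "")
[CODE]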

If there is a better way to provide reproducible data (multiple text files), please let me know. The actual data is useful, because I'm not sure the stringi code is removing all the required text. Thanks.

Upvotes: 1

Views: 34

Answers (1)

bgreen

Reputation: 87

This code from Jim Holtman, written years ago, does what I want. The only change I had to make was adding the subfolder names:

[CODE]
library(tidyverse)

# this will read in the text files and delete data between "##"s;
# a new sub-directory 'deleted' will hold the changed files

# directory that holds the '.txt' files
data_path <- "E:/Hanson/Data/"  # *******change to what you need ********

# get files to process
files <- list.files(path = data_path,
                    pattern = "txt$")

# directory to hold the changed files
dir.create(file.path(data_path, 'deleted'),
           showWarnings = FALSE)

# read in each file and delete text between "#..#"
for (file_name in files) {
  input <- read_file(file.path(data_path, file_name))

  # remove every delimited block; dotall = TRUE lets . match
  # newlines and .*? keeps the match non-greedy
  input <- str_replace_all(input,
                           regex("#.*?#", dotall = TRUE),
                           "")

  # write back the changed file
  write_file(input,
             file.path(data_path,
                       'deleted',
                       file_name))
}
[CODE]
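
To complete the original workflow, the cleaned files in the 'deleted' sub-directory can then be read into quanteda. A minimal sketch, assuming data_path is set as above:

[CODE]
library(readtext)
library(quanteda)

# read the cleaned files and tokenise them
txtdat <- readtext(file.path(data_path, "deleted", "*.txt"))
toks <- tokens(corpus(txtdat))
[CODE]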

Upvotes: 1
