Reputation: 87
I am hoping for advice on removing text from a corpus of 833 text files. I want to remove any text contained within # characters. For example:
#Monday, 17 November 2027#
It is necessary to repeat some of the rhetoric contained in the government's national forest policy statement and compare it with reality. It states:
#the governments must establish clear and consistent policies for resource development, providing secure access to resources and consistent environmental guidelines. . . . . . . . . . A range of sustainable forest based industries, founded on excellence and innovation, will be expanding to contribute further to regional and national economic and employment growth. . . . . . . . . . . . .. governments acknowledge their role in seeking to minimise any adverse social and economic effects of the structural adjustment process, particularly where alternative employment is not always available.#
Extensive areas of productive forest, which have sustained rural economies and jobs throughout NSW for decades, have quickly been declared national park and wilderness. This action has occurred before Regional Forest Agreements have been completed.
I want only the following text:
It is necessary to repeat some of the rhetoric contained in the government's national forest policy statement and compare it with reality. It states:
Extensive areas of productive forest, which have sustained rural economies and jobs throughout NSW for decades, have quickly been declared national park and wilderness. This action has occurred before Regional Forest Agreements have been completed.
The file structure follows:
txtdata = readtext("E:/H/Data/*")
readtext object consisting of 833 documents and 0 docvars.
doc_id text
1 #10_3-7-98.txt ""#Date Frid"..."
2 #11_2-7-98.txt ""#Date Thur"..."
3 #12_30-6-98.txt ""#Date Tues"..."
4 #13_29-6-98.txt ""#Date Mond"..."
5 #14_29-6-98.txt ""#Date Mond"..."
6 #15_29-6-98.txt ""# Date Mon"..."
Upvotes: 0
Views: 50
Reputation: 87
This seems to have done the job:
x <- list.files("C:/Data/files/*", recursive = TRUE)
library("stringi")
stri_replace_all_regex(x, "#.*#\n{2}", "") |>
cat()
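One thing to watch: list.files() returns file names, not file contents, so the text has to be read in before the replacement can act on it. A minimal sketch of that step, assuming plain-text files read with base R's readLines(); the helper name strip_hash_spans and the temporary demo directory are made up for illustration:

```r
library("stringi")

# Hypothetical helper: read one file, drop "#...#" spans, return the cleaned text.
strip_hash_spans <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  stri_replace_all_regex(txt, "#.*#\n{2}", "")
}

# Demo corpus in a temporary directory (stand-in for the real data folder).
demo_dir <- tempfile("corpus_")
dir.create(demo_dir)
writeLines(
  c("#Date Friday#", "", "Body text.", "", "#quoted span#", "", "More body text."),
  file.path(demo_dir, "a.txt")
)

# full.names = TRUE gives paths that readLines() can open directly.
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE, recursive = TRUE)
cleaned <- vapply(files, strip_hash_spans, character(1))
cat(cleaned[[1]])
```

The result is a named character vector with one cleaned document per file; writing it back out with writeLines() would be a separate step.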
Upvotes: 0
Reputation: 14902
No real need for quanteda here: you can remove the spans between # characters using regular expression replacement. I prefer the excellent stringi package for this.
The regular expression removes all characters (.*) between a pair of # characters, and the \n{2} cleans things up a bit by also removing the two newline characters present after the removed span. Because . does not match a newline by default, each match stays on a single line.
txt <- "#Monday, 17 November 2027#

It is necessary to repeat some of the rhetoric contained in the government's national forest policy statement and compare it with reality. It states:

#the governments must establish clear and consistent policies for resource development, providing secure access to resources and consistent environmental guidelines. . . . . . . . . . A range of sustainable forest based industries, founded on excellence and innovation, will be expanding to contribute further to regional and national economic and employment growth. . . . . . . . . . . . .. governments acknowledge their role in seeking to minimise any adverse social and economic effects of the structural adjustment process, particularly where alternative employment is not always available.#

Extensive areas of productive forest, which have sustained rural economies and jobs throughout NSW for decades, have quickly been declared national park and wilderness. This action has occurred before Regional Forest Agreements have been completed."
library("stringi")
stri_replace_all_regex(txt, "#.*#\n{2}", "") |>
cat()
#> It is necessary to repeat some of the rhetoric contained in the
#> government's national forest policy statement and compare it with
#> reality. It states:
#>
#> Extensive areas of productive forest, which have sustained rural
#> economies and jobs throughout NSW for decades, have quickly been
#> declared national park and wilderness. This action has occurred before
#> Regional Forest Agreements have been completed.
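The pattern above only matches a span that is followed by exactly two newlines, so a # span sitting in the middle of a sentence would be left alone. If that case occurs in the corpus, one possible variant (a sketch, assuming a span never contains a # itself) uses a negated character class and swallows any trailing whitespace. The variable names here are illustrative only:

```r
library("stringi")

inline <- "Intro #aside one# middle #aside two# end."

# Original pattern: no match, because neither span is followed by two newlines.
loose <- stri_replace_all_regex(inline, "#.*#\n{2}", "")

# Variant: [^#]* stops at the next #, and \\s* removes whitespace after the span.
strict <- stri_replace_all_regex(inline, "#[^#]*#\\s*", "")

print(loose)   # unchanged input
print(strict)  # "Intro middle end."
```

Note that [^#]* also matches newlines, so this variant would remove a span that runs across several lines, which may or may not be what is wanted.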
Created on 2023-07-19 with reprex v2.0.2
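For the full corpus, the same call should work on the whole text column at once, since stri_replace_all_regex() is vectorised. A rough sketch, assuming the readtext object behaves like a data frame with doc_id and text columns (as the printed output in the question suggests); a plain data.frame with invented contents stands in for it here:

```r
library("stringi")

# Stand-in for the readtext object; the doc_id values and texts are made up.
txtdata <- data.frame(
  doc_id = c("doc_a.txt", "doc_b.txt"),
  text = c("#Date Friday#\n\nFirst body.", "#Date Thursday#\n\nSecond body."),
  stringsAsFactors = FALSE
)

# One vectorised call cleans every document in the text column.
txtdata$text <- stri_replace_all_regex(txtdata$text, "#.*#\n{2}", "")
print(txtdata$text)
```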
Upvotes: 0