Reputation: 21
I am using char_segment from the quanteda library to split one file into multiple documents separated by a pattern. The command works well and is easy to use (I tried str_match and strsplit, but without success).
Unfortunately, I am unable to keep the filename as a variable, which is key to my next analysis.
Example of my commands:
library(quanteda)
library(readtext)
doc <- readtext("PATH/*.docx")
View(doc)
docc <- char_segment(doc$text, pattern = ",", remove_pattern = TRUE)
Any suggestions or other options for splitting the documents are welcome.
Upvotes: 0
Views: 167
Reputation: 5898
Simply get the list of your docx files first; that gives you the names of the files. Then run the char_segment function on each of them with lapply, a loop, or purrr::map().
The following code assumes that your target documents are stored in a directory called "docx" within your working directory.
library(quanteda)
library(readtext) ## Remember to include in your posts the libraries required to replicate the code.
list_of_docx <- list.files(path = "./docx", ## Looks inside the ./docx directory
                           full.names = TRUE, ## Retrieves the full path to each document
                           pattern = "[.]docx$", ## Retrieves all documents whose names end in ".docx"
                           ignore.case = TRUE) ## Ignores the letter case of the documents' names
df_docx <- data.frame() ## Create an empty data frame to store your data
for (d in seq_along(list_of_docx)) { ## Iterate over the indices of the list of document paths
  temp_object <- readtext(list_of_docx[d])
  temp_segmented_object <- char_segment(temp_object$text, pattern = ",", remove_pattern = TRUE)
  temp_df <- as.data.frame(temp_segmented_object)
  colnames(temp_df) <- "segments"
  temp_df$title <- as.character(list_of_docx[d]) ## Create a variable with the title of the source document
  temp_df <- temp_df[, c("title", "segments")]
  df_docx <- rbind(df_docx, temp_df) ## Append each data frame to the previously created data frame
  rm(temp_df, temp_object, temp_segmented_object) ## Remove the temporary objects before the next iteration
}
head(df_docx)
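The same idea can be written with lapply instead of a loop. The following is only a minimal sketch: the demo files are hypothetical plain-text stand-ins written to a temporary directory so the code runs without any Word documents, and segment_file is a helper name made up for illustration. It assumes char_segment is available in your installed version of quanteda.

```r
library(quanteda)
library(readtext)

## Hypothetical demo files (plain .txt instead of .docx, so no Word files are needed)
demo_dir <- file.path(tempdir(), "segment_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("alpha, beta, gamma", file.path(demo_dir, "a.txt"))
writeLines("delta, epsilon", file.path(demo_dir, "b.txt"))

files <- list.files(demo_dir, pattern = "[.]txt$", full.names = TRUE)

## Hypothetical helper: segment one file and tag every segment with its file name
segment_file <- function(path) {
  txt <- readtext(path)$text
  segments <- char_segment(txt, pattern = ",", remove_pattern = TRUE)
  data.frame(title = basename(path),
             segments = unname(segments),
             stringsAsFactors = FALSE,
             row.names = NULL)
}

## One data frame per file, stacked into a single data frame
df_segments <- do.call(rbind, lapply(files, segment_file))
head(df_segments)
```

Using basename() stores only the file name in title; drop it if you prefer the full path, as in the loop above.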
Upvotes: 0
Reputation: 21
This is an example of what I get when I read the text. My problem appears when I separate the documents by ###*.
Upvotes: 0
Reputation: 880
You should already have the names of the Word files:
require(readtext)
data_dir <- system.file("extdata/", package = "readtext")
readtext(paste0(data_dir, "/word/*"))
readtext object consisting of 6 documents and 0 docvars.
# data.frame [6 × 2]
  doc_id                                 text
  <chr>                                  <chr>
1 21Parti_Socialiste_SUMMARY_2004.doc    "\"[pic]\nRésu\"..."
2 21vivant2004.doc                       "\"http://www\"..."
3 21VLD2004.doc                          "\"http://www\"..."
4 32_socialisti_democratici_italiani.doc "\"DIVENTARE \"..."
5 UK_2015_EccentricParty.docx            "\"The Eccent\"..."
6 UK_2015_LoonyParty.docx                "\"The Offici\"..."
They are passed to quanteda's downstream objects as document names.
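To illustrate, here is a rough sketch of segmenting at the corpus level so that each segment keeps a link to its source file. The named character vector texts is a made-up stand-in for the output of readtext(), with the names playing the role of doc_id. The sketch assumes a quanteda version (3.0 or later, as far as I know) that provides docid().

```r
library(quanteda)

## Hypothetical stand-in for readtext() output: names act as the file names
texts <- c("file_a.docx" = "first part, second part, third part",
           "file_b.docx" = "one section, another section")

corp <- corpus(texts)

## Split every document on commas; segments get new document names,
## but docid() still reports the document each segment came from
segs <- corpus_segment(corp, pattern = ",", valuetype = "fixed",
                       pattern_position = "after")

df <- data.frame(title = as.character(docid(segs)),
                 segments = as.character(segs),
                 stringsAsFactors = FALSE,
                 row.names = NULL)
head(df)
```

With real files, corpus(readtext("PATH/*.docx")) should be able to replace corpus(texts), since a readtext object already carries doc_id.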
Upvotes: 0