Reputation: 53
First question here, so apologies for any faux pas. I have a data frame in R with 657 observations of 4 variables. Each observation is a speech or interview by the Australian Prime Minister. The variables are date, title, URL and txt.
I'm trying to turn that into a corpus in quanteda.
I first tried corp <- corpus(all_content)
but that gave me an error message:
Error in corpus.data.frame(all_content) :
text_field column not found or invalid
This worked though: corp <- corpus(paste(all_content))
Then summary(corp)
which gave me
Corpus consisting of 4 documents, showing 4 documents:
  Text Types  Tokens Sentences
 text1   243    1316         1
 text2  1095    6523         3
 text3   661    2630         1
 text4 25243 1867648     62572
My understanding is that this has effectively turned each column into a document, rather than each row?
If it matters, the txt variable is saved as a list. The code used to create each row is:
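To illustrate what seems to be going on, here is a minimal base R sketch with a made-up data frame (`toy` and its values are hypothetical, not the real data). As far as I can tell, paste() coerces a data frame the way it would a list, so it returns one string per column rather than one per row, which would explain the four "documents":

```r
# Hypothetical stand-in for all_content: 3 rows, 2 columns
toy <- data.frame(
  title = c("Speech A", "Speech B", "Speech C"),
  txt   = c("First text.", "Second text.", "Third text."),
  stringsAsFactors = FALSE
)

# paste() on a data frame yields one (deparsed) string per column
pasted <- paste(toy)

length(pasted)  # matches ncol(toy), not nrow(toy)
```

If that holds for the real data, corpus(paste(all_content)) would only ever see as many texts as there are columns.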
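```{r new_function}
library(rvest)      # read_html(), html_nodes(), html_text(), %>%
library(lubridate)  # dmy()
library(tibble)     # tibble()

scrape_speech <- function(url) {
  speech_page <- read_html(url)
  date <- speech_page %>% html_nodes(".date-display-single") %>% html_text() %>% dmy()
  title <- speech_page %>% html_nodes(".pagetitle") %>% html_text()
  # list() wraps all paragraphs of one page into a single list element
  txt <- speech_page %>% html_nodes("#block-system-main p") %>% html_text() %>% list()
  tibble(date = date, title = title, URL = url, txt = txt)
}
```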
I then used the map_dfr function to go through and scrape the 657 separate URLs.
Someone has suggested to me that it is because txt is saved as a list. I've tried without the list() in the function, and I get 21,904 observations, as each paragraph in the full text document turns into a separate observation. I can turn that into a corpus with corp <- corpus(paste(all_content_not_list)). (Once again, without the paste I get the same error as above.) That similarly gives me 4 documents in the corpus!
summary(corp)
Gives me
Corpus consisting of 4 documents, showing 4 documents:
  Text Types  Tokens Sentences
 text1   243   43810         1
 text2  1092  214970        25
 text3   657   87618         1
 text4 25243 1865687     62626
Thanks in advance, Daniel
Upvotes: 2
Views: 1165
Reputation: 14902
It's hard to address this problem exactly, because there is no reproducible example of your data.frame object, but if the structure contains the variables you list, then this should do it:
corpus(all_content, text_field = "txt")
See ?corpus.data.frame for details. If that does not do it, then try adding to your question the output of
str(all_content)
so that we can see in more detail what is in your all_content object.
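As a lighter-weight check alongside str(), a base R sketch like the following can flag which columns are list columns. The data frame `toy` and its contents are made up for illustration; substitute all_content for the real check:

```r
# Hypothetical two-row data frame with a character column and a list column
toy <- data.frame(title = c("A", "B"), stringsAsFactors = FALSE)
toy$txt <- list(c("para 1", "para 2"), "para 3")

# TRUE for columns stored as lists, FALSE for ordinary atomic columns
col_is_list <- vapply(toy, is.list, logical(1))
col_is_list
```

A column that comes back TRUE here probably needs to be collapsed to plain character before it can serve as a text field.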
Edited following OP's addition of new data:
OK, so txt in your tibble is a list of character vectors, one vector of paragraphs per row. You need to collapse each of these into a single character string in order to use the column as an input to corpus.data.frame(). Here's how:
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dframe <- structure(list(
date = structure(18620, class = "Date"),
title = " Prime Minister's Christmas Message to the ADF",
URL = "https://www.pm.gov.au/media/prime-ministers-christmas-message-adf",
txt = list(c(
"G'day and Merry Christmas to everyone in our Australian Defence Force.",
"You know, throughout our history, successive Australian governments... And this year was no different.",
"God bless."
))
),
row.names = c(NA, -1L),
class = c("tbl_df", "tbl", "data.frame")
)
dframe$txt <- vapply(dframe$txt, paste, character(1), collapse = " ")
corp <- corpus(dframe, text_field = "txt")
print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 3 docvars.
## text1 :
## "G'day and Merry Christmas to everyone in our Australian Defence Force. You know, throughout our history, successive Australian governments... And this year was no different. God bless."
Created on 2021-04-08 by the reprex package (v1.0.0)
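The collapsing step can also be tried on its own in base R, without quanteda. This is only a sketch: `txt_list` is a made-up stand-in for a multi-row txt column, including an empty element to mimic a page where no paragraphs were scraped.

```r
# Hypothetical list column: one character vector of paragraphs per speech
txt_list <- list(
  c("Para one.", "Para two."),
  "Only para.",
  character(0)   # a page with no matching paragraphs
)

# paste(..., collapse = " ") always returns a single string, so
# vapply() can safely insist on character(1) for every element
collapsed <- vapply(txt_list, paste, character(1), collapse = " ")

collapsed
length(collapsed)  # one string per original row
```

Using vapply() rather than sapply() means an element that somehow failed to collapse to exactly one string would raise an error instead of silently producing a list again.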
Upvotes: 1