Ant

Reputation: 343

for loop not iterating through every row

I have a corpus of text, consisting of multiple MS Word files, that I would like to analyse. As the corpus is large (~10,000 lines) and NLP analysis (using the cleanNLP package) takes a long time and frequently crashes, I thought I could loop through the text line by line and analyse each line separately.

I've written the following loop, which aims to take each line of the initial text, extract any location entities and store the details in the corresponding row of the matrix text_mat.

#read in text corpus
all <- read_dir("N:/data/All")

#convert into dataframe usable by text packages
all_df <- tibble(line = 1:nrow(all), text = all$content)

#loop through every line for location extraction
#create unpopulated matrix
text_mat <- matrix(NA, nrow = nrow(all_df), ncol = 3)

#loop through each line, fill matrix with location output
for (i in nrow(all_df)) {
  line <- all_df[i, ]
  obj_line <- cnlp_annotate(line, as_strings = TRUE)
  loc <- cnlp_get_entity(obj_line) %>%
    filter(entity_type == "CITY" | entity_type == "LOCATION") %>%
    group_by(entity) %>%
    tally() %>%
    arrange(desc(n)) %>%
    rename("Count" = "n")
  text_mat[i, ] <- c(i, loc$entity, loc$Count)
  next 
}

#convert matrix to data frame
entity_df <- as.data.frame(text_mat)  

When I run the loop it completes very quickly, although I would expect it to take at least a few minutes, and the text_mat matrix remains empty. This makes me think that the loop is only analysing the first line of text and then stopping, but I'm not sure why. Any help explaining why this happens would be greatly appreciated.

Upvotes: 0

Views: 278

Answers (1)

Andrey Shabalin

Reputation: 4614

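The loop header should be for (i in 1:nrow(all_df)), not for (i in nrow(all_df)).

nrow(all_df) is a single number, so the original loop runs exactly once, with i equal to the last row index. With 1:nrow(all_df) it runs for all rows, not just the last one.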
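To see the difference in isolation, here is a minimal sketch that records which indices each loop header visits. The three-row all_df below is a hypothetical stand-in for your data, not your actual corpus, and visited_wrong / visited_right are names made up for the illustration. It uses seq_len(nrow(all_df)), which behaves like 1:nrow(all_df) for non-empty data.

```r
# Hypothetical stand-in for all_df; the real one is built from read_dir().
all_df <- data.frame(line = 1:3, text = c("a", "b", "c"))

# nrow(all_df) is the single value 3, so this loop body runs once with i = 3.
visited_wrong <- c()
for (i in nrow(all_df)) {
  visited_wrong <- c(visited_wrong, i)
}

# seq_len(nrow(all_df)) expands to 1, 2, 3, so the body runs once per row.
visited_right <- c()
for (i in seq_len(nrow(all_df))) {
  visited_right <- c(visited_right, i)
}

print(visited_wrong)  # [1] 3
print(visited_right)  # [1] 1 2 3
```

seq_len() is often preferred over 1:n because for a data frame with zero rows 1:0 evaluates to c(1, 0) and the loop body would still run twice, whereas seq_len(0) is empty and the loop is skipped.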

Upvotes: 3
