BRZ

Reputation: 695

Count the occurrences of words matching a pattern in R

Perhaps an oft-asked question, but I am royally stuck here.

From an XML file, I'm trying to find all occurrences, their line numbers, and the total count of each 12-character string containing only letters and digits (literally alphanumeric).

For example, if my file is xmlInput, I'm trying to search for and extract all the occurrences, positions, and total counts of each 12-character alphanumeric string.

Example output:

  String        Total Count     Line-Num
 CPXY180D2324   2               132,846
 CPXY180D2131   1               372
 CPCY180D2139   1               133       

I know that I could use regmatches to get all occurrences of a string by pattern. I've been using the code below for that (thanks for your help on this):

ProNum12<-regmatches(xmlInput, regexpr("([A-Z0-9]{12})", xmlInput))
ProNum12

regmatches gives me all the matches that follow the pattern, but it doesn't give me the line numbers where the pattern appeared. grep gives me the line numbers of all occurrences.

I thought I could use the textcnt function from the tau package, but I couldn't get it to run correctly. Perhaps it is not the right package?

Is there a package/library in R that will search for all words matching the pattern and return the total count of appearances and the line numbers of each occurrence? If no such package exists, any idea how I can do this using any of the above, or something better?

Upvotes: 3

Views: 4095

Answers (1)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Without seeing your data, it is hard to offer a suggestion on how to proceed. Here is an example with some plain character strings that might help you get started on finding a solution of your own.

First, some sample data (which probably looks nothing like your data):

x <- c("Some text with a strange CPXY180D2324 string stuck in it.", 
       "Some more text with CPXY180D2131 strange strings CPCY180D2139 stuck in it.", 
       "Even more text with strings that CPXY180D2131 don't make much sense.", 
       "I'm CPXY180D2324 tired CPXY180D2324 of CPXY180D2324 text with CPXY180D2131 strange strings CPCY180D2139 stuck in it.")

We can split it by spaces. This is another area where it might not fit your actual problem, but again, this is just to help you get started (or to help others provide a much better answer, as may be the case).

x2 <- strsplit(x, " ")
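If the strings in the real file are not cleanly separated by spaces (for example, if they sit inside XML tags or next to punctuation), splitting on spaces leaves those extra characters attached to each token. A minimal sketch of one alternative, assuming each element of the character vector is one line of the file: gregexpr() finds every match on a line (regexpr() only finds the first) and regmatches() extracts them. The sample_lines vector is made up for illustration.

```r
# Hypothetical lines where the strings touch tags and punctuation
sample_lines <- c("<id>CPXY180D2324</id>",
                  "no code here",
                  "CPXY180D2131,CPCY180D2139.")

# gregexpr() locates every match per element; regmatches() extracts them
hits <- regmatches(sample_lines, gregexpr("[A-Z0-9]{12}", sample_lines))

# One row per match, with the index of the line it came from;
# lines without a match simply contribute no rows
matches <- data.frame(line  = rep(seq_along(hits), lengths(hits)),
                      value = unlist(hits))
matches
```

Note that this pattern also matches the first 12 characters of any longer run of capitals and digits, so it may need tightening for real data.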

Search the split data for values matching your regex pattern. Create a data.frame that includes the line numbers and the matched string.

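temp <- do.call(rbind, lapply(seq_along(x2), function(y) { 
  # Tokens on line y that contain a 12-character run of A-Z/0-9
  hits <- grep("([A-Z0-9]{12})", x2[[y]], value = TRUE)
  # rep() keeps both columns the same length, even for lines with no match
  data.frame(line = rep(y, length(hits)), value = hits)
}))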
temp
#   line        value
# 1    1 CPXY180D2324
# 2    2 CPXY180D2131
# 3    2 CPCY180D2139
# 4    3 CPXY180D2131
# 5    4 CPXY180D2324
# 6    4 CPXY180D2324
# 7    4 CPXY180D2324
# 8    4 CPXY180D2131
# 9    4 CPCY180D2139

Create your data.frame of line numbers and counts.

with(temp, data.frame(
  lines = tapply(line, value, paste, collapse = ", "),
  count = tapply(line, value, length)))
#                   lines count
# CPXY180D2324 1, 4, 4, 4     4
# CPCY180D2139       2, 4     2
# CPXY180D2131    2, 3, 4     3
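To get closer to the layout in the question, where a line number is listed once even if a string occurs on that line several times, one option is to apply unique() to the line numbers before pasting them together. This is only a sketch, assuming a data.frame shaped like temp above; temp_demo and the column names String, TotalCount and LineNum are stand-ins chosen here so the block runs on its own.

```r
# Stand-in for `temp`: one row per match (line number, matched string)
temp_demo <- data.frame(
  line  = c(1, 2, 2, 3, 4, 4, 4, 4, 4),
  value = c("CPXY180D2324", "CPXY180D2131", "CPCY180D2139", "CPXY180D2131",
            "CPXY180D2324", "CPXY180D2324", "CPXY180D2324", "CPXY180D2131",
            "CPCY180D2139"))

# Total number of occurrences of each string
counts <- table(temp_demo$value)

# Distinct line numbers for each string, collapsed into one field
line_nums <- tapply(temp_demo$line, temp_demo$value,
                    function(l) paste(unique(l), collapse = ","))

# Combine both, indexing line_nums by name so the rows stay aligned
summary_df <- data.frame(String     = names(counts),
                         TotalCount = as.vector(counts),
                         LineNum    = as.vector(line_nums[names(counts)]))
summary_df
```

With the sample data, CPXY180D2324 ends up with a total count of 4 and the line numbers 1,4.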

Anyway, this is purely a guess (and me killing time....)

Upvotes: 4
