Reputation: 695
Perhaps an oft-asked question, but I am royally stuck here.
From an XML file, I'm trying to find every 12-character string containing only letters and numerals (literally alphanumeric), along with the lines on which each one occurs and its total count.
For example, if my file is xmlInput, I'm trying to search for and extract all the occurrences, positions, and total counts of each 12-character alphanumeric string.
Example output:
String        Total Count  Line-Num
CPXY180D2324  2            132,846
CPXY180D2131  1            372
CPCY180D2139  1            133
I know that I could use regmatches to get all occurrences of a string by pattern. I've been using the code below for that (thanks to your help on this):
ProNum12<-regmatches(xmlInput, regexpr("([A-Z0-9]{12})", xmlInput))
ProNum12
regmatches gives me all the matches that follow the pattern, but it doesn't give me the line numbers where the pattern appeared. grep gives me the line numbers of all occurrences.
I thought I could use the textcnt function from the tau package, but couldn't get it to run correctly. Perhaps it is not the right package?
Is there a package/library in R which will search for all words matching the pattern and return the total count of appearances and the line numbers of each occurrence? If no such package exists, any idea how I can do this using any of the above, or something better?
Upvotes: 3
Views: 4095
Reputation: 193517
Without seeing your data, it is hard to offer a suggestion on how to proceed. Here is an example with some plain character strings that might help you get started on finding a solution of your own.
First, some sample data (which probably looks nothing like your data):
x <- c("Some text with a strange CPXY180D2324 string stuck in it.",
"Some more text with CPXY180D2131 strange strings CPCY180D2139 stuck in it.",
"Even more text with strings that CPXY180D2131 don't make much sense.",
"I'm CPXY180D2324 tired CPXY180D2324 of CPXY180D2324 text with CPXY180D2131 strange strings CPCY180D2139 stuck in it.")
We can split it by spaces. This is another area where it might not fit your actual problem, but again, this is just to help you get started (or to help others provide a much better answer, as may be the case).
x2 <- strsplit(x, " ")
Search the split data for values matching your regex pattern. Create a data.frame that includes the line numbers and the matched string.
temp <- do.call(rbind, lapply(seq_along(x2), function(y) {
  data.frame(line = y,
             value = grep("([A-Z0-9]{12})", x2[[y]],
                          value = TRUE))
}))
temp
#   line        value
# 1    1 CPXY180D2324
# 2    2 CPXY180D2131
# 3    2 CPCY180D2139
# 4    3 CPXY180D2131
# 5    4 CPXY180D2324
# 6    4 CPXY180D2324
# 7    4 CPXY180D2324
# 8    4 CPXY180D2131
# 9    4 CPCY180D2139
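If splitting on spaces doesn't suit the real data (XML tags and punctuation would stay glued to the tokens), one possible alternative is to pull the matches straight out of each line with gregexpr and regmatches. This is only a sketch: the sample vector `x` below is made up, and the lookarounds in `pattern` are my own assumption about what "exactly 12 characters" should mean (not part of a longer run of upper-case letters or digits).

```r
# Made-up sample lines; they are NOT the asker's XML.
x <- c("id=CPXY180D2324, next",
       "<a>CPXY180D2131</a><b>CPCY180D2139</b>",
       "no code here, but TOOLONGCODE12345 is 16 characters")

# Exactly 12 upper-case letters/digits, not embedded in a longer run.
# Lookarounds need perl = TRUE.
pattern <- "(?<![A-Z0-9])[A-Z0-9]{12}(?![A-Z0-9])"

# gregexpr() finds every match on every line (regexpr() would stop at the
# first match per line); regmatches() returns them as a list, one element
# per line, with character(0) where nothing matched.
found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Repeat each line number once per match found on that line.
hits <- data.frame(line  = rep(seq_along(found), lengths(found)),
                   value = unlist(found),
                   stringsAsFactors = FALSE)
hits
```

The resulting `hits` has the same two columns as `temp`, so the counting step works on it unchanged.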
Create your data.frame of line numbers and counts.
with(temp, data.frame(
  lines = tapply(line, value, paste, collapse = ", "),
  count = tapply(line, value, length)))
#                   lines count
# CPXY180D2324 1, 4, 4, 4     4
# CPCY180D2139       2, 4     2
# CPXY180D2131    2, 3, 4     3
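To get closer to the layout in the question (string, total count, comma-separated line numbers), the same summary could also be built with split. A rough sketch follows; the `hits` data here is hypothetical and simply reuses the line numbers from the question's example output, and the column names `String`, `Total.Count` and `Line.Num` are my own choice.

```r
# Hypothetical line/value pairs, using the numbers from the question's
# example output; real data would come from the matching step.
hits <- data.frame(line  = c(132, 133, 372, 846),
                   value = c("CPXY180D2324", "CPCY180D2139",
                             "CPXY180D2131", "CPXY180D2324"),
                   stringsAsFactors = FALSE)

# Group the line numbers by matched string.
by_string <- split(hits$line, hits$value)

# One row per string: how often it occurred and on which lines.
result <- data.frame(
  String      = names(by_string),
  Total.Count = unname(lengths(by_string)),
  Line.Num    = unname(vapply(by_string, paste, character(1), collapse = ",")),
  stringsAsFactors = FALSE)

# Most frequent strings first.
result <- result[order(-result$Total.Count), ]
result
```

Using split rather than tapply on a factor keeps the result independent of whether `value` was stored as a factor or as character.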
Anyway, this is purely a guess (and me killing time....)
Upvotes: 4