I have a dataframe called df_parse that looks like this:
error error_position
1 0
2 0
3 0
4 1 24 - 26
5 1 29 - 30
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 1 78 - 78
24 0
25 1 83 - 84
26 0
27 0
28 0
29 1 92 - 92
30 1 95 - 95
31 0
32 0
33 0
34 0
35 0
36 0
37 1 111 - 113
I want to find where the strings in the error_position column are in my raw data (character vector file). This is what the sample raw data looks like:
HUBUSL1 2 ENTER LINE NUMBER 81 - 82
FOR HUBUS = 1 VALID ENTRIES
83 - 84
VALID ENTRIES
1 MIN VALUE
99 MAX VALUE
HUBUSL3 2 See BUSL1 85 - 86
VALID ENTRIES
1 MIN VALUE
99 MAX VALUE
HUBUSL4 2 See BUSL1 87 - 88
VALID ENTRIES
1 MIN VALUE
99 MAX VALUE
A2. GEOGRAPHIC INFORMATION
GEREG 2 REGION 89 - 90
EDITED UNIVERSE: ALL HHLD's IN SAMPLE VALID ENTRIES
1 NORTHEAST
2 MIDWEST (FORMERLY NORTH CENTRAL)
3 SOUTH
4 WEST
GEDIV 1 DIVISION 91 - 91
EDITED UNIVERSE: ALL HHLD's IN SAMPLE VALID ENTRIES
92 – 92
GESTFIPS 2 FEDERAL INFORMATION 93 - 94
PROCESSING STANDARDS (FIPS) STATE CODE
For example, row 25 of the error_position column in df_parse, "83 - 84", matches these lines near the top of the raw sample:
FOR HUBUS = 1 VALID ENTRIES
83 - 84
And similarly "92 – 92" matches towards the end of the sample raw data file:
92 – 92
GESTFIPS 2 FEDERAL INFORMATION 93 - 94
I wrote a for loop that uses grep to return the element positions of the pattern values in "error_position" from the raw data vector.
results1<- vector(mode = "character", length = length(df_parse$error)) #empty vector
for(i in seq_along(df_parse$error)){
results1[i]<- ifelse(df_parse$error[i] == 1, grep(pattern = paste(df_parse$error_position[i]), x = raw, value = FALSE), "")
}
results1
These are the sample results:
[1] "" "" "" "37" "95" "" "" "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" "" "" "288" "" "298" "" "" ""
[29] NA "381" "" "" "" "" "" "" "444" "" "" "" "" ""
[43] "" "532" "" "551" "" "" "" "" NA "" "" "677" "" ""
[57] "712" "" "" "" "" "" "" "" "838" "" "" "" "" ""
[71] "" "" NA "" "" "" "" "991" "" "" "" "" "" ""
[85] "" "" NA "1140" "" "1158" "" "" "" "" "" "" "" ""
[99] "" "" "" "" "" "1283" "" "" "" NA "" "" "" ""
[113] "" "" "" "" "" "" "" "" "" "" "" "" "" "1658"
[127] "" "" "" NA "" "" "1749" "" "" "" "" "" "" ""
[141] "1824" "" "" "" "" "" "" "" "" "" "" "" "" ""
[155] "" "" "2065" "" "" "2109" "" "" "" "" "" "2161" "" ""
[169] "" NA "" "" "" "" "" "" "" "" "2344" "" "" ""
This is the result I wanted, because it tells me where each pattern match occurs in the raw data file, but I noticed there were NAs.
I found that the NAs occur because some of the number ranges in the raw data are not separated by a hyphen but by a longer dash (an en dash, "–"). For example, the raw data contains "92 – 92" with the long dash, while my grep, built from the error_position column, looks for a regular hyphen as in "24 - 26".
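As a side note on why a miss shows up as NA rather than as an empty string: when grep finds nothing it returns integer(0), and ifelse recycles a zero-length `yes` value to the length of the test, which produces NA. A minimal sketch, using a made-up vector `raw_demo` rather than the real data:

```r
# raw_demo is a made-up stand-in for the real raw vector
raw_demo <- c("GEDIV 1 DIVISION 91 - 91", "92 \u2013 92")  # second line uses U+2013

hit <- grep("92 - 92", raw_demo)   # hyphen pattern, so no line matches
length(hit)                        # 0, i.e. integer(0)

padded <- ifelse(TRUE, hit, "")    # zero-length 'yes' is recycled to length 1
is.na(padded)                      # TRUE
```

So every NA in results1 marks a row where the pattern was not found at all.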
I tried grepping for the long dash directly, but it still finds nothing. For example, element 29 of my loop results is NA because the loop searches for "92 - 92" instead of "92 – 92" (long dash).
MY PROBLEM: Even when I simply grep for "92 – 92" in the raw data, it returns integer(0) instead of a match.
Some of my tries:
grep(pattern = "92 – 92", x = raw, value = FALSE)  # returns integer(0)
grep(pattern = paste(df_parse$error_position[29]), x = raw, value = FALSE)  # returns integer(0)
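One way to check what is actually stored in a line is to look at its code points. The sketch below uses a made-up string, `demo_line`, not the real file; on the real data the suspect element of raw would be passed to utf8ToInt instead. For reference, a hyphen is 45, an en dash is 8211, an em dash is 8212, a plain space is 32 and a non-breaking space is 160.

```r
demo_line <- "92 \u2013 92"      # made-up line containing an en dash (U+2013)
codes <- utf8ToInt(demo_line)
codes
# 57 50 32 8211 32 57 50

# show only the characters outside plain ASCII
codes[codes > 127]
# 8211
```

If the code points in the file differ from those in the pattern (for example 160 where the pattern has 32, or a different dash), grep will not match even though the two strings look the same on screen.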
Would love to hear any suggestions. Thanks.
Reputation: 389047
How about looking for both hyphens and the long dashes?
You can try:
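for (i in seq_along(df_parse$error)) {
  results1[i] <- if (df_parse$error[i] == 1) {
    # accept either a hyphen or an en dash between the two numbers
    pat <- sub('-', '[-–]', df_parse$error_position[i])
    res <- grep(pat, raw)
    # keep the first match; fall back to "" when nothing matches
    if (length(res)) res[1] else ""
  } else ""
}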
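For anyone who wants to try the idea without the original file, here is a small self-contained sketch. The objects `raw_demo` and `df_demo` and the helper `find_pos` are made up for illustration, and the character class is widened to cover an em dash as well, which is an assumption about what the real file might contain.

```r
# Made-up stand-ins for the real objects in the question
raw_demo <- c("FOR HUBUS = 1 VALID ENTRIES", "83 - 84", "92 \u2013 92")
df_demo <- data.frame(error = c(0, 1, 1),
                      error_position = c("", "83 - 84", "92 - 92"),
                      stringsAsFactors = FALSE)

# Return the first matching line number, or "" when there is no error or no match
find_pos <- function(err, pos, lines) {
  if (err != 1) return("")
  # replace the first hyphen with a class: hyphen, en dash, or em dash
  pat <- sub("-", "[-\u2013\u2014]", pos, fixed = TRUE)
  res <- grep(pat, lines)
  if (length(res)) as.character(res[1]) else ""
}

results_demo <- mapply(find_pos, df_demo$error, df_demo$error_position,
                       MoreArgs = list(lines = raw_demo), USE.NAMES = FALSE)
results_demo
# "" "2" "3"
```

Checking length(res) before indexing is what keeps a miss as "" instead of NA, and taking res[1] keeps a single value per row when a range appears on more than one line.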