richardgasquet

Reputation: 109

Grep is returning NA when it should be returning a string

I have a dataframe called df_parse that looks like this:

     error             error_position
1       0                           
2       0                           
3       0                           
4       1                    24 - 26
5       1                    29 - 30
6       0                           
7       0                           
8       0                           
9       0                           
10      0                           
11      0                           
12      0                           
13      0                           
14      0                           
15      0                           
16      0                           
17      0                           
18      0                           
19      0                           
20      0                           
21      0                           
22      0                           
23      1                    78 - 78
24      0                           
25      1                    83 - 84
26      0                           
27      0                           
28      0                           
29      1                    92 - 92
30      1                    95 - 95
31      0                           
32      0                           
33      0                           
34      0                           
35      0                           
36      0                           
37      1                  111 - 113

I want to find where the strings in the error_position column occur in my raw data (a character vector named raw). This is what a sample of the raw data looks like:

    HUBUSL1 2   ENTER LINE NUMBER   81 - 82
    FOR HUBUS = 1 VALID ENTRIES
    
    
    83 - 84
    
    VALID ENTRIES
    
    1   MIN VALUE
    99  MAX VALUE
    
    HUBUSL3 2   See BUSL1   85 - 86
    
    VALID ENTRIES
    
    1   MIN VALUE
    99  MAX VALUE
    
    HUBUSL4 2   See BUSL1   87 - 88
    
    VALID ENTRIES
    
    1   MIN VALUE
    99  MAX VALUE
     
    
    
    A2. GEOGRAPHIC INFORMATION
    GEREG   2   REGION  89 - 90
    
    EDITED UNIVERSE:    ALL HHLD's IN SAMPLE VALID ENTRIES
    1   NORTHEAST
    2   MIDWEST (FORMERLY NORTH CENTRAL)
    3   SOUTH
    4   WEST
    
    GEDIV   1   DIVISION    91 - 91
    
    EDITED UNIVERSE:    ALL HHLD's IN SAMPLE VALID ENTRIES








    92 – 92
    
    
    GESTFIPS    2   FEDERAL INFORMATION 93 - 94
    PROCESSING STANDARDS (FIPS) STATE CODE

For example, in the error_position column of df_parse, row 25 ("83 - 84") matches the 5th line of the raw file:

FOR HUBUS = 1 VALID ENTRIES


    83 - 84

And similarly "92 – 92" matches towards the end of the sample raw data file:

92 – 92


GESTFIPS    2   FEDERAL INFORMATION 93 - 94

I wrote a for loop that uses grep to return the element positions of the pattern values in "error_position" from the raw data vector.

results1<- vector(mode = "character", length = length(df_parse$error)) #empty vector

for(i in seq_along(df_parse$error)){
    results1[i]<- ifelse(df_parse$error[i] == 1, grep(pattern = paste(df_parse$error_position[i]), x = raw, value = FALSE), "")
    
}

results1 

These are the sample results:

[1] ""     ""     ""     "37"   "95"   ""     ""     ""     ""     ""     ""     ""     ""     ""    
 [15] ""     ""     ""     ""     ""     ""     ""     ""     "288"  ""     "298"  ""     ""     ""    
 [29] NA     "381"  ""     ""     ""     ""     ""     ""     "444"  ""     ""     ""     ""     ""    
 [43] ""     "532"  ""     "551"  ""     ""     ""     ""     NA     ""     ""     "677"  ""     ""    
 [57] "712"  ""     ""     ""     ""     ""     ""     ""     "838"  ""     ""     ""     ""     ""    
 [71] ""     ""     NA     ""     ""     ""     ""     "991"  ""     ""     ""     ""     ""     ""    
 [85] ""     ""     NA     "1140" ""     "1158" ""     ""     ""     ""     ""     ""     ""     ""    
 [99] ""     ""     ""     ""     ""     "1283" ""     ""     ""     NA     ""     ""     ""     ""    
[113] ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     "1658"
[127] ""     ""     ""     NA     ""     ""     "1749" ""     ""     ""     ""     ""     ""     ""    
[141] "1824" ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""     ""    
[155] ""     ""     "2065" ""     ""     "2109" ""     ""     ""     ""     ""     "2161" ""     ""    
[169] ""     NA     ""     ""     ""     ""     ""     ""     ""     ""     "2344" ""     ""     ""    

This is the result I wanted, because it tells me where each pattern match occurs in the raw data, but I noticed that some elements are NA.

I found that the NAs occur because some of the number ranges in the raw data are written with a long dash ("–", an en dash) rather than a plain hyphen. For example, the raw data contains "92 – 92" (long dash), while my grep, which takes its pattern from the error_position column, looks for a regular hyphen, as in "24 - 26".

I tried grepping for the long dash directly, but I still get no match. For example, element 29 of my loop results is NA because the loop searches for "92 - 92" rather than "92 – 92" (long dash).

MY PROBLEM: Even when I grep for the literal value "92 – 92" in the raw data, it returns integer(0) rather than a position.

Some of my tries (both return integer(0)):

    grep(pattern = "92 – 92", x = raw, value = FALSE)
    grep(pattern = paste(df_parse$error_position[29]), x = raw, value = FALSE)

Would love to hear any suggestions. Thanks.

Upvotes: 0

Views: 174

Answers (1)

Ronak Shah

Reputation: 389047

How about looking for plain hyphens as well as long dashes?

You can try:

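for(i in seq_along(df_parse$error)){
  results1[i] <- if(df_parse$error[i] == 1) {
    # let the hyphen in the pattern match either "-" or the long dash "–"
    pat <- sub('-', '[-–]', df_parse$error_position[i])
    res <- grep(pat, raw)
    # keep the first match; fall back to "" when grep finds nothing
    if(length(res)) res[1] else ""
  } else ""
}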
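If it helps, here is a small self-contained sketch of the same idea that can be run without the real file. The raw and df_parse values below are made-up stand-ins, and find_first is a hypothetical helper name, not something from the question. It also shows two checks that may be useful: inspecting the code points of a raw line to confirm which dash it really contains (a dash pasted into a script is not guaranteed to be the same character as the one in the file), and the fact that ifelse() gives NA when grep() returns integer(0), which would explain the NAs in the original loop.

```r
# Toy stand-ins for the question's objects (hypothetical data, not the real file)
raw <- c("HUBUSL1 2   ENTER LINE NUMBER   81 - 82",
         "83 - 84",
         "92 \u2013 92",   # written with an en dash (U+2013)
         "GESTFIPS    2   FEDERAL INFORMATION 93 - 94")
df_parse <- data.frame(error = c(0, 1, 1, 1),
                       error_position = c("", "83 - 84", "92 - 92", "99 - 99"),
                       stringsAsFactors = FALSE)

# Check which dash a line contains: 45 is "-", 8211 is the en dash
dash_codes <- utf8ToInt(raw[3])

# ifelse() cannot return a zero-length result, so a failed grep becomes NA
na_demo <- ifelse(TRUE, integer(0), "")

# Hypothetical helper: position of the first line matching either dash, or ""
find_first <- function(pos, lines) {
  pat <- sub("-", "[-\u2013]", pos, fixed = TRUE)
  hit <- grep(pat, lines)
  if (length(hit)) as.character(hit[1]) else ""
}

results <- vapply(seq_len(nrow(df_parse)), function(i) {
  if (df_parse$error[i] == 1) find_first(df_parse$error_position[i], raw) else ""
}, character(1))

results
# "" "2" "3" ""
```

If dash_codes shows some other value (for example 8212, the em dash), that code point would need to be added to the character class as well.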

Upvotes: 1
