Search multiple columns for string

Question

I have a large dataframe (approximately 40000 x 500) that I read from a .CSV file. The dataset contains a list of error code combinations, with each combination specifically listing all the unique codes. I'm searching for the rows that contain a certain piece of string, like "name=", to see if it is known who listed that combination. The problem is that I do not know in which column this may occur, since the 10th column and onwards are irregular per row, as the number of error codes per combination is irregular. This means that after some point the rows are empty, except for the biggest error code combination. I only know that if the row contains the piece of string I'm looking for, it will be in one of the last 6 columns of a row containing any information. If the piece of string I'm looking for is found in a row, I would like an answer in a different column.

I have extensively looked for a solution, but I couldn't find anything that serves my needs. Also, I'm new to R.

[Fake dataset]

  Year  Problem_ID  Number.of.errors  X      X.1    X.2       X.3       X.4
1 2005  Server      2                 X-4555 X-1222 name=Sara
2 2011  Hardware    3                 X-8922 X-7644 X-6485
3 2015  Software    4                 X-8494 X-4321 X-7452    X-5321   name=John
4 2018  Hardware    2                 X-6901 X-1121  
5 2007  Server      3                 X-7655 X-6555 X-8999    name=Sara

[Desired output]

  Year  Problem_ID  Number.of.errors Name X      X.1    X.2       X.3       X.4
1 2005  Server      2                1    X-4555 X-1222 name=Sara
2 2011  Hardware    3                0     X-8922 X-7644 X-6485
3 2015  Software    4                1     X-8494 X-4321 X-7452    X-5321   name=John
4 2018  Hardware    2                0     X-6901 X-1121  
5 2007  Server      3                1     X-7655 X-6555 X-8999    name=Sara

MKR · Accepted Answer

Option #1:

You can use apply as suggested by @RomanLuštrik to get name column. One has to check if any column in that row has matching text to name= using any and grep. The solution will be like:

df$name <- apply(df, 1, function(x)as.integer(any(grep("name=",x))))

df
#   Year Problem_ID Number.of.errors      X    X.1       X.2       X.3       X.4 name
# 1 2005     Server                2 X-4555 X-1222 name=Sara                1
# 2 2011   Hardware                3 X-8922 X-7644    X-6485                0
# 3 2015   Software                4 X-8494 X-4321    X-7452    X-5321 name=John    1
# 4 2018   Hardware                2 X-6901 X-1121                      0
# 5 2007     Server                3 X-7655 X-6555    X-8999 name=Sara          1

Option#2: In case OP is interested in only rows which contains name= text then it would be user csv line-wise and then filter out rows containing name=.

  All_lines <- readLines(file_name)

  desired_lines <- grep("name=", All_lines, value = TRUE)

The desired_lines got only those rows which contains name=.

Data:

df <- read.table(text = 
"Year  Problem_ID  Number.of.errors  X      X.1    X.2       X.3       X.4
1 2005  Server      2                 X-4555 X-1222 name=Sara NA       NA
2 2011  Hardware    3                 X-8922 X-7644 X-6485    NA       NA
3 2015  Software    4                 X-8494 X-4321 X-7452    X-5321   name=John
4 2018  Hardware    2                 X-6901 X-1121 NA       NA       NA 
5 2007  Server      3                 X-7655 X-6555 X-8999    name=Sara    NA",
header = TRUE, stringsAsFactors = FALSE)

Search multiple columns for string

Answers (2)

Related Questions