Reputation: 59
I have a character vector of file paths, each of which contains a company name. I also have a data frame with a column that contains the company name.
For each row, I first want to check that df$style_name is 'Title', and then check whether the company name in that row (df$text) appears in one of the file paths.
If so, the new column df$record should contain the corresponding file path.
This is the character vector.
filenames <- list.files(path = dir, pattern = "*.docx|*.DOCX", full.names = TRUE)
> filenames
[1] "C:/Temp/data/D21 248694 Company Data - ABC Co - August 2021.DOCX"
[2] "C:/Temp/data/D21 248706 Company Data – XYZ Limited – September 2021.DOCX"
The data frame I currently have.
style_name | text | record |
---|---|---|
Title | ABC Co | NA |
List Bullet | blah blah | NA |
List Bullet | blah blah | NA |
Title | XYZ Limited | NA |
List Bullet | blah blah | NA |
The data frame I am after.
style_name | text | record |
---|---|---|
Title | ABC Co | C:/Temp/data/D21 248694 Company Data - ABC Co - August 2021.DOCX |
List Bullet | blah blah | NA |
List Bullet | blah blah | NA |
Title | XYZ Limited | C:/Temp/data/D21 248706 Company Data – XYZ Limited – September 2021.DOCX |
List Bullet | blah blah | NA |
This is my current code. I think the for loop is wrong, because it only populates the last row, the one that matches the last file path in the vector.
for (file in filenames) {
df$record <- ifelse((df$style_name == 'Title' & str_detect(tolower(file),tolower(df$text))), file, NA)
}
Upvotes: 0
Reputation: 9858
We can use dplyr, tidyr, stringr, and purrr (basically the entire tidyverse).
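library(tidyverse)
# For 'Title' rows, collect the file paths that contain the company name;
# fixed() matches the name literally rather than as a regular expression.
df %>%
  mutate(record = ifelse(style_name == 'Title',
                         map(text, ~ filenames[str_detect(filenames, fixed(.x))]),
                         NA)) %>%
  # Flatten the list column; rows with no match end up as NA
  unnest(cols = record, keep_empty = TRUE)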
# A tibble: 5 x 3
style_name text record
<chr> <chr> <chr>
1 Title ABC Co C:/Temp/data/D21 248694 Company Data - ABC Co - August 2021.DOCX
2 List Bullet blah blah NA
3 List Bullet blah blah NA
4 Title XYZ C:/Temp/data/D21 248706 Company Data – XYZ Limited – September 2021.DOCX
5 List Bullet blah blah NA
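The question compares lower-cased strings, so a case-insensitive variant may be closer to what was asked. The following is only a sketch under that assumption: first_match is a hypothetical helper name (not part of any package), the data is copied from the question, and only the first matching path is kept for each row.

```r
library(dplyr)
library(purrr)
library(stringr)

filenames <- c(
  "C:/Temp/data/D21 248694 Company Data - ABC Co - August 2021.DOCX",
  "C:/Temp/data/D21 248706 Company Data – XYZ Limited – September 2021.DOCX"
)
df <- tibble(
  style_name = c("Title", "List Bullet", "List Bullet", "Title", "List Bullet"),
  text       = c("ABC Co", "blah blah", "blah blah", "XYZ Limited", "blah blah")
)

# Hypothetical helper: first path that contains `name` as a literal string,
# ignoring case; NA_character_ when nothing matches.
first_match <- function(name, paths) {
  hits <- paths[str_detect(paths, fixed(name, ignore_case = TRUE))]
  if (length(hits) > 0) hits[[1]] else NA_character_
}

result <- df %>%
  mutate(record = if_else(style_name == "Title",
                          map_chr(text, first_match, paths = filenames),
                          NA_character_))
```

Because map_chr always returns a character vector of the same length as text, no unnest step is needed here; the trade-off is that additional matches for the same company are silently dropped.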
Upvotes: 0
Reputation: 388862
You can try -
# Initialise the record column to NA
df$record <- NA_character_
# Row numbers where style_name is 'Title'
inds <- which(df$style_name == 'Title')
# For each such row, find the file names that contain the company name
# (fixed = TRUE treats the name as a literal string, not a regex)
for (i in inds) {
  val <- grep(df$text[i], filenames, value = TRUE, fixed = TRUE)
  # Keep the first match, if there is one
  if (length(val)) df$record[i] <- val[1]
}
df
# style_name text record
#1 Title ABC Co C:/Temp/data/D21 248694 Company Data - ABC Co - August 2021.DOCX
#2 List Bullet blah blah <NA>
#3 List Bullet blah blah <NA>
#4 Title XYZ Limited C:/Temp/data/D21 248706 Company Data – XYZ Limited – September 2021.DOCX
#5 List Bullet blah blah <NA>
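As for why the loop in the question fills only one row: each iteration assigns the whole df$record column, so the result of the previous file is overwritten and only the last file's matches survive. A minimal sketch of that loop repaired, assuming the data from the question and literal, case-insensitive matching, is to assign only to the rows that matched the current file:

```r
filenames <- c(
  "C:/Temp/data/D21 248694 Company Data - ABC Co - August 2021.DOCX",
  "C:/Temp/data/D21 248706 Company Data – XYZ Limited – September 2021.DOCX"
)
df <- data.frame(
  style_name = c("Title", "List Bullet", "List Bullet", "Title", "List Bullet"),
  text       = c("ABC Co", "blah blah", "blah blah", "XYZ Limited", "blah blah")
)

df$record <- NA_character_
is_title <- df$style_name == "Title"

for (file in filenames) {
  # TRUE where the row's text occurs in this path (literal, case-insensitive)
  in_path <- vapply(tolower(df$text), grepl, logical(1),
                    x = tolower(file), fixed = TRUE, USE.NAMES = FALSE)
  hit <- is_title & in_path
  # Touch only the matching rows, so earlier matches are preserved
  df$record[hit] <- file
}
```

If several files match the same row, this version keeps the last one, whereas the loop over inds above keeps the first.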
Upvotes: 1
Reputation: 325
Try this:
# your columns
style_name = c("Title" ,"List Bullet" ,"List Bullet" ,"Title" ,"List Bullet" )
text = c("ABC Co" ,"blah blah" ,"blah blah" ,"XYZ" ,"blah blah" )
# The filenames
filenames = c("C:/Temp/data/D21 248694 Company Data - ABC Co - August 2021.DOCX"
,"C:/Temp/data/D21 248706 Company Data – XYZ Limited – September 2021.DOCX")
# create the data frame
df = data.frame(style_name,text)
# Create the record column
df$record = NA
# For each 'Title' row, store the first file name that contains the text
for(i in seq_len(nrow(df))){
  hits = grepl(df$text[i], filenames)
  df$record[i] = ifelse(df$style_name[i] == "Title" & any(hits), filenames[hits][1], NA)
}
grepl tests whether a pattern (a regular expression by default) occurs in each string. If a string sss occurs in any element of a vector of strings SSS, then sum(grepl(sss, SSS)) > 0.
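One caveat worth illustrating: because grepl interprets its first argument as a regular expression unless fixed = TRUE is given, a company name containing characters such as parentheses may not behave like a plain substring. The paths and the company name below are hypothetical, made up purely for the demonstration.

```r
# Hypothetical file paths, not taken from the question
paths <- c("data/Smith (UK) Ltd - 2021.docx", "data/ABC Co - 2021.docx")

# Default: the parentheses are parsed as a regex group, so the pattern
# looks for the text "Smith UK", which is not present in either path.
regex_hits <- grepl("Smith (UK)", paths)

# fixed = TRUE: plain substring test on the literal text "Smith (UK)"
fixed_hits <- grepl("Smith (UK)", paths, fixed = TRUE)

# any(x) expresses the same check as sum(x) > 0
found <- any(fixed_hits)
```

Here regex_hits is FALSE for both paths, while fixed_hits is TRUE for the first path only.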
Upvotes: 0