fuzzydata
fuzzydata

Reputation: 27

For value in Col 2, replace all values in Col 1 with first occurrence in Col 1

I am cleaning a large dataset and column 1 has several spellings for company name. A company ID in Column 2 is the same across all company spellings, so I would like to search on company ID and replace all company spellings with the first spelling occurrence (it is not that important which one).

I am new to R and data.table syntax, but have tried to iterate through company IDs and use setDT to replace the company name values. I cannot, however, figure out how to replace with the first instance of the company name spelling for only those companies sharing the same company ID.

This is where I am so far:


library(data.table)

#Sample Datatable

CompanyDT <- data.table(Company_Name=c("Froklo","Forklo","Forlko","Corp3","Grap","Garp"), Company_ID=c(1,1,1,2,3,3))

#   Company_Name Company_ID
#1:       Froklo          1
#2:       Forklo          1
#3:       Forlko          1
#4:        Corp3          2
#5:         Grap          3
#6:         Garp          3

Loop I am working on

for(j in CompanyDT[,.(Company_ID)])
  FirstFacName <- CompanyDT[Company_ID[j], Company_Name]
  setDT(CompanyDT)[, Company_Name:=FirstFacName]

I want this result, with the first spelling instance replacing the names of all companies with the same company ID:

#   Company_Name Company_ID
#1:       Froklo          1
#2:       Froklo          1
#3:       Froklo          1
#4:        Corp3          2
#5:         Grap          3
#6:         Grap          3

but I get this result, with the first company name in the table replacing all company names, regardless of company ID:

#  Company_Name Company_ID
#1:       Froklo          1
#2:       Froklo          1
#3:       Froklo          1
#4:       Froklo          2
#5:       Froklo          3
#6:       Froklo          3

Upvotes: 0

Views: 43

Answers (1)

zerocool
zerocool

Reputation: 369

The dplyr way would be:

library(dplyr)
CompanyDT %>% 
  group_by(Company_ID) %>%
  mutate(Company_Name_new = first(Company_Name))

# A tibble: 6 x 3
# Groups:   Company_ID [3]
  Company_Name Company_ID Company_Name_new
  <chr>             <dbl> <chr>           
1 Froklo                1 Froklo          
2 Forkslo               1 Froklo          
3 Forlko                1 Froklo          
4 Corp3                 2 Corp3           
5 Grap                  3 Grap            
6 Garp                  3 Grap 

Upvotes: 1

Related Questions