Yun Hyunsoo
Yun Hyunsoo

Reputation: 71

replace value of a column with another column

replace value of a column with another column I am working on a pretreatment process of a data set. Here is an over view of my dataset

  BSTN ASTN  
  150   216
  300   222
  4310  35
   60  4310 
  150   23
  4310  60
   .     .
   .     . 
   .     .

The numbers BSTN ASTN indicate code number of stations I have a another data set of BSTN,ASTN numbers that show station numbers. Here is what it looks like

  Station code Station  
  150             A
  300             A
  4310            B
   23             C
   60             C 

   .              .
   .              . 
   .              .

The problem is that some stations have two or more codes. for example station code for station A is 150, 300

I would like to unify the stations numbers in BSTN and ASTN which have multiple station numbers into one unified station number. I would also like to unify the numbers of the code in BSTN, ASTN into a number with more frequency

  BSTN ASTN  
  150   216
  150   222
  4310  35
   60  4310 
  150   60
  4310  60
   .     .
   .     . 
   .     .

The final output I want would look like below.

I would appreciate some help

Upvotes: 0

Views: 58

Answers (1)

David T
David T

Reputation: 2143

Let's get in your data

library(readr)    
stations <- read_table(
  "BSTN  ASTN  
  150   216
  300   222
 4310    35
   60  4310 
  150    23
 4310    60")

stCodex <- read_table(
  "Station_code Station  
  150             A
  300             A
  4310            B
   23             C
   60             C "
)

Let's put both columns into a single tibble, so we can find out what's the most popular Station_code for each Station

st2 <- tibble(replaceBy = c(stations$BSTN, stations$ASTN)) %>% 
  left_join(stCodex, by = c("replaceBy" = "Station_code")) %>% 
  filter(!is.na(Station)) %>% 
  group_by(Station, replaceBy) %>% 
  summarise(Count = n()) %>% 
   # We have a count of each Station's Station_codes.  Find the most frequent
  mutate(BestCount = max(Count)) %>% 
  filter(Count == BestCount) %>% 
  summarise_all(~ first(.)) %>% 
  arrange(Station) %>% 
  select(Station, replaceBy)

That looks like:

# A tibble: 3 x 2
  Station replaceBy
  <chr>       <dbl>
1 A             150
2 B            4310
3 C              60

Build a replacement table - to replace the low-freq Station_codes with higher ones.

rplc <- stCodex %>% 
  left_join(st2, by = "Station") %>% 
  filter(Station_code != replaceBy) %>% 
  select(- Station)

Giving

# A tibble: 2 x 2
  Station_code replaceBy
         <dbl>     <dbl>
1          300       150
2           23        60 

Do the replacements - first in BSTN, then in ASTN

stations %>% 
  left_join(rplc, by = c(BSTN ="Station_code")) %>% 
  mutate(BSTN = ifelse(is.na(replaceBy), BSTN, replaceBy)) %>% 
  select(- replaceBy) %>% 
  left_join(rplc, by = c(ASTN ="Station_code")) %>% 
  mutate(ASTN = ifelse(is.na(replaceBy), ASTN, replaceBy)) %>% 
  select(- replaceBy)

The answer is

# A tibble: 6 x 2
   BSTN  ASTN
  <dbl> <dbl>
1   150   216
2   150   222
3  4310    35
4    60  4310
5   150    60
6  4310    60

Upvotes: 1

Related Questions