Nagarjun Srinivas
Nagarjun Srinivas

Reputation: 25

R data manipulation. Finding a column value and manipulating

I have a data frame like this, where column 1 has different values and ends with a country.

Col1     col2  Col3  col4 
    A        0    0     1
    B        1    0     3 
    c        4    0     6
    D        5    6     7
    China    na   na    na 
    A        0    1     3
    B        2    4     5
    C        3    5     6
    D        1     2    3 
    E        5    3     3 
    England  na    na   na 

I want to achieve something like the following, so I need to create a new column with the country name manipulated until the row where the country name is is matched.

Col1     col2  Col3  col4 col5
A        0    0     1     China 
B        1    0     3     china 
c        4    0     6     china 
D        5    6     7     china 
China    na   na    na    china 
A        0    1     3     England
B        2    4     5     England
C        3    5     6     England
D        1    2    3      England 
E        5    3     3     England
England  na    na   na    England

...plus I have 40 other countries to work with. I am new to R and struggling with on how to achieve my desired result.

Upvotes: 2

Views: 109

Answers (4)

akrun
akrun

Reputation: 887088

We can use replace with na.locf

library(zoo)
df1$Col5 <- with(df1, na.locf(replace(Col1, nchar(Col1)==1, NA), fromLast=TRUE))
df1$Cpl5
#[1] "China"   "China"   "China"   "China"   "China"   "England" "England"
#[8] "England" "England" "England" "England"

Or another option is data.table

library(data.table)
setDT(df1)[, Col5 := Col1[.N], cumsum(shift(nchar(Col1)>1, fill = TRUE))]
df1
#       Col1 col2 Col3 col4    Col5
# 1:       A    0    0    1   China
# 2:       B    1    0    3   China
# 3:       c    4    0    6   China
# 4:       D    5    6    7   China
# 5:   China   na   na   na   China
# 6:       A    0    1    3 England
# 7:       B    2    4    5 England
# 8:       C    3    5    6 England
# 9:       D    1    2    3 England
#10:       E    5    3    3 England
#11: England   na   na   na England

Upvotes: 0

Chirayu Chamoli
Chirayu Chamoli

Reputation: 2076

Another hacky method: Using Col1 as starting point, make all the observations having a single character NA. Then fill the NAs with the last value.

library(zoo)

df$col5 = df$Col1
df$col5 = ifelse(nchar(as.character(df$col5)) > 1, as.character(df$col5), NA)
df$col5 = na.locf(as.character(df$col5), fromLast = TRUE)

Upvotes: 1

alistaire
alistaire

Reputation: 43334

A tidyverse option:

library(tidyverse)

       # convert all to appropriate types
df %>% mutate_all(as.character) %>% type_convert(na = 'na') %>% 
    # add column with Col1 value if a row only has one non-NA value, else NA
    mutate(country = ifelse(rowSums(!is.na(.)) == 1, Col1, NA)) %>% 
    fill(country, .direction = 'up') %>%    # replace NAs upwards with last non-NA value
    filter(complete.cases(.))    # subset to rows with no NAs

##   Col1 col2 Col3 col4 country
## 1    A    0    0    1   China
## 2    B    1    0    3   China
## 3    c    4    0    6   China
## 4    D    5    6    7   China
## 5    A    0    1    3 England
## 6    B    2    4    5 England
## 7    C    3    5    6 England
## 8    D    1    2    3 England
## 9    E    5    3    3 England

Upvotes: 1

Ista
Ista

Reputation: 10437

Start by creating the data structure (in the future you should provide code for doing this in your questions).

exd <- read.table(text = "Col1     col2  Col3  col4 
    A        0    0     1
    B        1    0     3 
    c        4    0     6
    D        5    6     7
    China    NA   NA    NA 
    A        0    1     3
    B        2    4     5
    C        3    5     6
    D        1     2    3 
    E        5    3     3 
    England  NA    NA   NA", header = TRUE)

Next, figure out which rows are the country boundaries, and extract the country names

country_boundary <- nchar(as.character(exd$Col1)) > 1
country_names <- exd$Col1[country_boundary]

Then generate codes for each country, shifting by one.

country_number <- c(0, cumsum(country_boundary)[-nrow(exd)])

Finally, use the country_number and country_names vectors to create the country ID column.

exd <- data.frame(exd,
                  Col5 = factor(country_number,
                                labels = country_names))

You may wish to remove the (now redundant) boundary rows:

exd <- exd[!country_boundary, ]
exd
#    Col1 col2 Col3 col4    Col5
# 1     A    0    0    1   China
# 2     B    1    0    3   China
# 3     c    4    0    6   China
# 4     D    5    6    7   China
# 6     A    0    1    3 England
# 7     B    2    4    5 England
# 8     C    3    5    6 England
# 9     D    1    2    3 England
# 10    E    5    3    3 England

Upvotes: 3

Related Questions