Reputation: 25
I contain a dataset 'area'
House_No. Info_On_Area
1a Names of neighbouringhouse in 100m 1b 1c 1d 1e
1a Area of neighbouringhouse in 100m 500 1000 1500 300
1a Names of neighbouringhouse in 300m 1b 1c 1d 1e 1f 1g 1h
1a Area of neighbouringhouse in 300m 500 1000 1500 300 600 400 2000
2a Names of neighbouringhouse in 100m 2b 2c 2d 2e
2a Area of neighbouringhouse in 100m 500 1000 1500 300
2a Names of neighbouringhouse in 300m 2b 2c 2d 2e 2f 2g 2h
2a Area of neighbouringhouse in 300m 500 1000 1500 300 600 400 2000
I want to create a dataframe where I can have the table to appear as
House_No. Area of neighbouringhouse in 100m Area of neighbouringhouse in 300m
I used dplyr and grouped the different house numbers
CT <- data.frame(but %>% group_by(House_No.
)) and tried using rowSums. However, I got the error saying that the information is not numeric. I think that is because I need to make the numbers in the row values as numbers and I am not sure of how to do so.
I got stuck at this stage and couldnt proceed further.
I did look into similar solutions but they dont seem to be having a data frame where they are working towards the summing row values such as Sum rows in data.frame or matrix, Sum by Rows in R.
I would be grateful for any help! Thank you :)
Upvotes: 0
Views: 143
Reputation: 42544
The difficulty here is that the information is presented in a mixture of wide and long format. Info_On_Area
is a character column which contains the variable name as well as an arbitrary number of values separated by white space. Therefore, Info_On_Area
needs to be splitted in two steps. First, to extract the variable name and second to extract the numbers for subsequent conversion to numeric and summation.
Fortunately, the OP is only interested in the area information which simplifies matters.
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
area %>%
filter(Info_On_Area %>% str_detect("^Area")) %>%
separate(Info_On_Area, c("var", "val"), sep = "(?<=00m)") %>%
mutate(Area = map_int(val, ~ str_extract_all(. , "\\d+") %>% unlist() %>% as.integer() %>% sum())) %>%
pivot_wider(id_cols = House_No., names_from = var, values_from = Area)
# A tibble: 2 x 3 House_No. `Area of neighbouringhouse in 100m` `Area of neighbouringhouse in 300m` <chr> <int> <int> 1 1a 3300 6300 2 2a 3300 6300
The result has one row for each House_No.
. This is different to A. Suliman's solution which shows two rows for each (no longer in the edited version of A. Suliman's answer). Other differences include the use of the House_No.
separate()
and pivot_wider()
functions, a regular expression with lookbehind "(?<=00m)"
, and to apply filter()
as the first step in the pipeline.
For the sake of completeness, here is also a data.table
solution:
library(data.table)
library(magrittr)
setDT(area)[Info_On_Area %like% "^Area",
c(.(House_No.= House_No.), tstrsplit(Info_On_Area, "(?<=00m)", perl = TRUE))][
, str_extract_all(V3, "\\d+") %>% unlist() %>% as.integer() %>% sum(), by = .(House_No., V2)][
, dcast(.SD, House_No. ~ V2, value.var = "V1")]
House_No. Area of neighbouringhouse in 100m Area of neighbouringhouse in 300m 1: 1a 3300 6300 2: 2a 3300 6300
area <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a",
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m 1b 1c 1d 1e",
"Area of neighbouringhouse in 100m 500 1000 1500 300", "Names of neighbouringhouse in 300m 1b 1c 1d 1e 1f 1g 1h",
"Area of neighbouringhouse in 300m 500 1000 1500 300 600 400 2000",
"Names of neighbouringhouse in 100m 2b 2c 2d 2e", "Area of neighbouringhouse in 100m 500 1000 1500 300",
"Names of neighbouringhouse in 300m 2b 2c 2d 2e 2f 2g 2h",
"Area of neighbouringhouse in 300m 500 1000 1500 300 600 400 2000"
)), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 2
Reputation: 13125
Use stringr::str_extract_*
to retreive the digits then do spread
using pivot_wider
library(tidyverse)
df %>%
#extract everything up to 1+ digits followed by m
mutate(flag = str_extract(Info_On_Area,'.*\\d+m'),
#extract any 1 or more digits followed by space or at the end
SumArea = map_dbl(Info_On_Area, ~sum(as.numeric(str_extract_all(.x, '\\d+(?=\\s|$)', simplify = TRUE))))) %>%
filter(str_detect(Info_On_Area, 'Area')) %>%
#As suggested by @Uwe
pivot_wider(id_cols = House_No., names_from = flag, values_from = SumArea)
# A tibble: 2 x 3
House_No. `Area of neighbouringhouse in 100m` `Area of neighbouringhouse in 300m`
<chr> <dbl> <dbl>
1 1a 3300 6300
2 2a 3300 6300
Data
df <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a",
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m 1b 1c 1d 1e",
"Area of neighbouringhouse in 100m 500 1000 1500 300", "Names of neighbouringhouse in 300m 1b 1c 1d 1e 1f 1g 1h",
"Area of neighbouringhouse in 300m 500 1000 1500 300 600 400 2000",
"Names of neighbouringhouse in 100m 2b 2c 2d 2e", "Area of neighbouringhouse in 100m 500 1000 1500 300",
"Names of neighbouringhouse in 300m 2b 2c 2d 2e 2f 2g 2h",
"Area of neighbouringhouse in 300m 500 1000 1500 300 600 400 2000"
)), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 3