Reputation: 1
I am constructing a panel dataset from an original data which contains for each row, a company (name) and its sales across 10 years.
In concrete, it loos like this:
The panel dataset I am building has to look like this:
So far, I have the panel dataset with all companies but only with the years when they have sales.
For each of the company that has stopped sales after showing positive sales (there is a "-" in year y after sales in years x, x+1) I need to add a row copying the info about the company (the whole row: name, sales, year) and add a 1 in the column "country exit". In the example above, I would have to do what has been done for company D in the last row in the second picture.
How can I avoid doing that manually on R studio, as there are approximately 250 companies with this case in the dataset?
Thanks
I've tried some functions on r but unable to perform it in a simple way and make it easy to do for each data.
Upvotes: 0
Views: 276
Reputation: 1136
This is an example using tidyverse
. Let's say this is your dataframe:
df <- structure(list(company_name = c("Company A", "Company B", "Company C",
"Company D"), `2004` = c(NA, NA, NA, NA), `2005` = c(NA, NA,
NA, NA), `2006` = c(NA, NA, NA, NA), `2007` = c(NA, NA, NA, NA
), `2008` = c(NA, NA, NA, NA), `2009` = c(NA, NA, NA, NA), `2010` = c(NA,
NA, NA, NA), `2011` = c(NA, NA, NA, NA), `2012` = c(0, NA, 0.2,
0.1), `2013` = c(0, 0.1, 0.3, NA), `2014` = c(0, 0.1, 0.5, NA
)), class = "data.frame", row.names = c(NA, -4L))
> df
company_name 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
1 Company A NA NA NA NA NA NA NA NA 0.0 0.0 0.0
2 Company B NA NA NA NA NA NA NA NA NA 0.1 0.1
3 Company C NA NA NA NA NA NA NA NA 0.2 0.3 0.5
4 Company D NA NA NA NA NA NA NA NA 0.1 NA NA
First thing to do is applying pivot_longer
on all columns but company name
to create the year
column:
library(tidyverse)
df2 <- df %>% pivot_longer(-company_name, names_to = 'year', values_to = 'sales')
> df2
# A tibble: 44 x 3
company_name year sales
<chr> <chr> <dbl>
1 Company A 2004 NA
2 Company A 2005 NA
3 Company A 2006 NA
4 Company A 2007 NA
5 Company A 2008 NA
6 Company A 2009 NA
7 Company A 2010 NA
8 Company A 2011 NA
9 Company A 2012 0
10 Company A 2013 0
# ... with 34 more rows
Make sure the dataframe is arranged correctly (by company_name
and year
), group_by
company name, and for each company check: if sales
in a row equals NA
, but in the previous row it's larger than 0, put 1 in the new column Country Exit
:
df2 <- df2 %>%
arrange(company_name, year) %>%
group_by(company_name) %>%
mutate(`Country Exit` = ifelse(is.na(sales)&lag(sales) != 0, 1, 0))
> df2
# A tibble: 44 x 4
# Groups: company_name [4]
company_name year sales `Country Exit`
<chr> <chr> <dbl> <dbl>
1 Company A 2004 NA NA
2 Company A 2005 NA NA
3 Company A 2006 NA NA
4 Company A 2007 NA NA
5 Company A 2008 NA NA
6 Company A 2009 NA NA
7 Company A 2010 NA NA
8 Company A 2011 NA NA
9 Company A 2012 0 0
10 Company A 2013 0 0
# ... with 34 more rows
And to get the cleaner output, like the one you are mentioning, just remove NA
s from Country Exit
:
df2 %>% filter(!is.na(`Country Exit`))
# A tibble: 10 x 4
# Groups: company_name [4]
company_name year sales `Country Exit`
<chr> <chr> <dbl> <dbl>
1 Company A 2012 0 0
2 Company A 2013 0 0
3 Company A 2014 0 0
4 Company B 2013 0.1 0
5 Company B 2014 0.1 0
6 Company C 2012 0.2 0
7 Company C 2013 0.3 0
8 Company C 2014 0.5 0
9 Company D 2012 0.1 0
10 Company D 2013 NA 1
Upvotes: 0