Reputation: 95
I have the following data frame, which was sourced from a text file:
Account_No Title Date
52683 DESIGN IN THE TERRAIN OF WATER / MATHUR,ANURADHA. 8/03/2019
6224 KABIR IN MALWA 29/04/2015
25801 A VILLAGE IS A BUSY PLACE / GEETHA,V. 5/06/2020
11439 KABIR IN AMERICA 29/04/2015
25802 A VILLAGE IS A BUSY PLACE / GEETHA,V. 5/06/2020
7843 IN EVERY BODY KABIR 29/04/2015
13013 MOBY-DICK : A POP-UP BOOK / ITA,SAM. 22/01/2020
38110 DEVELOPING SUBJECT KNOWLEDGE IN DESIGN AND 29/11/2010
TECHNOLOGY : STRUCTURES / OWEN-JACKSON, GWYNETH &
MYERSO.
38118 SCIENCE COMMUNICATION IN THEORY AND PRACTICE 6/12/2010
/ STOCKLMAYER, SUSAN M Et al (ED.
7844 KABIR IN THUMRI 29/04/2015
25042 TRAVELLING CONCEPTS IN THE HUMANITIES - A ROUGH 13/04/2018
GUIDE / BAL,MIEKE.
001.3 BAL
In excel, I used Data to Columns to create the following table using a fixed width delimiter:
Account_No | Title | Date |
---|---|---|
52683 | DESIGN IN THE TERRAIN OF WATER / MATHUR,ANURADHA. | 08-03-2019 |
6224 | KABIR IN MALWA | 29-04-2015 |
25801 | A VILLAGE IS A BUSY PLACE / GEETHA,V. | 05-06-2020 |
11439 | KABIR IN AMERICA | 29-04-2015 |
25802 | A VILLAGE IS A BUSY PLACE / GEETHA,V. | 05-06-2020 |
7843 | IN EVERY BODY KABIR | 29-04-2015 |
13013 | MOBY-DICK : A POP-UP BOOK / ITA,SAM. | 22-01-2020 |
38110 | DEVELOPING SUBJECT KNOWLEDGE IN DESIGN AND | 29-11-2010 |
TECHNOLOGY : STRUCTURES / OWEN-JACKSON, GWYNETH & | ||
MYERSO. | ||
38118 | SCIENCE COMMUNICATION IN THEORY AND PRACTICE | 06-12-2010 |
/ STOCKLMAYER, SUSAN M Et al (ED. | ||
7844 | KABIR IN THUMRI | 29-04-2015 |
25042 | TRAVELLING CONCEPTS IN THE HUMANITIES - A ROUGH | 13-04-2018 |
GUIDE / BAL,MIEKE. | ||
001.3 BAL | ||
24655 | EVALUTING RESEARCH : METHODOLOGY FOR PEOPLE WHO | 05-11-2019 |
NEED TO READ RESEARCH / DANE,FRANCIS C. | ||
001.4 DAN | ||
30170 | CASE STUDY RESEARCH METHODS / GILLHAM,BILL. | 03-11-2011 |
The problem with this is that the some books have their titles in multiple rows (such as "DEVELOPING SUBJECT KNOWLEDGE IN DESIGN AND...", where account number is 38110), whereas they should be in a single row with the corresponding Account_No.
How can I do this?
Dput of the data:
structure(list(Account_No = c("52683", "6224", "25801", "11439",
"25802", "7843", "13013", "38110", "", "", "38118", "", "7844",
"25042", "", "", "24655", "", "", "30170"), Title = c("DESIGN IN THE TERRAIN OF WATER / MATHUR,ANURADHA.",
"KABIR IN MALWA", "A VILLAGE IS A BUSY PLACE / GEETHA,V.", "KABIR IN AMERICA",
"A VILLAGE IS A BUSY PLACE / GEETHA,V.", "IN EVERY BODY KABIR",
"MOBY-DICK : A POP-UP BOOK / ITA,SAM.", "DEVELOPING SUBJECT KNOWLEDGE IN DESIGN AND",
"TECHNOLOGY : STRUCTURES / OWEN-JACKSON, GWYNETH &", "MYERSO.",
"SCIENCE COMMUNICATION IN THEORY AND PRACTICE", "/ STOCKLMAYER, SUSAN M Et al (ED.",
"KABIR IN THUMRI", "TRAVELLING CONCEPTS IN THE HUMANITIES - A ROUGH",
"GUIDE / BAL,MIEKE.", "001.3 BAL", "EVALUTING RESEARCH : METHODOLOGY FOR PEOPLE WHO",
"NEED TO READ RESEARCH / DANE,FRANCIS C.", "001.4 DAN", "CASE STUDY RESEARCH METHODS / GILLHAM,BILL."
), Date = c("08-03-2019", "29-04-2015", "05-06-2020", "29-04-2015",
"05-06-2020", "29-04-2015", "22-01-2020", "29-11-2010", "", "",
"06-12-2010", "", "29-04-2015", "13-04-2018", "", "", "05-11-2019",
"", "", "03-11-2011")), row.names = c(NA, 20L), class = "data.frame")
``
Upvotes: 0
Views: 87
Reputation: 78947
Here is how you can do it with tidyverse Explanation: First change blank cells of Account_No to NA then fill the NA with value above until next value then group by Account_No then combine the rows of Title to one row after that just tweak - > to get desired output
library(tidyverse)
df1 <- df %>%
mutate_at(vars(colnames(.)[names(.) %in% "Account_No"]),
.funs = funs(ifelse(.=="", NA, as.character(.)))) %>%
fill(Account_No) %>%
group_by(Account_No) %>%
summarise(Title = paste(Title, collapse = " ")) %>%
ungroup () %>%
left_join(df, df1, by="Account_No") %>%
select (Account_No, Title = Title.x, Date, -Title.y)
Upvotes: 1
Reputation: 906
There might be some pretty way to do this, but this works. I'm running a for
loop but somebody can help me replace it with something more efficient.
The logic behind is, first I'll convert empty strings ""
to NA
, then the column number here needed to be modified is the second column. So I'm running a for
loop from the last row (nrow
will give me the last row number) to the second row i.e I'll do the operation backwards from the last row. So lets say the row number is 10, I'll look for the same row in the n-1
th column which here is 2-1
is 1
. If in the 1st column has NA
for that row then I check the second column the same row and if it is not NA
that means the text is in the next line, so I'll take the text and combine it into the previous row of the second column and I repeat this process until I reach the second row.
It's little tricky for me to explain this in text but the concept is simple.
library(tidyverse)
df <- structure(list(Account_No = c("52683", "6224", "25801", "11439",
"25802", "7843", "13013", "38110", "", "", "38118", "", "7844",
"25042", "", "", "24655", "", "", "30170"), Title = c("DESIGN IN THE TERRAIN OF WATER / MATHUR,ANURADHA.",
"KABIR IN MALWA", "A VILLAGE IS A BUSY PLACE / GEETHA,V.", "KABIR IN AMERICA",
"A VILLAGE IS A BUSY PLACE / GEETHA,V.", "IN EVERY BODY KABIR",
"MOBY-DICK : A POP-UP BOOK / ITA,SAM.", "DEVELOPING SUBJECT KNOWLEDGE IN DESIGN AND",
"TECHNOLOGY : STRUCTURES / OWEN-JACKSON, GWYNETH &", "MYERSO.",
"SCIENCE COMMUNICATION IN THEORY AND PRACTICE", "/ STOCKLMAYER, SUSAN M Et al (ED.",
"KABIR IN THUMRI", "TRAVELLING CONCEPTS IN THE HUMANITIES - A ROUGH",
"GUIDE / BAL,MIEKE.", "001.3 BAL", "EVALUTING RESEARCH : METHODOLOGY FOR PEOPLE WHO",
"NEED TO READ RESEARCH / DANE,FRANCIS C.", "001.4 DAN", "CASE STUDY RESEARCH METHODS / GILLHAM,BILL."
), Date = c("08-03-2019", "29-04-2015", "05-06-2020", "29-04-2015",
"05-06-2020", "29-04-2015", "22-01-2020", "29-11-2010", "", "",
"06-12-2010", "", "29-04-2015", "13-04-2018", "", "", "05-11-2019",
"", "", "03-11-2011")), row.names = c(NA, 20L), class = "data.frame")
roller_coaster <- function(df, col_numb){
for(i in nrow(df):2){
if(is.na(df[i,(col_numb-1)])){
if(!is.na(df[i,(col_numb)])){
paste(df[i-1,col_numb], df[i,col_numb], sep = ' ') -> df[i-1,col_numb]
NA -> df[i,col_numb]
}
}
}
df
}
df %>%
as_tibble() %>%
mutate(across(everything(), na_if, "")) %>%
roller_coaster(2) %>%
drop_na()
#> # A tibble: 13 x 3
#> Account_No Title Date
#> <chr> <chr> <chr>
#> 1 52683 DESIGN IN THE TERRAIN OF WATER / MATHUR,ANURADHA. 08-03-2…
#> 2 6224 KABIR IN MALWA 29-04-2…
#> 3 25801 A VILLAGE IS A BUSY PLACE / GEETHA,V. 05-06-2…
#> 4 11439 KABIR IN AMERICA 29-04-2…
#> 5 25802 A VILLAGE IS A BUSY PLACE / GEETHA,V. 05-06-2…
#> 6 7843 IN EVERY BODY KABIR 29-04-2…
#> 7 13013 MOBY-DICK : A POP-UP BOOK / ITA,SAM. 22-01-2…
#> 8 38110 DEVELOPING SUBJECT KNOWLEDGE IN DESIGN AND TECHNOLOGY : … 29-11-2…
#> 9 38118 SCIENCE COMMUNICATION IN THEORY AND PRACTICE / STOCKLMAY… 06-12-2…
#> 10 7844 KABIR IN THUMRI 29-04-2…
#> 11 25042 TRAVELLING CONCEPTS IN THE HUMANITIES - A ROUGH GUIDE / … 13-04-2…
#> 12 24655 EVALUTING RESEARCH : METHODOLOGY FOR PEOPLE WHO NEED TO … 05-11-2…
#> 13 30170 CASE STUDY RESEARCH METHODS / GILLHAM,BILL. 03-11-2…
Created on 2021-02-11 by the reprex package (v0.3.0)
Upvotes: 1