Varun
Varun

Reputation: 1321

R transform dataframe by parsing columns

Context

I have created a small sample dataframe to explain my problem. The original one is larger, as it has many more columns. But it is formatted in the same way.

df = data.frame(Case1.1.jpeg.text="the",
                Case1.1.jpeg.text.1="big",
                Case1.1.jpeg.text.2="DOG",
                Case1.1.jpeg.text.3="10197",
                
                Case1.2.png.text="framework",
                
                Case1.3.jpg.text="BE",
                Case1.3.jpg.text.1="THE",
                Case1.3.jpg.text.2="Change",
                Case1.3.jpg.text.3="YOUWANTTO",
                Case1.3.jpg.text.4="SEE",
                Case1.3.jpg.text.5="in",
                Case1.3.jpg.text.6="theWORLD",
                
                Case1.4.png.text="09.80.56.60.77")

The dataframe consists of output from a text detection ML model based on a certain number of input images.

The output format makes each word for each image a separate column, thereby creating a very wide dataset.

Desired Output

I am looking to create a cleaner version of it, with one column containing the image name (e.g. Case1.2.png) and the second with the concatenation of all possible words that the model finds in that particular image (the number of words varies from image to image).

result = data.frame(Case=c('Case1.1.jpeg','Case1.2.png','Case1.3.jpg','Case1.4.png'),
                Text=c('thebigDOG10197','framework','BETHEChangeYOUWANTTOSEEintheWORLD','09.80.56.60.77'))

I have tried many approaches based on similar questions found on Stackoverflow, but none seem to give me the exact output I'm looking for.

Any help on this would be greatly appreciated.

Upvotes: 0

Views: 68

Answers (4)

LMc
LMc

Reputation: 18632

library(tidyr)
library(dplyr)

df %>% 
  pivot_longer(cols = everything(),
               names_pattern = "(.*)\\.(text.*)",
               names_to = c("Case", NA)) %>% 
  group_by(Case) %>% 
  summarize(value = paste(value, collapse = ""), .groups = "drop")

Alternatively, this can be accomplished using just the pivot functions from tidyr:

library(tidyr)
library(stringr)

df %>% 
  pivot_longer(cols = everything(),
               names_pattern = "(.*)\\.(text).*",
               names_to = c("Case", "cols")) %>% 
  pivot_wider(id_cols = Case,
              values_from = value,
              names_from = cols,
              values_fn = str_flatten)

Output

  Case         value                            
  <chr>        <chr>                            
1 Case1.1.jpeg thebigDOG10197                   
2 Case1.2.png  framework                        
3 Case1.3.jpg  BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png  09.80.56.60.77  

Upvotes: 1

langtang
langtang

Reputation: 24722

You can use pivot_longer(everything()), manipulate the "Case" column, group, and paste together:

pivot_longer(df,everything(),names_to="Case") %>%
  mutate(Case = str_remove_all(Case, ".text.*")) %>%
  group_by(Case) %>% summarize(Text=paste(value, collapse=""))

Output:

  Case         Text                             
  <chr>        <chr>                            
1 Case1.1.jpeg thebigDOG10197                   
2 Case1.2.png  framework                        
3 Case1.3.jpg  BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png  09.80.56.60.77           

Upvotes: 1

akrun
akrun

Reputation: 887078

An option in base R is stack the data into a two column data.frame with stack and then do a group by paste with aggregate

aggregate(cbind(Text = values) ~ Case, transform(stack(df), 
  Case = trimws(ind, whitespace = "\\.text.*")), FUN = paste, collapse = "")
          Case                              Text
1 Case1.1.jpeg                    thebigDOG10197
2  Case1.2.png                         framework
3  Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4  Case1.4.png                    09.80.56.60.77

Upvotes: 1

PaulS
PaulS

Reputation: 25323

A possible solution:

library(tidyverse)

df %>% 
  pivot_longer(everything()) %>% 
  mutate(name = str_remove(name, "\\.text\\.*\\d*")) %>% 
  group_by(name) %>% 
  summarise(text = str_c(value, collapse = ""))

#> # A tibble: 4 x 2
#>   name         text                             
#>   <chr>        <chr>                            
#> 1 Case1.1.jpeg thebigDOG10197                   
#> 2 Case1.2.png  framework                        
#> 3 Case1.3.jpg  BETHEChangeYOUWANTTOSEEintheWORLD
#> 4 Case1.4.png  09.80.56.60.77

Upvotes: 1

Related Questions