Yash
Yash

Reputation: 31

Convert data with one column and multiple rows into multi column multi row data

I have an output of web scraped data in R which looks as follows

Name1
Email: [email protected]
City/Town: Location1
Name2
Email: [email protected]
City/Town: Location2
Name3
Email: [email protected]
City/Town: Location3

Some names may not have email or location. I want convert above data into tabular format. The output should look like

Name      Email           City/Town
Name1   [email protected]  Location1
Name2   [email protected]  Location2
Name3   [email protected]  Location3
Name4                   Location4
Name5   [email protected]  

Upvotes: 3

Views: 263

Answers (5)

G. Grothendieck
G. Grothendieck

Reputation: 269371

Insert \nName: before each Name and then read it in using read.dcf (If the data comes from a file replace textConnection(Lines) with the filename, e.g. "myfile.dat", in the first line of code.) No packages are used.

L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))

giving the following using the input in the Note at the end:

     Name    Email            City/Town  
[1,] "Name1" "[email protected]" "Location1"
[2,] "Name2" NA               "Location2"
[3,] "Name3" "[email protected]" NA         

Note: Input used. This is slightly modified from question to show that it works if Email or City/Town is missing:

Lines <- "Name1
Email: [email protected]
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: [email protected]"

Upvotes: 4

Axeman
Axeman

Reputation: 35177

Benchmarks:

Code:

txt2 <- c("John Doe", "Email: [email protected]", "Email: [email protected]", "City/Town: Location1", "Save the World Fund", 
          "Email: [email protected]", "City/Town: Location2", "Best Shoes Ltd.", "Email: [email protected]", 
          "City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", 
          "Email: [email protected]")

library(microbenchmark)
library(data.table)
library(dplyr)
library(tidyr)

microbenchmark(ans.uwe = data.table(text = txt2)[, tstrsplit(text, ": ")
                                                 ][is.na(V2), Name := V1
                                                   ][, Name := zoo::na.locf(Name)
                                                     ][!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")],
               ans.zx8754 = data_frame(txt = txt2) %>% 
                 mutate(txt = ifelse(grepl(":", txt), txt, paste("Name:", txt)),
                        rn = row_number()) %>% 
                 separate(txt, into = c("mytype", "mytext"), sep = ":") %>% 
                 spread(key = mytype, value = mytext) %>% 
                 select(-rn) %>% 
                 fill(Name) %>% 
                 group_by(Name) %>% 
                 fill(1:2, .direction = "down") %>% 
                 fill(1:2, .direction = "up") %>% 
                 unique() %>% 
                 ungroup() %>% 
                 select(3:1),
               ans.jaap = data.table(txt = txt2)[!grepl(':', txt), name := txt
                                                 ][, name := zoo::na.locf(name)
                                                   ][grepl('^Email:', txt), email := sub('Email: ','',txt)
                                                     ][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
                                                       ][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')],
               ans.G.Grothendieck = {
                 L <- trimws(readLines(textConnection(txt2)))
                 ix <- !grepl(":", L)
                 L[ix] <- paste("\nName:", L[ix])
                 read.dcf(textConnection(L))},
               times = 1000)

Result:

Unit: microseconds
               expr       min         lq       mean     median        uq        max neval  cld
            ans.uwe  4243.754  4885.4765  5305.8688  5139.0580  5390.360  92604.820  1000   c 
         ans.zx8754 39683.911 41771.2925 43940.7646 43168.4870 45291.504 130965.088  1000    d
           ans.jaap  2153.521  2488.0665  2788.8250  2640.1580  2773.150  91862.177  1000  b  
 ans.G.Grothendieck   266.268   304.0415   332.6255   331.8375   349.797    721.261  1000 a   

Upvotes: 2

zx8754
zx8754

Reputation: 56004

Using dplyr and tidyr, tested on both data provided by @Jaap txt, and by @UweBlock txt1:

library(dplyr)
library(tidyr)

# data_frame(txt = txt1) %>%     
data_frame(txt = txt) %>% 
  mutate(txt = if_else(grepl(":", txt), txt, paste("Name:", txt)),
         rn = row_number()) %>% 
  separate(txt, into = c("mytype", "mytext"), sep = ":") %>% 
  spread(key = mytype, value = mytext) %>% 
  select(-rn) %>% 
  fill(Name) %>% 
  group_by(Name) %>% 
  fill(1:2, .direction = "down") %>% 
  fill(1:2, .direction = "up") %>% 
  unique() %>% 
  ungroup() %>% 
  select(3:1)

# # A tibble: 5 x 3
#     Name           Email `City/Town`
#    <chr>           <chr>       <chr>
# 1  Name1  [email protected]   Location1
# 2  Name2  [email protected]   Location2
# 3  Name3  [email protected]   Location3
# 4  Name4            <NA>   Location4
# 5  Name5  [email protected]        <NA>

Notes:

  • See here why we need rn.
  • Hoping someone will suggest better/simpler code using only tidyverse.

Upvotes: 3

Uwe
Uwe

Reputation: 42544

The input data offer several challenges:

  • Data are given as a straight character vector, not a as data.frame with predefined columns.
  • The rows partly consist of key/value pairs which are separated by ": "
  • The other rows act as section headers. All key/value pairs in the rows below belong to one section until the next header is reached.

The code below is relying only on two assumptions:

  1. key/value pairs contain one and only one ": "
  2. section headers none at all.

Multiple keys in a section, e.g., multiple rows with email adresses are handled by specifying toString() as aggregation function to dcast().

library(data.table)
# coerce to data.table
data.table(text = txt)[
  # split key/value pairs in columns
  , tstrsplit(text, ": ")][
    # pick section headers and create new column 
    is.na(V2), Name := V1][
      # fill in Name into the rows below
      , Name := zoo::na.locf(Name)][
        # reshape key/value pairs from long to wide format using Name as row id
        !is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")]
    Name City/Town          Email
1: Name1 Location1 [email protected]
2: Name2 Location2 [email protected]
3: Name3 Location3 [email protected]
4: Name4 Location4             NA
5: Name5        NA [email protected]

Data

txt <- c("Name1", "Email: [email protected]", "City/Town: Location1", "Name2", 
"Email: [email protected]", "City/Town: Location2", "Name3", "Email: [email protected]", 
"City/Town: Location3", "Name4", "City/Town: Location4", "Name5", 
"Email: [email protected]")

Or, try somewhat more "realistic" names

txt1 <- c("John Doe", "Email: [email protected]", "City/Town: Location1", "Save the World Fund", 
"Email: [email protected]", "City/Town: Location2", "Best Shoes Ltd.", "Email: [email protected]", 
"City/Town: Location3", "Mother", "City/Town: Location4", "Jane", 
"Email: [email protected]")

which will result in:

                  Name City/Town          Email
1:     Best Shoes Ltd. Location3 [email protected]
2:                Jane        NA [email protected]
3:            John Doe Location1 [email protected]
4:              Mother Location4             NA
5: Save the World Fund Location2 [email protected]

Or, with multiple keys per section

txt2 <- c("John Doe", "Email: [email protected]", "Email: [email protected]", "City/Town: Location1", "Save the World Fund", 
"Email: [email protected]", "City/Town: Location2", "Best Shoes Ltd.", "Email: [email protected]", 
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane", 
"Email: [email protected]")
                  Name             City/Town                          Email
1:     Best Shoes Ltd.             Location3                 [email protected]
2:                Jane                                       [email protected]
3:            John Doe             Location1 [email protected], [email protected]
4:              Mother Location4, everywhere                               
5: Save the World Fund             Location2                 [email protected]

Upvotes: 3

Jaap
Jaap

Reputation: 83215

Using:

txt <- readLines(txt)

library(data.table)
library(zoo)

dt <- data.table(txt = txt)

dt[!grepl(':', txt), name := txt
   ][, name := na.locf(name)
     ][grepl('^Email:', txt), email := sub('Email: ','',txt)
       ][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
         ][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')]

gives:

    name          email city_town
1: Name1 [email protected] Location1
2: Name2 [email protected] Location2
3: Name3 [email protected] Location3
4: Name4                Location4
5: Name5 [email protected]          

This also works with real names. With the data of @uweBlock you'll get:

                  name          email city_town
1:            John Doe [email protected] Location1
2: Save the World Fund [email protected] Location2
3:     Best Shoes Ltd. [email protected] Location3
4:              Mother                Location4
5:                Jane [email protected]

And with multiple keys per section (again with @UweBlock's data):

                  name                          email             city_town
1:            John Doe [email protected], [email protected]             Location1
2: Save the World Fund                 [email protected]             Location2
3:     Best Shoes Ltd.                 [email protected]             Location3
4:              Mother                                Location4, everywhere
5:                Jane                 [email protected]

Used data:

txt <- textConnection("Name1
Email: [email protected]
City/Town: Location1
Name2
Email: [email protected]
City/Town: Location2
Name3
Email: [email protected]
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: [email protected]")

Upvotes: 4

Related Questions