Reputation: 31
I have an output of web scraped data in R which looks as follows
Name1
Email: [email protected]
City/Town: Location1
Name2
Email: [email protected]
City/Town: Location2
Name3
Email: [email protected]
City/Town: Location3
Some names may not have email or location. I want convert above data into tabular format. The output should look like
Name Email City/Town
Name1 [email protected] Location1
Name2 [email protected] Location2
Name3 [email protected] Location3
Name4 Location4
Name5 [email protected]
Upvotes: 3
Views: 263
Reputation: 269371
Insert \nName:
before each Name and then read it in using read.dcf
(If the data comes from a file replace textConnection(Lines)
with the filename, e.g. "myfile.dat"
, in the first line of code.) No packages are used.
L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))
giving the following using the input in the Note at the end:
Name Email City/Town
[1,] "Name1" "[email protected]" "Location1"
[2,] "Name2" NA "Location2"
[3,] "Name3" "[email protected]" NA
Note: Input used. This is slightly modified from question to show that it works if Email or City/Town is missing:
Lines <- "Name1
Email: [email protected]
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: [email protected]"
Upvotes: 4
Reputation: 35177
txt2 <- c("John Doe", "Email: [email protected]", "Email: [email protected]", "City/Town: Location1", "Save the World Fund",
"Email: [email protected]", "City/Town: Location2", "Best Shoes Ltd.", "Email: [email protected]",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: [email protected]")
library(microbenchmark)
library(data.table)
library(dplyr)
library(tidyr)
microbenchmark(ans.uwe = data.table(text = txt2)[, tstrsplit(text, ": ")
][is.na(V2), Name := V1
][, Name := zoo::na.locf(Name)
][!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")],
ans.zx8754 = data_frame(txt = txt2) %>%
mutate(txt = ifelse(grepl(":", txt), txt, paste("Name:", txt)),
rn = row_number()) %>%
separate(txt, into = c("mytype", "mytext"), sep = ":") %>%
spread(key = mytype, value = mytext) %>%
select(-rn) %>%
fill(Name) %>%
group_by(Name) %>%
fill(1:2, .direction = "down") %>%
fill(1:2, .direction = "up") %>%
unique() %>%
ungroup() %>%
select(3:1),
ans.jaap = data.table(txt = txt2)[!grepl(':', txt), name := txt
][, name := zoo::na.locf(name)
][grepl('^Email:', txt), email := sub('Email: ','',txt)
][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')],
ans.G.Grothendieck = {
L <- trimws(readLines(textConnection(txt2)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))},
times = 1000)
Unit: microseconds
expr min lq mean median uq max neval cld
ans.uwe 4243.754 4885.4765 5305.8688 5139.0580 5390.360 92604.820 1000 c
ans.zx8754 39683.911 41771.2925 43940.7646 43168.4870 45291.504 130965.088 1000 d
ans.jaap 2153.521 2488.0665 2788.8250 2640.1580 2773.150 91862.177 1000 b
ans.G.Grothendieck 266.268 304.0415 332.6255 331.8375 349.797 721.261 1000 a
Upvotes: 2
Reputation: 56004
Using dplyr and tidyr, tested on both data provided by @Jaap txt
, and by @UweBlock txt1
:
library(dplyr)
library(tidyr)
# data_frame(txt = txt1) %>%
data_frame(txt = txt) %>%
mutate(txt = if_else(grepl(":", txt), txt, paste("Name:", txt)),
rn = row_number()) %>%
separate(txt, into = c("mytype", "mytext"), sep = ":") %>%
spread(key = mytype, value = mytext) %>%
select(-rn) %>%
fill(Name) %>%
group_by(Name) %>%
fill(1:2, .direction = "down") %>%
fill(1:2, .direction = "up") %>%
unique() %>%
ungroup() %>%
select(3:1)
# # A tibble: 5 x 3
# Name Email `City/Town`
# <chr> <chr> <chr>
# 1 Name1 [email protected] Location1
# 2 Name2 [email protected] Location2
# 3 Name3 [email protected] Location3
# 4 Name4 <NA> Location4
# 5 Name5 [email protected] <NA>
Notes:
rn
. Upvotes: 3
Reputation: 42544
The input data offer several challenges:
": "
The code below is relying only on two assumptions:
": "
Multiple keys in a section, e.g., multiple rows with email adresses are handled by specifying toString()
as aggregation function to dcast()
.
library(data.table)
# coerce to data.table
data.table(text = txt)[
# split key/value pairs in columns
, tstrsplit(text, ": ")][
# pick section headers and create new column
is.na(V2), Name := V1][
# fill in Name into the rows below
, Name := zoo::na.locf(Name)][
# reshape key/value pairs from long to wide format using Name as row id
!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")]
Name City/Town Email 1: Name1 Location1 [email protected] 2: Name2 Location2 [email protected] 3: Name3 Location3 [email protected] 4: Name4 Location4 NA 5: Name5 NA [email protected]
txt <- c("Name1", "Email: [email protected]", "City/Town: Location1", "Name2",
"Email: [email protected]", "City/Town: Location2", "Name3", "Email: [email protected]",
"City/Town: Location3", "Name4", "City/Town: Location4", "Name5",
"Email: [email protected]")
Or, try somewhat more "realistic" names
txt1 <- c("John Doe", "Email: [email protected]", "City/Town: Location1", "Save the World Fund",
"Email: [email protected]", "City/Town: Location2", "Best Shoes Ltd.", "Email: [email protected]",
"City/Town: Location3", "Mother", "City/Town: Location4", "Jane",
"Email: [email protected]")
which will result in:
Name City/Town Email 1: Best Shoes Ltd. Location3 [email protected] 2: Jane NA [email protected] 3: John Doe Location1 [email protected] 4: Mother Location4 NA 5: Save the World Fund Location2 [email protected]
Or, with multiple keys per section
txt2 <- c("John Doe", "Email: [email protected]", "Email: [email protected]", "City/Town: Location1", "Save the World Fund",
"Email: [email protected]", "City/Town: Location2", "Best Shoes Ltd.", "Email: [email protected]",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: [email protected]")
Name City/Town Email 1: Best Shoes Ltd. Location3 [email protected] 2: Jane [email protected] 3: John Doe Location1 [email protected], [email protected] 4: Mother Location4, everywhere 5: Save the World Fund Location2 [email protected]
Upvotes: 3
Reputation: 83215
Using:
txt <- readLines(txt)
library(data.table)
library(zoo)
dt <- data.table(txt = txt)
dt[!grepl(':', txt), name := txt
][, name := na.locf(name)
][grepl('^Email:', txt), email := sub('Email: ','',txt)
][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')]
gives:
name email city_town
1: Name1 [email protected] Location1
2: Name2 [email protected] Location2
3: Name3 [email protected] Location3
4: Name4 Location4
5: Name5 [email protected]
This also works with real names. With the data of @uweBlock you'll get:
name email city_town 1: John Doe [email protected] Location1 2: Save the World Fund [email protected] Location2 3: Best Shoes Ltd. [email protected] Location3 4: Mother Location4 5: Jane [email protected]
And with multiple keys per section (again with @UweBlock's data):
name email city_town 1: John Doe [email protected], [email protected] Location1 2: Save the World Fund [email protected] Location2 3: Best Shoes Ltd. [email protected] Location3 4: Mother Location4, everywhere 5: Jane [email protected]
Used data:
txt <- textConnection("Name1
Email: [email protected]
City/Town: Location1
Name2
Email: [email protected]
City/Town: Location2
Name3
Email: [email protected]
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: [email protected]")
Upvotes: 4