Reputation: 1
I'm using R to create a dataset. I have a list of information below provided as a single column xls. I'd like to change this list into a dataset (data.frame?) with the column names outlined below. I will be getting more xls docs containing new lists but while the column names will always stay the same, the rows below them will change as new data is added (always in the order below). What is the easiest way to automate the data import and add the new information to the existing dataset (assuming they are multiple xls files)?
Name:
Jon Doe
City:
Chicago
State:
IL
Zip:
60007
Angler Class:
Male
Fish Information:
Date Caught:
09/13/1999
Time:
8 pm
Length:
12.00 Inches
Weight:
Not Specified
Girth:
Not Specified
Fish Depth:
9 Feet
Lake Depth:
Not Specified
Kept/Release:
Released
Species:
Bass
Weather Information:
Sky:
Clear
Wind Direction:
South West
Wind Speed:
Strong
Lake Information:
Lake Name:
Pewaukee Lake
County:
Waukesha
State:
Wisconsin
Lake Bottom:
Junk Weeds
Structure:
Not Specified
Lure Information:
Lure Name:
Rapala
Primary Color:
Black
Secondary Color:
Not Specified
Presentation:
Casting
Output I want:
Name | City | State | Zip | Date Caught | Time | Species | Lure Name | Lake Name | State |
---|---|---|---|---|---|---|---|---|---|
Jon Doe | Chicago | IL | 60007 | 09/13/1999 | 8 pm | Bass | Rapala | Pewaukee Lake | Wisconsin |
Jane Doe | Pittsburgh | PA | 15237 | 08/15/1999 | 6 pm | Bass | Jig | Lake St. Clair | Michigan |
*note that "state" is used twice in the list to represent the angler state and the lake state. I'd like to capture it as "angle_state" / "lake_state" on my table.
Upvotes: 0
Views: 63
Reputation: 66415
Here's a tidyverse approach that takes the list, marks it as "header" if it has a colon, fills down the header names, sorts by each header to keep the response, if available, and then pivots wider to the desired format:
library(tidyverse)
my_data %>%
# label headers and fill down
rowid_to_column("row") %>%
mutate(header = str_detect(data, ":"),
header_name = if_else(header, data, NA_character_),
response = if_else(header, NA_character_, data)) %>%
fill(header_name) %>%
# get best row for each header (ie non-NA if it exists)
group_by(header_name) %>%
arrange(response) %>%
slice(1) %>%
ungroup() %>%
# rearrange to original order and spread wider
arrange(row) %>%
select(-c(row:header)) %>%
pivot_wider(names_from = header_name, values_from = response)
Result
# A tibble: 1 x 29
`Name:` `City:` `State:` `Zip:` `Angler Class:` `Fish Informatio… `Date Caught:` `Time:` `Length:` `Weight:` `Girth:` `Fish Depth:` `Lake Depth:` `Kept/Release:` `Species:` `Weather Informat… `Sky:` `Wind Direction… `Wind Speed:` `Lake Informati… `Lake Name:` `County:` `Lake Bottom:` `Structure:` `Lure Informati… `Lure Name:`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Jon Doe Chicago IL 60007 Male NA 09/13/1999 8 pm 12.00 Inc… Not Specif… Not Speci… 9 Feet Not Specified Released Bass NA Clear South West Strong NA Pewaukee La… Waukesha Junk Weeds Not Specifi… NA Rapala
# … with 3 more variables: Primary Color: <chr>, Secondary Color: <chr>, Presentation: <chr>
You should be able to take the output of this and dplyr::row_bind
to other data to grow your table to include multiple people.
Upvotes: 1