jmp88
jmp88

Reputation: 1

using dplyr to create and update dataframe from a single column of data

I'm using R to create a dataset. I have a list of information below provided as a single column xls. I'd like to change this list into a dataset (data.frame?) with the column names outlined below. I will be getting more xls docs containing new lists but while the column names will always stay the same, the rows below them will change as new data is added (always in the order below). What is the easiest way to automate the data import and add the new information to the existing dataset (assuming they are multiple xls files)?

Name:
Jon Doe
City:
Chicago
State:
IL
Zip:
60007
Angler Class:
Male
Fish Information:
Date Caught:
09/13/1999
Time:
8 pm
Length:
12.00 Inches
Weight:
Not Specified
Girth:
Not Specified
Fish Depth:
9 Feet
Lake Depth:
Not Specified
Kept/Release:
Released
Species:
Bass
Weather Information:
Sky:
Clear
Wind Direction:
South West
Wind Speed:
Strong
Lake Information:
Lake Name:
Pewaukee Lake
County:
Waukesha
State:
Wisconsin
Lake Bottom:
Junk Weeds
Structure:
Not Specified
Lure Information:
Lure Name:
Rapala
Primary Color:
Black
Secondary Color:
Not Specified
Presentation:
Casting

Output I want:

Name City State Zip Date Caught Time Species Lure Name Lake Name State
Jon Doe Chicago IL 60007 09/13/1999 8 pm Bass Rapala Pewaukee Lake Wisconsin
Jane Doe Pittsburgh PA 15237 08/15/1999 6 pm Bass Jig Lake St. Clair Michigan

*note that "state" is used twice in the list to represent the angler state and the lake state. I'd like to capture it as "angle_state" / "lake_state" on my table.

Upvotes: 0

Views: 63

Answers (1)

Jon Spring
Jon Spring

Reputation: 66415

Here's a tidyverse approach that takes the list, marks it as "header" if it has a colon, fills down the header names, sorts by each header to keep the response, if available, and then pivots wider to the desired format:

library(tidyverse)
my_data %>%

    # label headers and fill down
    rowid_to_column("row")  %>%
    mutate(header = str_detect(data, ":"),
           header_name = if_else(header, data, NA_character_),
           response    = if_else(header, NA_character_, data)) %>%
    fill(header_name) %>%
    
    # get best row for each header (ie non-NA if it exists)
    group_by(header_name) %>%
    arrange(response) %>%
    slice(1) %>% 
    ungroup() %>%
    
    # rearrange to original order and spread wider
    arrange(row) %>%
    select(-c(row:header)) %>% 
    pivot_wider(names_from = header_name, values_from = response)

Result

# A tibble: 1 x 29
  `Name:` `City:` `State:` `Zip:` `Angler Class:` `Fish Informatio… `Date Caught:` `Time:` `Length:`  `Weight:`   `Girth:`   `Fish Depth:` `Lake Depth:` `Kept/Release:` `Species:` `Weather Informat… `Sky:` `Wind Direction… `Wind Speed:` `Lake Informati… `Lake Name:` `County:` `Lake Bottom:` `Structure:` `Lure Informati… `Lure Name:`
  <chr>   <chr>   <chr>    <chr>  <chr>           <chr>             <chr>          <chr>   <chr>      <chr>       <chr>      <chr>         <chr>         <chr>           <chr>      <chr>              <chr>  <chr>            <chr>         <chr>            <chr>        <chr>     <chr>          <chr>        <chr>            <chr>       
1 Jon Doe Chicago IL       60007  Male            NA                09/13/1999     8 pm    12.00 Inc… Not Specif… Not Speci… 9 Feet        Not Specified Released        Bass       NA                 Clear  South West       Strong        NA               Pewaukee La… Waukesha  Junk Weeds     Not Specifi… NA               Rapala      
# … with 3 more variables: Primary Color: <chr>, Secondary Color: <chr>, Presentation: <chr>

You should be able to take the output of this and dplyr::row_bind to other data to grow your table to include multiple people.

Upvotes: 1

Related Questions