kill9all

Reputation: 111

How to parse a file and change strings in R

I have a large text file that needs to have some strings changed in R (which I'm unfamiliar with). My file is called trips.

My columns would include route_number, trip_name, direction...

The problem I have is that trip_name is consistently misspelled. I have thousands of trips titled "123 to Kalamazoo" that actually need to be "Only bus downtown". There are about a dozen such errors in a file of about 60,000 records, but I'm not looking for an elegant solution. Other people (with minimal IT skills) need to be able to go in and add corrections as required.

My attempt is the script below, but it has the net effect of changing every trip_name in the file to "Only bus downtown".

It's not pretty, but cutting and pasting the second line of the script and changing the values seems easy enough for my end users.

TripRename <- read.csv("data/trips.txt", header = TRUE, stringsAsFactors = FALSE)
TripRename$trip_name <- gsub("123 to Kalamazoo", "Only bus downtown", "123 to Kalamazoo")

Upvotes: 0

Views: 95

Answers (2)

r2evans

Reputation: 160827

If you're looking for others to be able to easily work with it, you have some options, depending on the variability of the data and their abilities. I'll assume that you have multiple such errata that all need to be updated.

  1. Starting with @ShubhamPujan's recommendation, you can do a one-for-one change, assuming that your verbatim wrong entries are in a vector and all need to be changed to "Only bus downtown":

    bad_name <- "123 to Kalamazoo"
    TripRename$new_name <- replace(TripRename$trip_name, TripRename$trip_name %in% bad_name, "Only bus downtown")
    TripRename
    #          trip_name          new_name
    # 1 123 to Kalamazoo Only bus downtown
    # 2 456 to elsewhere  456 to elsewhere
    

    This works fine assuming that they all change to the same string.

  2. If you have more than one correction, then you can use a search/replace. One method is a named vector,

    lookup1 <- c("123 to Kalamazoo" = "Only bus downtown")
    new_names <- lookup1[TripRename$trip_name]
    TripRename$new_name <- ifelse(is.na(new_names), TripRename$trip_name, new_names)
    TripRename
    #          trip_name          new_name
    # 1 123 to Kalamazoo Only bus downtown
    # 2 456 to elsewhere  456 to elsewhere
    

    or a two-column frame:

    lookupdf <- data.frame(
      trip_name = "123 to Kalamazoo",
      new_name = "Only bus downtown"
    )
    lookupdf
    #          trip_name          new_name
    # 1 123 to Kalamazoo Only bus downtown
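    # assumes TripRename has no new_name column yet; otherwise merge() would
    # create new_name.x / new_name.y. Note that merge() also sorts rows by trip_name.
    merged <- merge(TripRename, lookupdf, by = "trip_name", all.x = TRUE)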
    merged$new_name <- ifelse(is.na(merged$new_name), merged$trip_name, merged$new_name)
    merged
    #          trip_name          new_name
    # 1 123 to Kalamazoo Only bus downtown
    # 2 456 to elsewhere  456 to elsewhere
    
  3. If you have patterns (e.g., "456 to Kalamazoo" should also be updated), then you can either add each one verbatim (as above) or use regular expressions. I should note, though, that people inexperienced with regex can easily write patterns that match too much or too little (false positives, false negatives).

    regex <- data.frame(
      trip_name = c("^[0-9]+ to Kalamazoo$", "quux foo"),
      new_name = c("Only bus downtown", "Another bus nowhere")
    )
    regex
    #               trip_name            new_name
    # 1 ^[0-9]+ to Kalamazoo$   Only bus downtown
    # 2              quux foo Another bus nowhere
    TripRename$new_name <- Reduce(
      function(val, i) gsub(regex$trip_name[i], regex$new_name[i], val),
      seq_len(nrow(regex)), init = TripRename$trip_name)
    TripRename
    #          trip_name          new_name
    # 1 123 to Kalamazoo Only bus downtown
    # 2 456 to elsewhere  456 to elsewhere
    

    As an alternative to this, you can combine the concepts of lookupdf and regex using the fuzzyjoin package.
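
Since the question mentions about a dozen corrections, here is a minimal sketch of the named-vector approach from item 2 extended to several entries. The sample data and the second correction pair are made up for illustration; the idea is that end users only ever add one `"wrong" = "right"` line to `corrections`.

```r
# Hypothetical sample data standing in for data/trips.txt
TripRename <- data.frame(
  trip_name = c("123 to Kalamazoo", "456 to elsewhere", "789 to Nowhere"),
  stringsAsFactors = FALSE
)

# One line per correction: "wrong name" = "right name".
# The second pair is invented for this example.
corrections <- c(
  "123 to Kalamazoo" = "Only bus downtown",
  "789 to Nowhere"   = "Crosstown express"
)

# Look up each trip_name; names with no correction come back as NA
fixed <- unname(corrections[TripRename$trip_name])

# Keep the original name wherever no correction was found
TripRename$new_name <- ifelse(is.na(fixed), TripRename$trip_name, fixed)
TripRename
```

Because this matches whole strings only, a typo in a "wrong name" entry simply leaves that row unchanged rather than altering unrelated rows.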

Upvotes: 1

kill9all

Reputation: 111

Thanks Shubham Pujan. Your answer was perfect.

TripRename$trip_name <- gsub("123 to Kalamazoo", "Only bus downtown", TripRename$trip_name)
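
For anyone landing here with the same symptom, a short sketch of why the line in the question overwrote every row: the third argument of `gsub()` is the text to search, so passing a literal string there yields a single result that R then recycles down the whole column. The two-row data frame is hypothetical, and `fixed = TRUE` is an optional addition (not part of the line above) that treats the search text literally instead of as a regular expression.

```r
# Hypothetical sample data standing in for data/trips.txt
TripRename <- data.frame(
  trip_name = c("123 to Kalamazoo", "456 to elsewhere"),
  stringsAsFactors = FALSE
)

# Call from the question: the third argument is a literal string, so gsub()
# returns one value, which would be recycled over every row on assignment
wrong <- gsub("123 to Kalamazoo", "Only bus downtown", "123 to Kalamazoo")
length(wrong)  # a single value, however many rows the file has

# Corrected call: the third argument is the column itself.
# fixed = TRUE matches the text literally rather than as a regex.
TripRename$trip_name <- gsub("123 to Kalamazoo", "Only bus downtown",
                             TripRename$trip_name, fixed = TRUE)
TripRename$trip_name
```

Note that `gsub()` replaces the text wherever it appears inside a value, so a longer name that merely contains "123 to Kalamazoo" would be partly rewritten as well.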

Upvotes: 1
