vagabond
vagabond

Reputation: 3594

extracting values of a column into a string and replacing values in a data frame column

More than the programming, I am lost on the right approach for this problem. I have 2 data frames with a market name column. Unfortunately the names vary by a few characters / symbols in each column, for e.g. Albany.Schenectady.Troy = ALBANY, Boston.Manchester = BOSTON.

I want to standardize the market names in both data frames so I can perform merge operations later.

I thought of tackling the problem in two steps: 1) Create a vector of the unique market names from both tables and use that to create a look up table. Something that looks like:

Table 1 Markets > "Albany.Schenectady.Troy" , "Albuquerque.Santa.Fe", "Atlanta" . . . .

Table2 Markets > "SPOKANE" , "BOSTON" . . .

I tried marketnamesvector <- paste(unique(Table1$Market, sep = "", collapse = ",")) but that doesn't produce the desired output.

2) Change Market names in Table 2 to equivalent market names in Table 1. For any market name not available in Table 1, Table 2 should retain the same value in market name.

I know I could use a looping function like below but I still need a lookup table I think.

replacefunc <- function (data, oldvalue, newvalue) { newdata <- data for (i in unique(oldvalue)) newdata[data == i] <- newvalue[oldvalue == i] newdata }

Table 1: This table is 90 rows x 2 columns and has 90 unique market names.

                   Market    Leads   Investment         Leads1     Leads2   Leads3
1 Albany.Schenectady.Troy      NA           NA            NA        NA       NA
2    Albuquerque.Santa.Fe      NA           NA            NA        NA       NA
3                 Atlanta      NA           NA            NA        NA       NA
4                  Austin      NA           NA            NA        NA       NA
5               Baltimore      NA           NA            NA        NA       NA

Table 2 : This table is 150K rows x 20 columns and has 89 unique market names.

> df

    Spot.ID    Date Hour     Time Local.Date Broadcast.Week Local.Hour Local.Time  Market
2   13072765 6/30/14    0 12:40 AM 2014-06-29              1         21    9:40 PM SPOKANE
261 13072946 6/30/14    5  5:49 AM 2014-06-30              1          5    5:49 AM  BOSTON
356 13081398 6/30/14   10 10:52 AM 2014-06-30              1          7    7:52 AM SPOKANE
389 13082306 6/30/14   11 11:25 AM 2014-06-30              1          8    8:25 AM SPOKANE
438 13082121 6/30/14    8  8:58 AM 2014-06-30              1          8    8:58 AM  BOSTON
469 13081040 6/30/14    9  9:17 AM 2014-06-30              1          9    9:17 AM  ALBANY
482 13080104 6/30/14   12 12:25 PM 2014-06-30              1          9    9:25 AM SPOKANE
501 13082120 6/30/14    9  9:36 AM 2014-06-30              1          9    9:36 AM  BOSTON
617 13080490 6/30/14   13  1:23 PM 2014-06-30              1         10   10:23 AM SPOKANE

Upvotes: 1

Views: 589

Answers (1)

user3969377
user3969377

Reputation:

Assume that the data is in data frames df1, df2. The goal is to adjust the market names to be the same, they are currently slightly different.

First, list the markets, use the following command to list the unique names in df1, repeat for df2.

mk1 <- sort(unique(df1$market))
mk2 <- sort(unique(df2$market))
dmk12 <- setdiff(mk1,mk2)
dmk21 <- setdiff(mk2,mk1)

Use dmk12 and dmk21 to identify the different markets. Decide what names to use, and how they match up, let's change "Atlanta, GA" from df1 to "Atlanta" from df2. Then use

df2[df2$market=="Atlanta","market"] = "Atlanta, GA"

The format is

df_to_change[df_to_change[,"column"]=="old data", "column"] = "new data"

If you only have 90 names to correct, I would write out 90 change lines like the one above.

After adjusting all the names, do sort(unique(df)) again and use setdiff twice to confirm all the names are the same.

Upvotes: 2

Related Questions