Krzysztof Owadowski
Krzysztof Owadowski

Reputation: 21

PySpark: create new column based on dictionary values matching with string in another column

I have a dataframe A that looks like this:

ID SOME_CODE TITLE
1 024df3 Large garden in New York, New York
2 0ffw34 Small house in dark Detroit, Michigan
3 93na09 Red carpet in beautiful Miami
4 8339ct Skyscraper in Los Angeles, California
5 84p3k9 Big shop in northern Boston, Massachusetts

I have also another dataframe B:

City Shortcut
Los Angeles LA
New York NYC
Miami MI
Boston BO
Detroit DTW

I would like to add new "SHORTCUT" column to dataframe A, based on the fact that "Title" column in A contains city from column "City" in dataframe B. I have tried to use dataframe B as dictionary and map it to dataframe A, but I can't overcome fact that city names are in the middle of the sentence.

The desired output is:

ID SOME_CODE TITLE SHORTCUT
1 024df3 Large garden in New York, New York NYC
2 0ffw34 Small house in dark Detroit, Michigan DTW
3 93na09 Red carpet in beautiful Miami, Florida MI
4 8339ct Skyscraper in Los Angeles, California LA
5 84p3k9 Big shop in northern Boston, Massachusetts BO

I will appreciate your help.

Upvotes: 2

Views: 766

Answers (1)

teedak8s
teedak8s

Reputation: 780

You can leverage pandas.apply function And see if this helps:

import numpy as np
import pandas as pd

data1={'id':range(5),'some_code':["024df3","0ffw34","93na09","8339ct","84p3k9"],'title':["Large garden in New York, New York","Small house in dark Detroit, Michigan","Red carpet in beautiful Miami","Skyscraper in Los Angeles, California","Big shop in northern Boston, Massachusetts"]}
df1=pd.DataFrame(data=data1)

data2={'city':["Los Angeles","New York","Miami","Boston","Detroit"],"shortcut":["LA","NYC","MI","BO","DTW"]}
df2=pd.DataFrame(data=data2)

# Creating a list of cities.
cities=list(df2['city'].values)

def matcher(x):
  for index,city in enumerate(cities):
    if x.lower().find(city.lower())!=-1:
      return df2.iloc[index]["shortcut"]
  return np.nan

df1['shortcut']=df1['title'].apply(matcher)
print(df1.head())

This would generate the following o/p:

   id some_code                                       title shortcut
0   0    024df3          Large garden in New York, New York      NYC
1   1    0ffw34       Small house in dark Detroit, Michigan      DTW
2   2    93na09               Red carpet in beautiful Miami       MI
3   3    8339ct       Skyscraper in Los Angeles, California       LA
4   4    84p3k9  Big shop in northern Boston, Massachusetts       BO

Upvotes: 2

Related Questions