Steven Wong
Steven Wong

Reputation: 13

Split Location into City, State Code and Country in Panda

I would like to know to split a location column into a few new columns such as city, state code and country in pandas. From this:

 'Location': {0: 'Warszawa, Poland',
  1: 'San Francisco, CA, United States',
  2: 'Los Angeles, CA, United States',
  3: 'Sunnyvale, CA, United States',
  4: 'Sunnyvale, CA, United States',
  5: 'San Francisco, CA, United States',
  6: 'Sunnyvale, CA, United States',
  7: 'Kraków, Poland',
  8: 'Shanghai, China',
  9: 'Mountain View, CA, United States',
  10: 'Boulder, CO, United States',
  11: 'Boulder, CO, United States',
  12: 'Xinyi District, Taiwan',
  13: 'Tel Aviv-Yafo, Israel',
  14: 'Wrocław, Poland',
  15: 'Singapore'}

To this:

 'Country': {0: 'Poland',
  1: 'United States',
  2: 'United States',
  3: 'United States',
  4: 'United States',
  5: 'United States',
  6: 'United States',
  7: 'Poland',
  8: 'China',
  9: 'United States',
  10: 'United States',
  11: 'United States',
  12: 'Taiwan',
  13: 'Israel',
  14: 'Poland',
  15: 'Singapore'}

Thanks.

Upvotes: 0

Views: 2740

Answers (3)

Vishwas
Vishwas

Reputation: 351

This is little more refined and does the same job, and can be fit in one line of code.

b['City'] = b['Location'].str.split(',').apply(lambda x: x[0])
b['Country'] = b['Location'].str.split(',').apply(lambda x: x[-1])
b

The output:

    Location                            City             Country
0   Warszawa, Poland                    Warszawa          Poland
1   San Francisco, CA, United States    San Francisco     United States
2   Los Angeles, CA, United States      Los Angeles       United States
3   Sunnyvale, CA, United States        Sunnyvale         United States
4   Sunnyvale, CA, United States        Sunnyvale         United States
5   San Francisco, CA, United States    San Francisco     United States
6   Sunnyvale, CA, United States        Sunnyvale         United States
7   Kraków, Poland                      Kraków            Poland
8   Shanghai, China                     Shanghai          China

This is the one line version, however I am having trouble to have them in two different columns. Something wrong here, I am not able to find it out.

b['City', 'Country']= pd.DataFrame (b['Location'].str.split(',').apply(lambda x:( x[0], x[-1]))) 


    (City,  Country)
0   (Warszawa, Poland)
1   (San Francisco, United States)
2   (Los Angeles, United States)
3   (Sunnyvale, United States)
4   (Sunnyvale, United States)
5   (San Francisco, United States)

Upvotes: 0

Vishwas
Vishwas

Reputation: 351

I am not sure if its the best method, others please comment or suggest a better method. I tried to split the data, but the challenge is that foreign countries have only city and country name and the entries from US have City, Sate and Country. Hence, I was not able to split it with one method. Below are two methods which I employed to split the data and then you have to figure out how to merge into one dataframe.

 b = pd.DataFrame ({'Location': {0: 'Warszawa, Poland',
  1: 'San Francisco, CA, United States',
  2: 'Los Angeles, CA, United States',
  3: 'Sunnyvale, CA, United States',
  4: 'Sunnyvale, CA, United States',
  5: 'San Francisco, CA, United States',
  6: 'Sunnyvale, CA, United States',
  7: 'Kraków, Poland',
  8: 'Shanghai, China',
  9: 'Mountain View, CA, United States',
  10: 'Boulder, CO, United States',
  11: 'Boulder, CO, United States',
  12: 'Xinyi District, Taiwan',
  13: 'Tel Aviv-Yafo, Israel',
  14: 'Wrocław, Poland',
  15: 'Singapore'}})

c[['City', 'Country']] = b['Location'].str.split(',', n=1, expand=True) # This splits the data into city and Country. So this works very well for Foriegn address or data with just city and country. 

 Out put is:

     City       Country
0   Warszawa    Poland
1   San Francisco   CA, United States
2   Los Angeles CA, United States
3   Sunnyvale   CA, United States
4   Sunnyvale   CA, United States
5   San Francisco   CA, United States
6   Sunnyvale   CA, United States
7   Kraków  Poland
8   Shanghai    China

Second method is:

regex = r'(?P<City>[^,]+)\s*,\s*(?P<State>[^\s]+)\s+(?P<Country>[^,]+)'
df=b['Location'].str.extract(regex)
df # This splits the data into City, State and Country, so it works well for US address. 

Output is :

    City       State    Country
0   NaN          NaN    NaN
1   San Francisco CA,   United States
2   Los Angeles CA,     United States
3   Sunnyvale   CA,     United States
4   Sunnyvale   CA,     United States
5   San Francisco CA,   United States
6   Sunnyvale   CA,     United States
7   NaN          NaN    NaN

Upvotes: 2

Kamaraju Kusumanchi
Kamaraju Kusumanchi

Reputation: 1964

$ ipython
Python 3.6.8 |Anaconda custom (64-bit)| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: d = {'Location': {0: 'Warszawa, Poland',
   ...:   1: 'San Francisco, CA, United States',
   ...:   2: 'Los Angeles, CA, United States',
   ...:   3: 'Sunnyvale, CA, United States',
   ...:   4: 'Sunnyvale, CA, United States',
   ...:   5: 'San Francisco, CA, United States',
   ...:   6: 'Sunnyvale, CA, United States',
   ...:   7: 'Kraków, Poland',
   ...:   8: 'Shanghai, China',
   ...:   9: 'Mountain View, CA, United States',
   ...:   10: 'Boulder, CO, United States',
   ...:   11: 'Boulder, CO, United States',
   ...:   12: 'Xinyi District, Taiwan',
   ...:   13: 'Tel Aviv-Yafo, Israel',
   ...:   14: 'Wrocław, Poland',
   ...:   15: 'Singapore'}}

In [2]: import pandas as pd
   ...: df = pd.DataFrame.from_dict(d)
   ...: df
Out[2]:
                            Location
0                   Warszawa, Poland
1   San Francisco, CA, United States
2     Los Angeles, CA, United States
3       Sunnyvale, CA, United States
4       Sunnyvale, CA, United States
5   San Francisco, CA, United States
6       Sunnyvale, CA, United States
7                     Kraków, Poland
8                    Shanghai, China
9   Mountain View, CA, United States
10        Boulder, CO, United States
11        Boulder, CO, United States
12            Xinyi District, Taiwan
13             Tel Aviv-Yafo, Israel
14                   Wrocław, Poland
15                         Singapore

In [3]: df['Country'] = df['Location'].str.split(',').apply(lambda x: x[-1])
   ...: df
Out[3]:
                            Location         Country
0                   Warszawa, Poland          Poland
1   San Francisco, CA, United States   United States
2     Los Angeles, CA, United States   United States
3       Sunnyvale, CA, United States   United States
4       Sunnyvale, CA, United States   United States
5   San Francisco, CA, United States   United States
6       Sunnyvale, CA, United States   United States
7                     Kraków, Poland          Poland
8                    Shanghai, China           China
9   Mountain View, CA, United States   United States
10        Boulder, CO, United States   United States
11        Boulder, CO, United States   United States
12            Xinyi District, Taiwan          Taiwan
13             Tel Aviv-Yafo, Israel          Israel
14                   Wrocław, Poland          Poland
15                         Singapore       Singapore

In [4]: df['Country'].to_dict()
Out[4]:
{0: ' Poland',
 1: ' United States',
 2: ' United States',
 3: ' United States',
 4: ' United States',
 5: ' United States',
 6: ' United States',
 7: ' Poland',
 8: ' China',
 9: ' United States',
 10: ' United States',
 11: ' United States',
 12: ' Taiwan',
 13: ' Israel',
 14: ' Poland',
 15: 'Singapore'}

Upvotes: 1

Related Questions