ebuzz168
ebuzz168

Reputation: 1184

Splitting a column to a few column based on comma

I want to split a column which an address to a specific column like cities and province.

I have a data frame which looks like:

df:
+----------------------------------------------------------------------------------------------------------+
|location                                                                                          
+----------------------------------------------------------------------------------------------------------+
| Jl. Raya Pasir Putih No.6, RT.1/RW.6, Pasir Putih, Kec. Sawangan, Kota Depok, Jawa Barat 16519, Indonesia   
| Jl. Legenda Wisata, Wanaherang, Kec. Gn. Putri, Bogor, Jawa Barat 16965, Indonesia                 
| Jl. Blk. C7 No.17, Rangkapan Jaya Baru, Kec. Pancoran Mas, Kota Depok, Jawa Barat 16434, Indonesia 
| Jl. Cibuntu Sayuran No.12, Wr. Muncang, Kec. Bandung Kulon, Kota Bandung, Jawa Barat 40211, Indonesia
| 1 KOMP, Jl. Tirtawening No.10, Cisurupan, Kec. Cibiru, Kota Bandung, Jawa Barat 40614, Indonesia
+----------------------------------------------------------------------------------------------------------+

I want to extract that into another column named City and Province

The output may look like:

df:

+-------------+-------------------+------------+
| location    |  Cities           |  province  | 
+-------------+-------------------+------------+
|  .....      |  Kota Depok       | Jawa Barat |    
|  .....      |  Bogor            | Jawa Barat |      
|  .....      |  Kota Depok       | Jawa Barat |     
|  .....      |  Kota Bandung     | Jawa Barat |    
|  .....      |  Kota Bandung     | Jawa Barat |   
+-------------+------------+-------------------+

I tried using this method:

def extract_city_state(a):
    asplit = a.split(",")
    city = asplit[-3].split()
    state = asplit[-2].split()[0:1]
    return city, state

df.join(
    df['location'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
)

but it returns

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-29-64a945be5d02> in <module>
      1 df.join(
      2     df['location'].apply(
----> 3         lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
      4     )
      5 )

~\anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4043             else:
   4044                 values = self.astype(object).values
-> 4045                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4046 
   4047         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-29-64a945be5d02> in <lambda>(x)
      1 df.join(
      2     df['location'].apply(
----> 3         lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
      4     )
      5 )

<ipython-input-22-f1d63ccd82dc> in extract_city_state(a)
      1 def extract_city_state(a):
      2     asplit = a.split(",")
----> 3     city = asplit[-3].split()
      4     state = asplit[-2].split()[0:1]
      5     return city, state

IndexError: list index out of range

How to overcome this?

Upvotes: 3

Views: 118

Answers (2)

jezrael
jezrael

Reputation: 862511

Use only pandas str function for avoid errors if no match values selected by str[] indexing - first Series.str.split create Series of lists and last Series.str.rsplit split only by last space, because n=1 parameter:

s = df['location'].str.split(',')

df['city'] = s.str[-3]
df['province'] = s.str[-2].str.rsplit(n=1).str[0]
print (df)
                                            location           city  \
0  Jl. Raya Pasir Putih No.6, RT.1/RW.6, Pasir Pu...     Kota Depok   
1  Jl. Legenda Wisata, Wanaherang, Kec. Gn. Putri...          Bogor   
2  Jl. Blk. C7 No.17, Rangkapan Jaya Baru, Kec. P...     Kota Depok   
3  Jl. Cibuntu Sayuran No.12, Wr. Muncang, Kec. B...   Kota Bandung   
4  1 KOMP, Jl. Tirtawening No.10, Cisurupan, Kec....   Kota Bandung   

      province  
0   Jawa Barat  
1   Jawa Barat  
2   Jawa Barat  
3   Jawa Barat  
4   Jawa Barat  

Upvotes: 1

Darcey BM
Darcey BM

Reputation: 301

If you want to keep this as you have it as a function, just storing the lambda function result as a variable before joining it to the df works:

city_state_split = df['location'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
df.join(city_state_split)

Upvotes: 1

Related Questions