anonymus
anonymus

Reputation: 49

How to create new columns based on values before delimiter and add values after the delimiter in Pandas?

I am reading a .txt in Pandas. I want to create new column names based on values before '=>' and add values in that column, based on value after '=>'

Input:

other_tags
"access"=>"agricultural","agricultural"=>"no"
"access"=>"customers"
"access"=>"customers","amenity"=>"parking"
"description"=>"GRAND PERE DE MON AMI"
"description"=>"GRAND PERE"
"design"=>"asymmetric","power"=>"tower"
"design"=>"asymmetric","power"=>"tower"

code:

import pandas as pd
df1 = pd.read_csv('try.txt', sep ='\t')
df1['access_type'] = df1['other_tags'].get('access')
df1['amenity'] = df1['other_tags'].get('amenity')
print(df1)

expected_output:

other_tags                                         access_type                            amenity
"access"=>"agricultural","agricultural"=>"no"      agricultural 
"access"=>"customers"                              customers    
"access"=>"customers","amenity"=>"parking"         customers                              parking
"description"=>"GRAND PERE DE MON AMI"      
"description"=>"GRAND PERE"     
"design"=>"asymmetric","power"=>"tower"     
"design"=>"asymmetric","power"=>"tower"     


Upvotes: 1

Views: 481

Answers (1)

Corralien
Corralien

Reputation: 120479

Explode and expand your column other_tags then pivot it and finally join it to your original dataframe.

df = df.join(df['other_tags'].str.replace('"', '')
                 .str.split(',').explode().str.split('=>', expand=True)
                 .reset_index().pivot('index', 0, 1).fillna(''))

Output

>>> df
                                      other_tags        access agricultural  amenity            description      design  power
0  "access"=>"agricultural","agricultural"=>"no"  agricultural           no                                                   
1                          "access"=>"customers"     customers                                                                
2     "access"=>"customers","amenity"=>"parking"     customers               parking                                          
3         "description"=>"GRAND PERE DE MON AMI"                                      GRAND PERE DE MON AMI                   
4                    "description"=>"GRAND PERE"                                                 GRAND PERE                   
5        "design"=>"asymmetric","power"=>"tower"                                                             asymmetric  tower
6        "design"=>"asymmetric","power"=>"tower"                                                             asymmetric  tower

If you want to keep only the 2 columns access and amenity:

COLS_TO_KEEP = ['access', 'amenity']

df = df.join(df['other_tags'].str.replace('"', '')
                 .str.split(',').explode().str.split('=>', expand=True)
                 .reset_index().pivot('index', 0, 1)[COLS_TO_KEEP].fillna(''))

Output:

>>> df
                                      other_tags        access  amenity
0  "access"=>"agricultural","agricultural"=>"no"  agricultural         
1                          "access"=>"customers"     customers         
2     "access"=>"customers","amenity"=>"parking"     customers  parking
3         "description"=>"GRAND PERE DE MON AMI"                       
4                    "description"=>"GRAND PERE"                       
5        "design"=>"asymmetric","power"=>"tower"                       
6        "design"=>"asymmetric","power"=>"tower"                       

Update

Getting this error ValueError: Index contains duplicate entries, cannot reshape

If I slightly modify your dataframe:

                                      other_tags
0  "access"=>"agricultural","agricultural"=>"no"
1                          "access"=>"customers"
2     "access"=>"customers","amenity"=>"parking"
3         "description"=>"GRAND PERE DE MON AMI"
4                    "description"=>"GRAND PERE"
5        "design"=>"asymmetric","power"=>"tower"
6        "design"=>"asymmetric","power"=>"tower"
7            "access"=>"BONJOUR","access"=>"TOI"  # <- same tag on the same row

Use pivot_table instead of pivot:

COLS_TO_KEEP = ['access', 'amenity']
df = df.join(
  df['other_tags'].str.replace('"', '')
    .str.split(',').explode().str.split('=>', expand=True)
    .rename(columns={0: 'tag', 1: 'value'}).reset_index()
    .pivot_table('value', 'index', 'tag', aggfunc=','.join)[COLS_TO_KEEP].fillna('')
)

Output:

>>> df
                                      other_tags        access  amenity
0  "access"=>"agricultural","agricultural"=>"no"  agricultural         
1                          "access"=>"customers"     customers         
2     "access"=>"customers","amenity"=>"parking"     customers  parking
3         "description"=>"GRAND PERE DE MON AMI"                       
4                    "description"=>"GRAND PERE"                       
5        "design"=>"asymmetric","power"=>"tower"                       
6        "design"=>"asymmetric","power"=>"tower"                       
7            "access"=>"BONJOUR","access"=>"TOI"   BONJOUR,TOI         
#                                             HERE ---^-----^

Upvotes: 2

Related Questions