avgjoe13
avgjoe13

Reputation: 115

Extracting words from a column that are not in list and creating new column

I would like to extract a specific substring from a string value in a column. In the code below you'll see that I've managed to extract the year, make and model.

import pandas as pd
import numpy as np

data = [['year_make_model'], 
    ['1970 Plymouth Hemi Cuda'], 
    ['1970 Dodge Challenger R/T SE Convertible'], 
    ['1971 Plymouth Cuda']]

data = pd.DataFrame(data[1:], columns=data[0])

def create_year_make_model(df):
    df['year'] = df.year_make_model.str.extract('(\d\d\d\d)', expand=True) 
    df['make'] = np.where(df.year_make_model.str.contains('Plymouth'), 'Plymouth',
                             np.where(df.year_make_model.str.contains('Dodge'), 'Dodge', 'Nan'))
    df['model'] = np.where(df.year_make_model.str.contains('Cuda'), 'Cuda',
                             np.where(df.year_make_model.str.contains('Challenger'), 'Challenger', 'Nan'))
    return df.head()                   

create_year_make_model(data)

                             year_make_model    year        make         model
0   1970 Plymouth Hemi Cuda                     1970    Plymouth          Cuda
1   1970 Dodge Challenger R/T SE Convertible    1970    Dodge       Challenger
2   1971 Plymouth Cuda                          1971    Plymouth          Cuda

Now I need to extract all the remaining words (if there are any). The remaining words make up the vehicle 'type' (see example below).

car_type = ['Hemi', 'R/T SE Convertible', None]
data['type'] = car_type
data 

                             year_make_model    year        make         model                   type
0   1970 Plymouth Hemi Cuda                     1970    Plymouth          Cuda                   Hemi
1   1970 Dodge Challenger R/T SE Convertible    1970    Dodge       Challenger     R/T SE Convertible
2   1971 Plymouth Cuda                          1971    Plymouth          Cuda                   None

I've tried many different approaches so far without luck. One approach that I've tried is creating a list for all the year, make and model combinations word_list = ['Cuda','Challenger','1970','1971','Dodge','Plymouth']. I then made a for loop trying to extract the words in column 'year_make_model' which are not in the word_list but I couldn't get it to work. I hope someone knows a need trick to solve my problem, thanks in advance!

Upvotes: 0

Views: 120

Answers (1)

Ollie
Ollie

Reputation: 127

I'm not sure whether this is the best way to solve it, but it does seem to work:

import numpy as np
import pandas as pd

data = [['year_make_model'], 
    ['1970 Plymouth Hemi Cuda'], 
    ['1970 Dodge Challenger R/T SE Convertible'], 
    ['1971 Plymouth Cuda']]

data = pd.DataFrame(data[1:], columns=data[0])

def create_year_make_model(df):
    cell_text = df.year_make_model.str

    df['year'] = cell_text.extract('(\d\d\d\d)', expand=True)

    df['make'] = np.where(cell_text.contains('Plymouth'), 'Plymouth',
                           np.where(cell_text.contains('Dodge'), 'Dodge', 'Nan'))

    df['model'] = np.where(cell_text.contains('Cuda'), 'Cuda',
                             np.where(cell_text.contains('Challenger'), 'Challenger', 'Nan'))

    # what fields to remove
    rm_fields = ["year", "make", "model"]

    for field in rm_fields:
      # remove this field
      for val in getattr(df, field).values:
        cell_text = cell_text.replace(str(val), "").str

    # clean any leading/trailing spaces
    cell_text = cell_text.replace(r"^\s+", "")
    cell_text = cell_text.replace(r"\s+$", "")

    df['type'] = cell_text

    return df.head()    

print(create_year_make_model(data))

So after you have extracted the year, make and model fields you can go back and get the extracted values for those fields and do replace them with spaces.

Also any possible leading/trailing whitespace should be removed for consistency.

Upvotes: 1

Related Questions