Pranav_m7

Reputation: 160

Extract a string from a column of a dataframe and add a new column using that string

def comp():
    for car in df.name:
        x=car.split(' ')
        return x[0]
df.car=comp()

I want to extract the car brand from the 'name' column and store it in a new 'car' column for some analysis, but this code doesn't work: the whole 'car' column gets filled with the same value.

Upvotes: 1

Views: 756

Answers (1)

cs95

Reputation: 402463

The fundamental issue is that your return statement is inside the loop, so the function returns the result of the first iteration and never looks at the remaining rows. You then assign that single value to the entire column, and it is broadcast across all rows, which is why every row shows the same value. What I would recommend is writing a function that operates on a single value (think of iterating over a list of names and applying your logic to one name at a time), then calling that function inside a loop or list comprehension to build the complete column.
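
To make the failure concrete, here is a minimal sketch that reproduces it on a small made-up frame (the sample names are illustrative, not your data). It assigns with df['car'] rather than df.car, since attribute-style assignment is not a reliable way to create a new column:

```python
import pandas as pd

# Illustrative sample data, not the asker's actual frame
df = pd.DataFrame({'name': ['aaa bb', 'ccc', 'ddd ee ff']})

def comp():
    for car in df['name']:
        x = car.split(' ')
        return x[0]  # exits the function during the first iteration

first_only = comp()     # a single string, taken from the first row only
df['car'] = first_only  # a scalar is broadcast to every row
print(df)
```

Every row of 'car' ends up holding the first row's brand. The single-value function below avoids both problems.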

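import numpy as np
import pandas as pd

def try_split(val):
    # Return the first whitespace-separated word; NaN for non-strings (e.g. missing values)
    try:
        return val.split()[0]
    except AttributeError:
        return np.nan

df = pd.DataFrame({'name': ['aaa bb', 'ccc', 'ddd ee ff', np.nan]})
df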

        name
0     aaa bb
1        ccc
2  ddd ee ff
3        NaN

df['car'] = [try_split(val) for val in df['name']]
df

        name  car
0     aaa bb  aaa
1        ccc  ccc
2  ddd ee ff  ddd
3        NaN  NaN

This is a list comprehension, and it is a perfectly reasonable way of getting it done. It's not slower than the pandaic method (see below) and offers a good degree of flexibility and control over the function and its error handling. I've written more about list comprehensions in this post: Are for-loops in pandas really bad? When should I care?


However, here's a more pandaic way of doing things: split on whitespace with str.split and take the first word using str[0]:

# str.split() splits on whitespace by default
df['car'] = df['name'].str.split().str[0]
df

        name  car
0     aaa bb  aaa
1        ccc  ccc
2  ddd ee ff  ddd
3        NaN  NaN

This isn't any more vectorized than the loop above, but it hides much of the complexity and corner-case handling behind the method call and is a lot more readable.
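
As a rough illustration of the corner-case point, the sketch below compares the two approaches on empty and whitespace-only names. It assumes a variant of try_split that also catches IndexError (an addition for illustration, not part of the function shown earlier), because ''.split()[0] raises IndexError rather than AttributeError:

```python
import numpy as np
import pandas as pd

def try_split(val):
    try:
        return val.split()[0]
    except (AttributeError, IndexError):
        # AttributeError: non-string such as NaN
        # IndexError: empty or whitespace-only string (assumed extra case)
        return np.nan

df = pd.DataFrame({'name': ['aaa bb', 'ccc', '', '   ', np.nan]})

loop_result = [try_split(val) for val in df['name']]
str_result = df['name'].str.split().str[0].tolist()

# Treat two missing values as equal when comparing row by row
agree = all(
    (pd.isna(a) and pd.isna(b)) or a == b
    for a, b in zip(loop_result, str_result)
)
print(agree)
```

With the extra exception handled, both approaches should give a missing value for the empty and whitespace-only rows. The str accessor does this without any explicit error handling.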

Upvotes: 1
