helloWorld
helloWorld

Reputation: 1

How to merge two different data frames columns based on str.contains()

I have one dataframe called df_companies.

Òutput: 
    company     brand 
0   VW-Konzern  volkswagen
1   VW-Konzern  audi
2   VW-Konzern  bentley
3   VW-Konzern  bugatti
4   VW-Konzern  lamborghini

In the next step i receive two strings within a for loop through some string formatting. After that i am trying to check, if the 'companyName' string is contained in the column 'brand' of the dataframe 'df_companies'.

If so, then add the logo_url string two the dataframe df_companies into column 'image_url'.

for image in images:
    companyName = image['alt'].lower().split(' ', 1)[0]
    logo_url = image['src']

    df_companies['image_url'] = np.where(df_companies['brand'].str.contains(companyName), logo_url, 'other')

This works so far for the first row. For the rest of the remaining rows it only enters the string 'other' as defined above.

Output: 
        company     brand       image_url
0       VW-Konzern  volkswagen  https://imgr.volkswagen.png
1       VW-Konzern  audi        Other
2       VW-Konzern  bentley     Other
3       VW-Konzern  bugatti     Other
4       VW-Konzern  lamborghini Other

What i would like to achieve is the following output:

Output: 
        company     brand       image_url
0       VW-Konzern  volkswagen  https://imgr.volkswagen.png
1       VW-Konzern  audi        https://imgr.audi.png
2       VW-Konzern  bentley     https://imgr.bentley.png
3       VW-Konzern  bugatti     https://imgr.audi.png
4       VW-Konzern  lamborghini https://imgr.audi.png

The issue is that the 'companyName' string and the string in the 'brand' column of dataframe 'df_companies' only match partially, why i cannot use the common merge function.

Any ideas how to solve that issue?

Thanks in advance for any help!

Upvotes: 0

Views: 54

Answers (2)

helloWorld
helloWorld

Reputation: 1

I could find a solution. I had to put in df_companies['image_url']into the else condition of np.whereinsteaf of 'other'.

df_companies['image_url'] = ''

    for image in images:

        companyName = image['alt'].lower().split(' ', 1)[0]
        logo_url = image['src']

        df_companies['image_url'] = np.where(df_companies['brand'].str.contains(companyName), logo_url, df_companies['image_url'],)

Afterwards i got the expected output:

Output:
    company     brand       image_url
0   VW-Konzern  volkswagen  https://imgr1.auto-motor-und-sport.de/Volkswagen-Logo-brandtreeIndexSmall-f4ae98d2-941118.jpg
1   VW-Konzern  audi        https://imgr1.auto-motor-und-sport.de/Audi-Logo-Neu-02-2019-brandtreeIndexSmall-44d434cb-1430808.jpg
2   VW-Konzern  bentley     https://imgr1.auto-motor-und-sport.de/Bentley-Logo-brandtreeIndexSmall-41a61902-702423.jpg
3   VW-Konzern  bugatti     https://imgr1.auto-motor-und-sport.de/Bugatti-Logo-brandtreeIndexSmall-252888c-946200.jpg
4   VW-Konzern  lamborghini https://imgr1.auto-motor-und-sport.de/Lamborghini-Logo-brandtreeIndexSmall-df88a8a3-945695.jpg 

Upvotes: 0

moys
moys

Reputation: 8033

You can use regex=True within the str.contains method & even a partial name should match. For example as below

df['image_url'] = np.where(df['brand'].str.contains('au' , regex=True), 'logo_url', 'other')

Output I just used text logo_url instead of the url.

company     brand           image_url
VW-Konzern  volkswagen      other
VW-Konzern  audi            logo_url
VW-Konzern  bentley         other
VW-Konzern  bugatti         other
VW-Konzern  lamborghini     other

Upvotes: 1

Related Questions