Emm
Emm

Reputation: 2507

Comparing values in one column to values in a column in a second dataframe

I have 2 dataframes, each of the dataframe has a RegionName column. I would like to create a new column that will has the value 'true' if the RegionName in dataframe A appears under RegionName in dataframe B.

I have written a lambda function that attempts to do this but returns false for all items in my dataframe (which is not the case). I have also attempted to use np.where() but to no avail

housing = convert_housing_data_to_quarters()
housing = housing.iloc[:,[34,35,36,37]]
university = get_list_of_university_towns()
housing = housing.reset_index()
housing['University City'] = housing.apply(lambda x: x['RegionName'] in university['RegionName'], axis=1)

This is my attempt at using np.where()

housing['University City'] = np.where(housing['RegionName'] == university['RegionName'](axis=1),'true','false')

Upvotes: 3

Views: 94

Answers (3)

BENSALEM Mouhssen
BENSALEM Mouhssen

Reputation: 71

please try the fellowing code, i make it for you as a tutorial ;)

import pandas as pd
import numpy as np

region_names = ["region_a", "region_b", "region_c", "region_d", "region_e", 
"region_f", "region_g", "region_h"]

# Generate 100 random university names
univ_names = ["univ-%s"%i for i in range(100)]

# Select 100 random university regions from region_names
univ_regions = [region_name[np.random.randint(low=0, high=len(region_names))] for i     
in range(len(univ_names))]

# make a universities DataFrame >> DataFrame1
universities = pd.DataFrame(data=list(zip(univ_names, univ_regions)), columns=    
["univ_name", "univ_region"])

# Now, you select the half number of regions, to make comparaison >> DataFrame2
regions =pd.DataFrame(data=[region_names[np.random.randint(low=0,     
high=len(region_names))] for i in range(len(region_names) // 2)], columns= 
["region_name"])

# get unique list of region name from DataFrame2
unique_regions = regions["region_name"].unique()

# Create the third column in DataFrame1 >> bool column
universities["isin_unique_regions"] = universities.univ_region.isin(unique_regions)
universities.head()

Upvotes: 0

jezrael
jezrael

Reputation: 862511

Use isin for compare by multiple values of column of another DataFrame for boolean mask:

mask = housing['RegionName'].isin(university['RegionName'])
housing['University City'] = mask
#if need True, False to 1,0 mapping
#housing['University City'] = mask.astype(int)

Another solution is use numpy.where what is better if want create new values by mask:

housing['University City'] = np.where(mask, 'match', 'no match')

Upvotes: 2

jpp
jpp

Reputation: 164623

Pandas series have dictionary-like properties. Consider the following:

s = pd.Series(['a', 'b', 'c', 'd'], index=[10, 11, 12, 13])

'a' in s  # False
10 in s   # True

So you are in fact checking for existence in a series index rather than series values. In addition, apply + lambda is an inefficient, Python-level loop.

With Pandas you should look to vectorise operations, e.g. via pd.Series.isin, which implicitly uses values. In addition, it will likely be efficient to make your comparison series unique first:

unique_vals = university['RegionName'].unique()
housing['University City'] = housing['RegionName'].isin(unique_vals)

This creates a Boolean series. If 0 / 1 is a requirement, you can then convert to int:

housing['University City'] = housing['University City'].astype(int)

Upvotes: 3

Related Questions