Reputation: 2507
I have 2 dataframes
, each of the dataframe
has a RegionName
column. I would like to create a new column that will has the value 'true' if the RegionName
in dataframe
A appears under RegionName
in dataframe
B.
I have written a lambda function that attempts to do this but returns false for all items in my dataframe
(which is not the case). I have also attempted to use np.where()
but to no avail
housing = convert_housing_data_to_quarters()
housing = housing.iloc[:,[34,35,36,37]]
university = get_list_of_university_towns()
housing = housing.reset_index()
housing['University City'] = housing.apply(lambda x: x['RegionName'] in university['RegionName'], axis=1)
This is my attempt at using np.where()
housing['University City'] = np.where(housing['RegionName'] == university['RegionName'](axis=1),'true','false')
Upvotes: 3
Views: 94
Reputation: 71
please try the fellowing code, i make it for you as a tutorial ;)
import pandas as pd
import numpy as np
region_names = ["region_a", "region_b", "region_c", "region_d", "region_e",
"region_f", "region_g", "region_h"]
# Generate 100 random university names
univ_names = ["univ-%s"%i for i in range(100)]
# Select 100 random university regions from region_names
univ_regions = [region_name[np.random.randint(low=0, high=len(region_names))] for i
in range(len(univ_names))]
# make a universities DataFrame >> DataFrame1
universities = pd.DataFrame(data=list(zip(univ_names, univ_regions)), columns=
["univ_name", "univ_region"])
# Now, you select the half number of regions, to make comparaison >> DataFrame2
regions =pd.DataFrame(data=[region_names[np.random.randint(low=0,
high=len(region_names))] for i in range(len(region_names) // 2)], columns=
["region_name"])
# get unique list of region name from DataFrame2
unique_regions = regions["region_name"].unique()
# Create the third column in DataFrame1 >> bool column
universities["isin_unique_regions"] = universities.univ_region.isin(unique_regions)
universities.head()
Upvotes: 0
Reputation: 862511
Use isin
for compare by multiple values of column of another DataFrame
for boolean mask:
mask = housing['RegionName'].isin(university['RegionName'])
housing['University City'] = mask
#if need True, False to 1,0 mapping
#housing['University City'] = mask.astype(int)
Another solution is use numpy.where
what is better if want create new values by mask:
housing['University City'] = np.where(mask, 'match', 'no match')
Upvotes: 2
Reputation: 164623
Pandas series have dictionary-like properties. Consider the following:
s = pd.Series(['a', 'b', 'c', 'd'], index=[10, 11, 12, 13])
'a' in s # False
10 in s # True
So you are in fact checking for existence in a series index rather than series values. In addition, apply
+ lambda
is an inefficient, Python-level loop.
With Pandas you should look to vectorise operations, e.g. via pd.Series.isin
, which implicitly uses values. In addition, it will likely be efficient to make your comparison series unique first:
unique_vals = university['RegionName'].unique()
housing['University City'] = housing['RegionName'].isin(unique_vals)
This creates a Boolean series. If 0
/ 1
is a requirement, you can then convert to int
:
housing['University City'] = housing['University City'].astype(int)
Upvotes: 3