thesecond
thesecond

Reputation: 445

Filter Series/DataFrame by another DataFrame

Let's suppose I have a Series (or DataFrame) s1, for example list of all Universities and Colleges in the USA:

                     University
0            Searcy Harding University
1          Angwin Pacific Union College
2    Fairbanks University of Alaska Fairbanks
3        Ann Arbor University of Michigan

And another Series (od DataFrame) s2, for example list of all cities in the USA:

      City
0    Searcy
1    Angwin 
2   New York 
3   Ann Arbor 

And my desired output (bascially an intersection of s1 and s2):

     Uni City
0     Searcy
1     Angwin 
2    Fairbanks 
3    Ann Arbor 

The thing is: I'd like to create a Series that consists of cities but only these, that have a university/college. My very first thought was to remove "University" or "College" parts from the s1, but it turns out that it is not enough, as in case of Angwin Pacific Union College. Then I thought of leaving only the first word, but that excludes Ann Arbor. Finally, I got a series of all the cities s2 and now I'm trying to use it as a filter (something similiar to .contains() or .isin()), so if a string s1 (Uni name) contains any of the elements of s2 (city name), then return only the city name.

My question is: how to do it in a neat way?

Upvotes: 2

Views: 67

Answers (2)

wwnde
wwnde

Reputation: 26676

Data Used

s=pd.DataFrame({'University':['Searcy Harding University','Angwin Pacific Union College','Fairbanks University of Alaska Fairbanks','Ann Arbor University of Michigan']})


s2=pd.DataFrame({'City':['Searcy','Angwin','Fairbanks','Ann Arbor']})

Convert s2.City to set to create an iterator

st=set(s2.City.unique().tolist())

Calculate s['Uni City'] using the next() function to return the next item from the iterator.

s['Uni City']=s['University'].apply(lambda x: next((i for i in st if i in x)), np.nan)

Outcome

enter image description here

Upvotes: 1

Serge Ballesta
Serge Ballesta

Reputation: 148910

I would try to build a list comprehension of cities that are contained in at least one university name:

pd.Series([i for i in s2 if s1.str.contains(i).any()], name='Uni City')

With your example data it gives:

0       Searcy
1       Angwin
2    Ann Arbor
Name: Uni City, dtype: object

Upvotes: 1

Related Questions