New

Reputation: 25

Comparing two sets of data with Intersection in Python

When I compare two sets, following_id and follower_id, the result seems to be split into individual characters instead of whole IDs.

import re
id1 = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490,      ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'
id2 = '[User(ID=1234467890, ScreenName=sdf), User(ID=233323490,  ScreenName=AnotherRandomName), User(ID=342, ScreenName=443)]'

following_id = ', '.join( re.findall(r'ID=(\d+)', id1) )
follower_id = ', '.join( re.findall(r'ID=(\d+)', id2) )

a = list(set(following_id).intersection(follower_id))
print a

This results in [' ', ',', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8']

I would like the result to be ['233323490'], which is the only ID that appears in both sets.

The following works for me:

list1 = [1234567890, 233323490, 4459284, 230, 200, 234, 200, 0002]
list2 = [1234467890, 233323490, 342, 101, 234]
a = list(set(list1).intersection(list2))
print a

With a result of [233323490, 234]

Does this have to do with the datatype for following_id and follower_id?

Upvotes: 0

Views: 225

Answers (2)

Trey Hunner

Reputation: 11814

following_id and follower_id are strings. When you convert a string to a set, you'll get a set of each of the characters:

>>> set('hello, there')
{' ', 'o', 't', 'e', 'r', 'h', ',', 'l'}

When making the set, Python doesn't care about the commas or spaces in your string... it just iterates over the characters treating each as an item in the new set.
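A minimal sketch of the difference, using made-up ID strings (the values are illustrative only and are not taken from the question; it is written for Python 3, so it uses print() as a function):

```python
# A string is iterated character by character, so set() yields single characters.
joined = '12, 34'
chars = set(joined)

# A list of strings is iterated item by item, so set() yields whole strings.
ids = set(['12', '34'])

print(sorted(chars))  # [' ', ',', '1', '2', '3', '4']
print(sorted(ids))    # ['12', '34']
```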

You're looking for a set of strings, so you need to pass set() something that contains strings. re.findall already gives you a list of strings. If you don't join them together, you should be able to take the intersection and get what you're looking for:

following_id = re.findall(r'ID=(\d+)', id1)
follower_id = re.findall(r'ID=(\d+)', id2)

a = list(set(following_id).intersection(follower_id))
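Put together with the sample data from the question, a complete run might look like the sketch below. It assumes Python 3 (hence print() as a function), and the name common is just an illustrative choice. The result is sorted only to make the output order predictable, since sets are unordered.

```python
import re

id1 = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'
id2 = '[User(ID=1234467890, ScreenName=sdf), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=342, ScreenName=443)]'

# Keep the lists returned by re.findall; do not join them into one string.
following_id = re.findall(r'ID=(\d+)', id1)
follower_id = re.findall(r'ID=(\d+)', id2)

# Intersect the two sets of ID strings.
common = sorted(set(following_id).intersection(follower_id))
print(common)  # ['233323490']
```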

Upvotes: 0

Darkstarone

Reputation: 4730

This is because you're making strings with .join, not lists:

following_id = ', '.join( re.findall(r'ID=(\d+)', id1) )
follower_id = ', '.join( re.findall(r'ID=(\d+)', id2) )
print(following_id) # '1234567890, 233323490, 4459284'
print(follower_id) # '1234467890, 233323490, 342'

You just need to use:

following_id = re.findall(r'ID=(\d+)', id1)
follower_id = re.findall(r'ID=(\d+)', id2)

This works because re.findall already returns a list of matches.
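If you would rather compare the IDs as integers, as in the working list example from the question, one possible approach is to convert each match before building the sets. The helper name common_ids and the sample strings are hypothetical, chosen only for this sketch (Python 3):

```python
import re

def common_ids(text_a, text_b):
    """Return the IDs found in both strings, as a sorted list of ints."""
    pattern = re.compile(r'ID=(\d+)')
    # Convert each matched digit string to int before building the sets.
    ids_a = {int(m) for m in pattern.findall(text_a)}
    ids_b = {int(m) for m in pattern.findall(text_b)}
    return sorted(ids_a & ids_b)

# Hypothetical sample input, not the data from the question.
sample_a = 'User(ID=10, ScreenName=a), User(ID=20, ScreenName=b)'
sample_b = 'User(ID=20, ScreenName=b), User(ID=30, ScreenName=c)'
result = common_ids(sample_a, sample_b)
print(result)  # [20]
```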

Upvotes: 1
