Reputation: 1679
I have two lists, a and b. They look like this:
a = [
'And',
"you're",
'going',
'to',
'use',
'some',
'handouts.',
'Okay.',
'So',
'I',
'needed',
'to',
'know',
'and',
'for,'
]
b = [
'And',
"you're",
'going',
'to',
'use',
'some',
'handouts.',
'Okay.',
'I',
'needed',
'to',
'know',
'and',
'for,',
'it'
]
I want to ensure that they can be zipped together so that every pair matches. However, they do not, as can be seen here:
x = list(zip(a, b))
for i in x:
    print(i)
('And', 'And')
("you're", "you're")
('going', 'going')
('to', 'to')
('use', 'use')
('some', 'some')
('handouts.', 'handouts.')
('Okay.', 'Okay.')
---> ('So', 'I')
('I', 'needed')
('needed', 'to')
('to', 'know')
('know', 'and')
('and', 'for,')
('for,', 'it')
It can be seen that a contains 'So' and b does not. To fix this, I want to drop 'So' from a, which would result in this:
('And', 'And')
("you're", "you're")
('going', 'going')
('to', 'to')
('use', 'use')
('some', 'some')
('handouts.', 'handouts.')
('Okay.', 'Okay.')
('I', 'I')
('needed', 'needed')
('to', 'to')
('know', 'know')
('and', 'and')
('for,', 'for,')
Essentially, if a word exists in one list but not in the other within the same general index area, I want to remove it, regardless of whether it is in a or b. I have used the fuzzywuzzy library for fuzzy matching, which does decently well, but it is very slow. Are there more efficient ways to do this?
Upvotes: 0
Views: 55
Reputation: 44148
The idea is to remove from a those items which are not in b, and vice versa. Using sets is an efficient way to compute this for large lists:
a = [
'And',
"you're",
'going',
'to',
'use',
'some',
'handouts.',
'Okay.',
'So',
'I',
'needed',
'to',
'know',
'and',
'for,'
]
b = [
'And',
"you're",
'going',
'to',
'use',
'some',
'handouts.',
'Okay.',
'I',
'needed',
'to',
'know',
'and',
'for,',
'it'
]
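set_a = set(a)
set_b = set(b)
# Items that occur only in a. Note that list.remove() deletes
# just the first occurrence of each item.
remove_a = set_a - set_b
for item in remove_a:
    a.remove(item)
# Items that occur only in b.
remove_b = set_b - set_a
for item in remove_b:
    b.remove(item)
x = list(zip(a, b))
for item in x:
    print(item)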
Prints:
('And', 'And')
("you're", "you're")
('going', 'going')
('to', 'to')
('use', 'use')
('some', 'some')
('handouts.', 'handouts.')
('Okay.', 'Okay.')
('I', 'I')
('needed', 'needed')
('to', 'to')
('know', 'know')
('and', 'and')
('for,', 'for,')
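Note that the set approach ignores position: a word that occurs in both lists is kept even if the two occurrences are far apart. If matching "within the general index area" matters, a minimal sketch using difflib.SequenceMatcher from the standard library could look like the following. The helper name align_common is hypothetical, the short lists are just a cut-down version of the question's data, and autojunk=False is my assumption so that frequent words in long lists are not treated as junk.

```python
from difflib import SequenceMatcher


def align_common(a, b):
    """Keep only the runs of items that line up, in order, in both lists."""
    # autojunk=False is an assumption: it turns off the heuristic that
    # ignores very frequent items in sequences of 200 or more elements.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept_a, kept_b = [], []
    for block in matcher.get_matching_blocks():
        # Each block means a[block.a:block.a + size] == b[block.b:block.b + size].
        # The final block is a sentinel with size 0 and adds nothing.
        kept_a.extend(a[block.a:block.a + block.size])
        kept_b.extend(b[block.b:block.b + block.size])
    return kept_a, kept_b


a = ['Okay.', 'So', 'I', 'needed', 'to', 'know']
b = ['Okay.', 'I', 'needed', 'to', 'know', 'it']
kept_a, kept_b = align_common(a, b)
for pair in zip(kept_a, kept_b):
    print(pair)
```

Because only matching blocks are copied, both results always have the same length, so zip never pairs two different words.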
Upvotes: 1
Reputation: 429
I don't know if this would be faster, but I think you could just use two list comprehensions:
original_a = [ 'And', "you're", 'going' ] # etc
original_b = [ 'And', "you're", 'going' ] # etc
common_a = [x for x in original_a if x in original_b]
common_b = [x for x in original_b if x in original_a]
zipped_result = zip(common_a, common_b)
This should preserve order, and I think it gets you what you want.
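On the speed question: a test like x in original_b scans the whole list each time, so the comprehensions above take time roughly proportional to len(original_a) * len(original_b). A minimal sketch of the same idea with set lookups is below. The helper name keep_common is hypothetical, and it assumes the items are hashable, which strings are.

```python
def keep_common(original_a, original_b):
    """Return both lists restricted to items present in the other list."""
    # Build the sets once so each membership test is a hash lookup
    # rather than a scan of the whole list.
    set_a = set(original_a)
    set_b = set(original_b)
    common_a = [x for x in original_a if x in set_b]
    common_b = [x for x in original_b if x in set_a]
    return common_a, common_b


original_a = ['Okay.', 'So', 'I', 'needed', 'to']
original_b = ['Okay.', 'I', 'needed', 'to', 'it']
common_a, common_b = keep_common(original_a, original_b)
zipped_result = list(zip(common_a, common_b))
print(zipped_result)
```

Order is still preserved because the lists, not the sets, are iterated.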
Upvotes: 1
Reputation: 16147
c = set(a) & set(b)
# if the order does not matter
list(zip(c,c))
# If the order does matter
list(zip([x for x in a if x in c],
         [x for x in b if x in c]))
Upvotes: 1