tsb

Reputation: 53

Searching for near matching elements of list

I have lists like this:

boo = ['<a>', '<b>', '<c>', '</c>', '</b>', '</a>']

I'm trying to iterate over them, find adjacent matching pairs such as '<c>', '</c>', and remove both elements. The two tags have to be next to each other and have the same name in order to be removed. After a pair is removed, the code should go over the list again and keep removing pairs until the list is empty or no more pairs can be removed.

I'm thinking something like:

  for i in range(len(boo)): 
    for b in boo:
       if  boo[i]== '</'+ b +'>' and boo[i-1] == '<' + b +'>':
         boo.remove(boo[i])
         boo.remove(boo[i-1])
         print(boo)

but that doesn't appear to be doing anything. Can someone point me to my problem?

EDIT

I changed it to something more like this, but now I get an error saying i is not defined. Why doesn't my code define i?

def valid_html1(test_strings):
    valid = []
    for h in test_strings:
      boo = re.findall('\W+\w+\W', h)
      while i in boo == boo[i]:
         if boo[i][1:] == boo[i+1][2:]:
             boo.remove(boo[i])
             boo.remove(boo[i+1])
             print(boo)

valid_html1(example_set)

Upvotes: 0

Views: 88

Answers (3)

blhsing

Reputation: 106445

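You should parse the strings to extract the tag names from the angle brackets before you make comparisons. You can use zip to pair each tag with the one after it, and copy tags to a new list until an opening tag is immediately followed by its matching closing tag. That pair is skipped, the rest of the list is copied over, and the scan starts again until a full pass finds no pair: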

boo = ['<a>', '<b>', '<c>', '</c>', '</b>', '</a>']
while True:
    pairs = zip(boo, boo[1:] + [''])
    new_boo = []
    for a, b in pairs:
        if a.startswith('<') and a.endswith('>') and \
                b.startswith('</') and b.endswith('>') and a[1:-1] == b[2:-1]:
            next(pairs)
            boo = new_boo
            boo.extend(a for a, _ in pairs)
            break
        new_boo.append(a)
    else:
        break
print(boo)

This outputs:

[]

And if boo = ['<a>', '<b>', '<c>', '</c>', '</b>', '</a>', '<d>'], this outputs:

['<d>']
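
The same end result can probably be reached in one pass with a stack, instead of restarting the scan after every removal. The following is a minimal sketch and not part of the original answer: `reduce_tags` is a hypothetical helper name, and it assumes every element is a plain `<name>` or `</name>` string without attributes.

```python
def reduce_tags(tags):
    """Repeatedly cancel adjacent <x>, </x> pairs using a stack (hypothetical helper)."""
    stack = []
    for tag in tags:
        # A closing tag cancels the tag on top of the stack only if that
        # tag is the opening tag with the same name.
        if stack and tag.startswith('</') and stack[-1] == '<' + tag[2:]:
            stack.pop()
        else:
            stack.append(tag)
    return stack


print(reduce_tags(['<a>', '<b>', '<c>', '</c>', '</b>', '</a>']))         # []
print(reduce_tags(['<a>', '<b>', '<c>', '</c>', '</b>', '</a>', '<d>']))  # ['<d>']
```

Because the stack always holds the already-reduced prefix, the only new pair that can form is between the top of the stack and the incoming tag, so each element is examined once.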

Upvotes: 1

In 99% of cases you shouldn't modify a list while iterating over it.

This solution makes a copy and then edits the original list:

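boo_copy = boo[:]
for i, b in enumerate(boo_copy):
   if i == 0:
      continue

   # Strip '</', '<' and '>' so only the bare tag name is left
   stripped_tag = b.replace("</","").replace(">","").replace("<","")
   # Compare neighbours in the copy: its indices stay valid while boo shrinks
   if boo_copy[i] == '</' + stripped_tag + '>' and boo_copy[i-1] == '<' + stripped_tag + '>':
      boo.remove(boo_copy[i])
      boo.remove(boo_copy[i-1])
      print(boo)

This assumes that the tags are unique in the list. A single pass only removes pairs that were adjacent in the original list, so repeat it until the list stops changing if the outer pairs should be removed as well.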

Upvotes: 0

Mateen Ulhaq

Reputation: 27201

import re

def open_tag_as_str(tag):
    m = re.match(r'^<(\w+)>$', tag)
    return None if m is None else m.group(1)

def close_tag_as_str(tag):
    m = re.match(r'^</(\w+)>$', tag)
    return None if m is None else m.group(1)

def remove_adjacent_tags(tags):
    def closes(a, b):
        a = open_tag_as_str(a)
        b = close_tag_as_str(b)
        return a is not None and b is not None and a == b

    # This is a bit ugly and could probably be improved with
    # some itertools magic or something
    skip = False
    for i in range(len(tags)):
        if skip:
            skip = False
        elif i + 1 < len(tags) and closes(tags[i], tags[i + 1]):
            skip = True
        else:
            yield tags[i]

boo = ['<a>', '<b>', '<c>', '</c>', '</b>', '</a>']
boo = list(remove_adjacent_tags(boo))
print(boo)

Gives:

['<a>', '<b>', '</b>', '</a>']
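
The generator above makes a single pass, so only the innermost pair is removed. To keep going until nothing more can be removed, as the question asks, one option is to repeat the pass until the list stops shrinking. This is a rough sketch rather than part of the answer: `is_pair`, `strip_pairs_once` and `strip_pairs_fully` are hypothetical names, and the sketch assumes tags have the plain form `<name>` or `</name>`.

```python
import re


def is_pair(open_tag, close_tag):
    """True if close_tag closes open_tag, e.g. '<c>' and '</c>'."""
    m_open = re.fullmatch(r'<(\w+)>', open_tag)
    m_close = re.fullmatch(r'</(\w+)>', close_tag)
    return bool(m_open and m_close and m_open.group(1) == m_close.group(1))


def strip_pairs_once(tags):
    """One left-to-right pass that drops each adjacent matching pair."""
    result = []
    i = 0
    while i < len(tags):
        if i + 1 < len(tags) and is_pair(tags[i], tags[i + 1]):
            i += 2  # skip both halves of the pair
        else:
            result.append(tags[i])
            i += 1
    return result


def strip_pairs_fully(tags):
    """Repeat the single pass until the list stops shrinking."""
    while True:
        reduced = strip_pairs_once(tags)
        if len(reduced) == len(tags):
            return reduced
        tags = reduced


boo = ['<a>', '<b>', '<c>', '</c>', '</b>', '</a>']
print(strip_pairs_once(boo))   # ['<a>', '<b>', '</b>', '</a>']
print(strip_pairs_fully(boo))  # []
```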

Upvotes: 0
