Extract characters within certain symbols

Question

I have extracted text from an HTML file, and have the whole thing in a string.

I am looking for a way to loop through the string, extract only the values that are within square brackets, and put those values in a list.

I have looked into several questions, among them this one: Extract character before and after "/"

But I am having a hard time modifying it. Can someone help?

Solved!

Thank you for all your input. I will definitely look more into regex. I managed to do what I wanted in a fairly manual way (it may not be beautiful):

import html2text

# remove all HTML markup and append the plain text to one string
# (html_file is assumed to be an iterable of HTML lines, e.g. an open file)
html_string = ''
for line in html_file:
    html_string += html2text.html2text(line)

# True while the current character is inside [ ... ]
add = False
clean_string = ''

# keep only the characters within [ and ], brackets included
for ch in html_string:
    if ch == '[':
        add = True
    if ch == ']':
        add = False
        clean_string += ch
    if add:
        clean_string += ch

# drop the outermost brackets, then split into a list on the inner ']['
clean_string_list = clean_string.strip('[]').split('][')
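
For comparison, the same scan can collect each value straight into a list, which makes the final split unnecessary. This is only a sketch: the function name extract_bracketed is made up for illustration, and it assumes the brackets are not nested.

```python
def extract_bracketed(text):
    """Collect the text between each '[' and the next ']' (no nesting)."""
    values = []
    current = None  # None while outside brackets, a list of chars while inside
    for ch in text:
        if ch == '[':
            current = []
        elif ch == ']':
            if current is not None:
                values.append(''.join(current))
            current = None
        elif current is not None:
            current.append(ch)
    return values


print(extract_bracketed('spam[eggs][hello world]'))
# ['eggs', 'hello world']
```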

The HTML file that I want as pure text (and later as a dataframe) instead of HTML is my personal Facebook data, which I have downloaded.

nick_14159 · Accepted Answer

Try out this regex. Given a string, it will place the text inside each [ ] pair into a list.

import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
>>> ['eggs', 'hello']

Also, this is a great reference for building your own regex: https://regex101.com
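
One thing to watch for: \w+ only matches letters, digits and underscores, so a bracketed value containing a space or punctuation is skipped. If your data has values like that, a negated character class may work better. A small sketch, where the sample string and the variable names are made up for illustration:

```python
import re

text = 'spam[eggs][hello world][2019-01-01]'

# \w+ matches only letters, digits and underscores
words_only = re.findall(r'\[(\w+)\]', text)
print(words_only)
# ['eggs']

# [^\[\]]+ matches any run of characters that are not square brackets
anything = re.findall(r'\[([^\[\]]+)\]', text)
print(anything)
# ['eggs', 'hello world', '2019-01-01']
```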

EDIT: If you have nested square brackets, here is a function that will handle that case.

import re
test ='spam[eg[nested]gs][hello]'

def square_bracket_text(test_text,found):
    """Find text enclosed in square brackets within a string"""
    matches = re.findall(r'\[(\w+)\]',test_text)
    if matches:
        found.extend(matches)
        for word in found:
            test_text = test_text.replace('[' + word + ']','')
        square_bracket_text(test_text,found)
    return found

match = []
print(square_bracket_text(test,match))
>>>['nested', 'hello', 'eggs']
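
If you would rather not use regex for the nested case, a single pass with a stack is another option. This is a rough sketch, not a drop-in replacement: the name bracket_contents is hypothetical, it keeps spaces and punctuation, and it returns values in the order their closing brackets appear, so the order differs from the function above.

```python
def bracket_contents(text):
    """Return the text of every [...] group, in the order the groups close."""
    found = []
    stack = []  # one character buffer per currently open '['
    for ch in text:
        if ch == '[':
            stack.append([])
        elif ch == ']':
            if stack:  # ignore a stray ']' with no matching '['
                found.append(''.join(stack.pop()))
        elif stack:
            # characters belong to the innermost open group only
            stack[-1].append(ch)
    return found


print(bracket_contents('spam[eg[nested]gs][hello]'))
# ['nested', 'eggs', 'hello']
```

Text in an unclosed [ is dropped, since no closing bracket ever pops its buffer.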

Hope it helps!
