asguldbrandsen

Reputation: 173

Extract characters within certain symbols

I have extracted text from an HTML file, and have the whole thing in a string.

I am looking for a method to loop through the string, extract only the values that are within square brackets, and put them in a list.

I have looked into several questions, among them this one: Extract character before and after "/"

But I am having a hard time modifying it. Can someone help?

Solved!

Thank you for all your input; I will definitely look more into regex. I managed to do what I wanted in a fairly manual way (it may not be beautiful):

import html2text

# remove all HTML code and append the plain text to one string
# (html_file is assumed to be an iterable of HTML lines, e.g. the opened file)
html_string = ''
for line in html_file:
    html_string += html2text.html2text(line)

# True while the current character is inside [ ]
add = False
clean_string = ''

# keep only the characters within [ and ], brackets included
for char in html_string:
    if char == '[':
        add = True
    if char == ']':
        add = False
        clean_string += char
    if add:
        clean_string += char

# drop the outer brackets, then split on '][' to get a list without square brackets
clean_string_list = clean_string.strip('[]').split('][')
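
The same flag idea can also collect each bracketed value straight into a list, which avoids the final split on '']['. This is only a minimal sketch, assuming the brackets are never nested; the name values_in_brackets is made up for illustration.

```python
def values_in_brackets(text):
    """Collect the text between each '[' and the next ']' (no nesting)."""
    values = []
    current = None          # None means we are outside brackets
    for char in text:
        if char == '[':
            current = []    # start collecting a new value
        elif char == ']' and current is not None:
            values.append(''.join(current))
            current = None  # back outside brackets
        elif current is not None:
            current.append(char)
    return values

print(values_in_brackets('Name: [Anna] wrote [hello there] today'))
# ['Anna', 'hello there']
```

Text outside the brackets is ignored, and a stray ']' with no opening '[' is skipped.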

The HTML file that I want as pure text (and later as a dataframe) instead of HTML is my personal Facebook data, which I have downloaded.

Upvotes: 0

Views: 72

Answers (2)

hygull

Reputation: 8740

You can also use re.finditer() for this; see the example below.

Suppose the brackets contain only word characters; the regular expression is then \[\w+\].

If you wish, you can try it at https://rextester.com/XEMOU85362.

import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
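g = re.finditer(r"\[\w+\]", s)  # raw string so the backslashes reach the regex engine
l = list() # or, l = []

for m in g:
    text = m.group(0)       # the match including the brackets
    l.append(text[1:-1])    # strip the surrounding [ and ]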

print(l) # ['Programmer', 'Excellent']
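
Note that \w+ only matches letters, digits and underscores, so bracketed text containing spaces or punctuation is skipped. A possible variation, assuming the brackets are not nested, is a negated character class that accepts anything except brackets. The function name extract_bracketed is hypothetical.

```python
import re

def extract_bracketed(text):
    """Return the contents of every non-nested [...] pair in text."""
    # [^\[\]]+ matches one or more characters that are not square brackets
    return [m.group(1) for m in re.finditer(r"\[([^\[\]]+)\]", text)]

s = "<h1>Hello [Senior Programmer], you are [really, really good]</h1>"
print(extract_bracketed(s))
# ['Senior Programmer', 'really, really good']
```

With nested input such as "[a[b]c]" this pattern only returns the innermost value, 'b'.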

Upvotes: 1

nick_14159

Reputation: 49

Try this regex: given a string, it places all word-character text inside [ ] into a list.

import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
# ['eggs', 'hello']

Also, https://regex101.com is a great reference for building your own regex.

EDIT: If you have nested square brackets, here is a function that will handle that case.

import re
test = 'spam[eg[nested]gs][hello]'

def square_bracket_text(test_text, found):
    """Find text enclosed in square brackets within a string"""
    # innermost bracket pairs that contain only word characters
    matches = re.findall(r'\[(\w+)\]', test_text)
    if matches:
        found.extend(matches)
        # remove what was found so the enclosing pairs become matchable
        for word in found:
            test_text = test_text.replace('[' + word + ']', '')
        square_bracket_text(test_text, found)
    return found

match = []
print(square_bracket_text(test, match))
# ['nested', 'hello', 'eggs']
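
As an alternative for nested brackets, a single pass with a stack of open-bracket positions also works and keeps the inner brackets in the outer text instead of joining the pieces around them. This is a rough sketch rather than a drop-in replacement for the function above; bracket_contents is a made-up name.

```python
def bracket_contents(text):
    """Return the text inside every [...] pair, in the order the pairs close."""
    stack = []    # positions of '[' characters that are not closed yet
    found = []
    for pos, char in enumerate(text):
        if char == '[':
            stack.append(pos)
        elif char == ']' and stack:
            start = stack.pop()               # matching '[' for this ']'
            found.append(text[start + 1:pos])
    return found

print(bracket_contents('spam[eg[nested]gs][hello]'))
# ['nested', 'eg[nested]gs', 'hello']
```

Unlike the regex version, this also accepts spaces and punctuation inside the brackets, and unmatched ']' characters are ignored.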

Hope it helps!

Upvotes: 1
