Reputation: 173
I have extracted text from an HTML file and have the whole thing in a string.
I am looking for a way to loop through the string, extract only the values inside square brackets, and put those strings in a list.
I have looked into several questions, among them this one: Extract character before and after "/"
But I am having a hard time modifying it. Can someone help?
Solved!
Thank you for all your input. I will definitely look more into regex. I managed to do what I wanted in a fairly manual way (it may not be beautiful):
import html2text

# html_file is assumed to be an open file (or any iterable of HTML strings)
html_string = ''
clean_string = ''
# remove all HTML code and append to string
for i in html_file:
    html_string += html2text.html2text(i)
# True while the current character is inside [ ]
add = False
# extract only values within [ ], based on add
for i in html_string:
    if i == '[':
        add = True
    if i == ']':
        add = False
        clean_string += i
    if add:
        clean_string += i
# split string into list without square brackets
# (strip the leading '[' and trailing ']' first, otherwise they stay in the list)
clean_string_list = clean_string.strip('[]').split('][')
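For anyone reading later, the character-by-character loop above can also be written so that it builds the list directly, without the intermediate string and the split. This is only a minimal sketch: the function name `extract_bracketed` is made up for illustration, and it assumes the brackets are not nested.

```python
def extract_bracketed(text):
    """Return the substrings found between '[' and ']' (assumes no nesting)."""
    results = []
    current = None  # None means we are currently outside brackets
    for ch in text:
        if ch == '[':
            current = []                      # start collecting a new item
        elif ch == ']':
            if current is not None:
                results.append(''.join(current))
            current = None                    # back outside brackets
        elif current is not None:
            current.append(ch)
    return results


print(extract_bracketed("a [first item] b [second] c"))
# ['first item', 'second']
```

Because each item is appended as soon as its closing bracket is seen, there are no leftover brackets to clean up afterwards.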
The HTML file that I want as plain text (and later as a dataframe) instead of HTML is my personal Facebook data, which I have downloaded.
Upvotes: 0
Views: 72
Reputation: 8740
You can also use re.finditer() for this; see the example below.
Suppose the brackets contain only word characters; the regular expression is then
\[\w+\]
If you wish, check it at https://rextester.com/XEMOU85362.
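import re
s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
g = re.finditer(r"\[\w+\]", s)  # raw string keeps the backslashes intact
l = list() # or, l = []
for m in g:
    text = m.group(0)      # the whole match, brackets included
    l.append(text[1: -1])  # drop the surrounding brackets
print(l) # ['Programmer', 'Excellent']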
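As a possible variation on the example above, a capture group can stand in for the slicing step. This is just a sketch using the same sample string; the names `pattern` and `words` are illustrative.

```python
import re

s = "<h1>Hello [Programmer], you are [Excellent]</h1>"

# The parentheses form a capture group, so group(1) is only the text
# between the brackets and no slicing is needed.
pattern = re.compile(r"\[(\w+)\]")
words = [m.group(1) for m in pattern.finditer(s)]
print(words)  # ['Programmer', 'Excellent']
```

finditer() is still handy here if you later need the match positions as well, for example via m.start() and m.end().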
Upvotes: 1
Reputation: 49
Try out this regex. Given a string, it places all text inside [ ] into a list.
import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
# prints ['eggs', 'hello']
Also, https://regex101.com is a great reference for building your own regex.
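One caveat: \w+ only matches word characters (letters, digits and underscore), so bracketed text that contains spaces or punctuation is skipped. If your data looks like that, a negated character class may work better. A minimal sketch, with a made-up sample string:

```python
import re

text = "spam[two words][a-b, c][]"

# [^\[\]]+ means: one or more characters that are neither '[' nor ']'
items = re.findall(r"\[([^\[\]]+)\]", text)
print(items)  # ['two words', 'a-b, c']
```

The empty pair [] is ignored because + requires at least one character; use * instead if you want empty strings in the result.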
EDIT: If you have nested square brackets, here is a function that handles that case.
import re
test = 'spam[eg[nested]gs][hello]'
def square_bracket_text(test_text, found):
    """Find text enclosed in square brackets within a string"""
    matches = re.findall(r'\[(\w+)\]', test_text)
    if matches:
        found.extend(matches)
        for word in found:
            test_text = test_text.replace('[' + word + ']', '')
        square_bracket_text(test_text, found)
    return found
match = []
print(square_bracket_text(test, match))
# prints ['nested', 'hello', 'eggs']
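If you would rather not combine recursion with string replacement, nested brackets can also be handled with a small stack of open-bracket positions. This is only a sketch, and `nested_bracket_text` is a made-up name. Note that it behaves differently from the function above: the outer item keeps its inner brackets ('eg[nested]gs' rather than 'eggs').

```python
def nested_bracket_text(text):
    """Return the contents of every [...] pair, in the order the pairs close."""
    found = []
    stack = []  # positions of '[' characters that are still open
    for pos, ch in enumerate(text):
        if ch == '[':
            stack.append(pos)
        elif ch == ']' and stack:
            start = stack.pop()               # matching opening bracket
            found.append(text[start + 1:pos])
    return found


print(nested_bracket_text('spam[eg[nested]gs][hello]'))
# ['nested', 'eg[nested]gs', 'hello']
```

An unmatched ']' is simply ignored, and an unmatched '[' produces no item.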
Hope it helps!
Upvotes: 1