Reputation: 658
I have a column with text that contains subheadings, such as '1. DESCRIPTION', '2. FOO', etc. I have all the possible subheadings in a list, but the issue is that not every entry in the column contains every subheading. I want to add columns to the df for every possible subheading and add the corresponding text after the subheadings into these columns.
A minimal example:
Text
'1. Description: example description here. 3. BAR: more text'
'1. Description: second example. 2. FOO: a foo'
should become
Description | Foo | Bar
example description here | | more text
second example | a foo |
I've tried making a function that converts a string and list of possible subheadings into a dictionary, with the idea of .apply()-ing it to the df. It works, but is not neat:
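def split_into_dict(input_string, separators):
    # Keep only the subheadings that occur in this string (assumed to be listed in order of appearance)
    seps_in_string = [sep for sep in separators if sep in input_string]
    split_text = []
    for sep in seps_in_string:
        # Split once on the subheading; the part before it belongs to the previous subheading
        text_part, input_string = input_string.split(sep, 1)
        split_text.append(text_part)
    # Whatever is left belongs to the last subheading found
    split_text.append(input_string)
    return dict(zip(seps_in_string, split_text[1:]))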
I'm not sure this is a good idea in general, but I'm also struggling on how to then use this function to create new columns.
Upvotes: 0
Views: 59
Reputation: 2484
I tried a solution with some regex:
import re
import pandas as pd

df = pd.DataFrame([
    {"Text": '1. Description: example description here. 3. BAR: more text'},
    {"Text": '1. Description: second example. 2. FOO: a foo'}
])
# regex to capture the column names (the word right before each colon)
reg_key = re.compile(r"([A-Za-z]*):")
# regex to capture the values (letters, digits and spaces after each colon)
reg_value = re.compile(r": ([A-Za-z0-9 ]*)")
output = pd.DataFrame()
for index, row in df.iterrows():
    txt = row['Text']
    # Find keys and values
    keys = reg_key.findall(txt)
    values = reg_value.findall(txt)
    # Write each value into the column named after its key
    for key, value in zip(keys, values):
        output.at[index, key] = value
Output:
                Description        BAR    FOO
0  example description here  more text    NaN
1            second example        NaN  a foo
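Since the question says the full list of subheadings is known up front, here is a rough sketch of an alternative that splits on those headings instead of guessing keys and values from the colons. The `headings` list, the helper name `split_by_headings`, and the decision to strip a trailing period from each value are my own assumptions, not something from the question; the match is also case-sensitive, so the list has to spell the headings as they appear in the text.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Text": [
        "1. Description: example description here. 3. BAR: more text",
        "1. Description: second example. 2. FOO: a foo",
    ]
})

# Assumed list of known subheadings, spelled as they appear in the text
headings = ["Description", "FOO", "BAR"]

# Matches "<number>. <heading>:" for any of the known headings
heading_re = re.compile(
    r"\d+\.\s*(" + "|".join(map(re.escape, headings)) + r"):\s*"
)

def split_by_headings(text):
    """Hypothetical helper: map each heading found in `text` to the text after it."""
    # With one capture group, re.split returns
    # [prefix, heading, body, heading, body, ...]
    parts = heading_re.split(text)
    return {
        heading: body.strip().rstrip(".")  # assumption: drop a trailing period
        for heading, body in zip(parts[1::2], parts[2::2])
    }

# One dict per row; headings missing from a row become NaN
extracted = pd.DataFrame(
    [split_by_headings(t) for t in df["Text"]], index=df.index
)
# Make sure every known heading has a column, in a fixed order
extracted = extracted.reindex(columns=headings)

result = df.join(extracted)
print(result)
```

Building the frame from a list of dicts is also one way to use a function like `split_into_dict` from the question: call it per row, collect the dicts, and pass them to `pd.DataFrame`.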
Upvotes: 1