Chris Schmitz

Reputation: 658

Split column of strings by list of possible substrings

I have a column of text that contains subheadings, such as '1. DESCRIPTION', '2. FOO', etc. I have all the possible subheadings in a list, but not every entry in the column contains every subheading. I want to add a column to the df for each possible subheading and fill it with the text that follows that subheading.

A minimal example:

Text
'1. Description: example description here. 3. BAR: more text'
'1. Description: second example. 2. FOO: a foo'

should become

Description              | Foo   | Bar
example description here |       | more text
second example           | a foo |

I've tried making a function that converts a string and list of possible subheadings into a dictionary, with the idea of .apply()-ing it to the df. It works, but is not neat:

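def split_into_dict(input_string, separators):
    # Keep only the separators that occur in this string (in list order)
    seps_in_string = [sep for sep in separators if sep in input_string]

    split_text = []
    for sep in seps_in_string:
        # The text before this separator belongs to the previous one
        text_part, input_string = input_string.split(sep, 1)
        split_text.append(text_part)
    # Whatever is left belongs to the last separator
    split_text.append(input_string)

    # Drop the text before the first separator and pair up the rest
    return dict(zip(seps_in_string, split_text[1:]))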

I'm not sure this is a good idea in general, and I'm also struggling with how to use this function to create the new columns.

Upvotes: 0

Views: 59

Answers (1)

AlexisG

Reputation: 2484

Here is a solution using regular expressions:

import re

import pandas as pd

df = pd.DataFrame([
    {"Text": '1. Description: example description here. 3. BAR: more text'},
    {"Text": '1. Description: second example. 2. FOO: a foo'}
])

# regex to capture the column names (the letters right before a colon)
reg_key = re.compile(r"([A-Za-z]*):")

# regex to capture the values (letters, digits and spaces after ": ")
reg_value = re.compile(r": ([A-Za-z0-9 ]*)")

rows = []
for txt in df['Text']:
    # Find keys and values
    keys = reg_key.findall(txt)
    values = reg_value.findall(txt)

    # Pair each key with its value; subheadings missing from a row become NaN
    rows.append(dict(zip(keys, values)))

output = pd.DataFrame(rows, index=df.index)

Output:

                Description        BAR    FOO
0  example description here  more text    NaN
1            second example        NaN  a foo
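
Since the question already has the list of possible subheadings, a variant could build the pattern from that list instead of guessing the keys from the colons. The following is only a minimal sketch under a few assumptions: the helper name split_by_subheadings and the subheadings list are made up for illustration, each subheading is assumed to appear in the text exactly as written in the list, and surrounding spaces, colons and periods are stripped from the values.

```python
import re

import pandas as pd


def split_by_subheadings(text, subheadings):
    """Hypothetical helper: map each subheading found in `text` to the text after it."""
    # Alternation of the escaped subheadings; the capture group makes
    # re.split keep the matched subheadings in the result list.
    pattern = "(" + "|".join(re.escape(s) for s in subheadings) + ")"
    parts = re.split(pattern, text)
    # parts looks like [preamble, heading, body, heading, body, ...]
    headings = parts[1::2]
    bodies = parts[2::2]
    # Assumption: leading/trailing spaces, colons and periods are not wanted
    return {h: b.strip(" :.") for h, b in zip(headings, bodies)}


df = pd.DataFrame({"Text": [
    "1. Description: example description here. 3. BAR: more text",
    "1. Description: second example. 2. FOO: a foo",
]})

# Assumed list of all possible subheadings, spelled as they occur in the text
subheadings = ["1. Description", "2. FOO", "3. BAR"]

records = [split_by_subheadings(t, subheadings) for t in df["Text"]]

# reindex guarantees one column per possible subheading, even if a
# subheading never occurs in the data; missing values become NaN
output = pd.DataFrame(records, index=df.index).reindex(columns=subheadings)
```

The result can be attached to the original frame with df.join(output). Note that a subheading that is a substring of another one (for example '1. FOO' inside '11. FOO') would need extra care, which this sketch does not handle.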

Upvotes: 1
