Iterating a function overwrites the dataframe each time

Question

I am converting multiple .docx files into a pandas DataFrame. The code works for a single document and produces the following structure:

data = {'Title': ['title first article', 'title second article'], 'Sources': ['source of first article', 'source of second article']}
df = pd.DataFrame(data=data)

The structure is the result from a function:

import os
import re

import mammoth
import pandas as pd

def func_convert_updates(filename):
    os.chdir('C:/Users/docxfiles')  # os.chdir() returns None, so its result is not stored
    regex = '\xc2\xb7'
    with open(filename, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        text = result.value  # the generated HTML string
        text2 = re.sub('[|•●]', " ", text)  # replace pipe and bullet characters with spaces
        with open('output.txt', 'w', encoding='utf-8') as text_file:
            text_file.write(text2)

    #followed by many lines of code, omitted here, to create a dataframe

    return df_titles

To process multiple .docx files, I wrote the following loop:

list_news = ['docx_file_1', 'docx_file_2.docx', ... etc]

for element in list_news:
    df_titles = func_convert_updates(element)

However, after the loop df_titles holds only the DataFrame for the last element of the list, because each iteration overwrites the previous output. How can I keep the results from all files? Thank you in advance.

bruno-uy · Accepted Answer

The loop rebinds df_titles on every iteration, so only the last result survives. To combine the DataFrames from all iterations into df_titles, build them in a list and concatenate them:

import pandas as pd

df_titles = pd.concat([func_convert_updates(element) for element in list_news], ignore_index=True)
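The same idea can be written as an explicit loop, which may be easier to follow or extend. This is only a minimal sketch: the func_convert_updates below is a hypothetical stand-in that builds a one-row DataFrame per filename rather than parsing a real .docx file, and the file names are placeholders.

```python
import pandas as pd


def func_convert_updates(filename):
    # Hypothetical stand-in for the function in the question: it returns a
    # small DataFrame per file instead of converting a real .docx document.
    return pd.DataFrame({
        "Title": [f"title from {filename}"],
        "Sources": [f"source from {filename}"],
    })


list_news = ["docx_file_1.docx", "docx_file_2.docx"]  # placeholder file names

frames = []  # collect one DataFrame per file instead of overwriting a variable
for element in list_news:
    frames.append(func_convert_updates(element))

# Stack all collected DataFrames; ignore_index=True renumbers rows 0..n-1.
df_titles = pd.concat(frames, ignore_index=True)
print(df_titles)
```

Collecting the frames in a list and calling pd.concat once at the end is generally preferable to concatenating inside the loop, since each concat call copies the accumulated data.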
