Tobias

Reputation: 135

Iterating a function overwrites the dataframe each time

I am converting multiple docx files into a single dataframe. The code works for one document and produces the following structure:

data = {'Title': ['title first article', 'title second article'], 'Sources': ['source of first article', 'source of second article']}
df = pd.DataFrame(data=data)

The structure is the result from a function:

import os
import re
import mammoth

def func_convert_updates(filename):
    os.chdir('C:/Users/docxfiles')  # os.chdir returns None, so there is nothing to assign
    regex = '\xc2\xb7'
    with open(filename, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        text = result.value  # The generated HTML
        text2 = re.sub('[|•●]', " ", text)  # replace pipe and bullet characters with spaces
        with open('output.txt', 'w', encoding='utf-8') as text_file:
            text_file.write(text2)

    #followed by many lines of code, omitted here, to create a dataframe

    return df_titles

I want to analyse multiple docx files, so I wrote the following loop:

list_news= ['docx_file_1', 'docx_file_2.docx', ... etc]

for element in list_news:
    df_titles = func_convert_updates(element)

However, this only returns the dataframe of the last element of the list as it overwrites previous output. How can I solve this? Thank you in advance.

Upvotes: 0

Views: 46

Answers (2)

Sunward

Reputation: 1

The actual problem is that every call to your function opens 'output.txt' with the 'w' argument, which overwrites the file if it already exists. You might want to change that to 'a' so each call appends to the file instead:

with open('output.txt', 'a', ...

Also see https://cmdlinetips.com/2012/09/three-ways-to-write-text-to-a-file-in-python/
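
To illustrate the difference between the two modes, here is a minimal sketch. The helper write_chunks and the temporary file names are made up for the demonstration and are not part of the question's code. Note that this only concerns what ends up in output.txt; whether it changes the returned DataFrame depends on the omitted part of the function.

```python
import os
import tempfile


def write_chunks(path, mode, chunks):
    """Hypothetical helper: write each chunk in a separate open() call."""
    for chunk in chunks:
        # A fresh open() per chunk mimics calling the function once per docx file
        with open(path, mode, encoding="utf-8") as text_file:
            text_file.write(chunk + "\n")
    # Read the file back to see what survived
    with open(path, encoding="utf-8") as text_file:
        return text_file.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    overwritten = write_chunks(os.path.join(tmp, "w.txt"), "w", ["first", "second"])
    appended = write_chunks(os.path.join(tmp, "a.txt"), "a", ["first", "second"])

print(overwritten)  # ['second']
print(appended)     # ['first', 'second']
```

With 'w' only the text from the last call remains, while 'a' keeps the text from every call.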

Upvotes: 0

bruno-uy

Reputation: 1855

If you want df_titles to hold the DataFrames created in every iteration of the loop, you can concatenate them like this:

import pandas as pd

df_titles = pd.concat([func_convert_updates(element) for element in list_news], ignore_index=True)
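
The same idea can be written as an explicit loop, which may be easier to follow. This is only a sketch: fake_convert is a hypothetical stand-in for func_convert_updates (so the example runs without any docx files), and the file names are placeholders.

```python
import pandas as pd


def fake_convert(filename):
    """Hypothetical stand-in for func_convert_updates: one row per file."""
    return pd.DataFrame(
        {"Title": [f"title from {filename}"], "Sources": [f"source of {filename}"]}
    )


# Placeholder file names, assumed for illustration only
list_news = ["docx_file_1.docx", "docx_file_2.docx"]

frames = []
for element in list_news:
    # Keep every result instead of rebinding a single variable
    frames.append(fake_convert(element))

# Stack the per-file DataFrames and renumber the rows 0..n-1
df_titles = pd.concat(frames, ignore_index=True)
print(df_titles)
```

The original loop kept only the last result because each iteration rebinds df_titles to a new DataFrame. Collecting the results in a list and calling pd.concat once at the end avoids that, and is generally cheaper than concatenating inside the loop.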

Upvotes: 1
