Reputation: 135
I am converting multiple docx files into a single pandas DataFrame. The code works for one document and produces the following structure:
data = {'Title': ['title first article', 'title second article'], 'Sources': ['source of first article', 'source of second article']}
df = pd.DataFrame(data=data)
The structure is the result from a function:
def func_convert_updates(filename):
    path = os.chdir('C:/Users/docxfiles')
    regex = '\xc2\xb7'
    with open(filename, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        text = result.value # The raw text
        text2=re.sub(u'[|•●]', " ", text, count= 0)
    with open('output.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(text2)
    #followed by many lines of code, omitted here, to create a dataframe
    return df_titles
I then want to analyse multiple docx files, so I wrote the following code:
list_news= ['docx_file_1', 'docx_file_2.docx', ... etc]
for element in list_news:
    df_titles = func_convert_updates(element)
However, this only returns the dataframe of the last element of the list as it overwrites previous output. How can I solve this? Thank you in advance.
Upvotes: 0
Views: 46
Reputation: 1
The actual problem is that each time you call your function, you tell open to write to the 'output.txt' file with the 'w' argument, which overwrites the file if it already exists. You might want to change that to 'a' to append to the file instead:
with open('output.txt', 'a', ...
Also see https://cmdlinetips.com/2012/09/three-ways-to-write-text-to-a-file-in-python/
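To make the difference between the two modes concrete, here is a minimal sketch. The helper write_chunks and the temporary file names are made up for illustration and are not part of the question's code; the helper just reopens the same file once per chunk, the way repeated calls to the function would.

```python
import os
import tempfile

def write_chunks(path, chunks, mode):
    """Reopen `path` once per chunk with the given mode, then return the file's lines."""
    for chunk in chunks:
        # Each iteration mimics one call to a function that opens the output file
        with open(path, mode, encoding="utf-8") as text_file:
            text_file.write(chunk + "\n")
    with open(path, encoding="utf-8") as text_file:
        return text_file.read().splitlines()

tmp_dir = tempfile.mkdtemp()

# 'w' truncates the file on every open, so only the last chunk survives
overwritten = write_chunks(os.path.join(tmp_dir, "w.txt"), ["first", "second"], "w")

# 'a' keeps the existing content and adds to the end
appended = write_chunks(os.path.join(tmp_dir, "a.txt"), ["first", "second"], "a")

print(overwritten)
print(appended)
```

Note that with 'a' the file also keeps growing across separate runs of the script, so you may need to delete or truncate it before starting a fresh run.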
Upvotes: 0
Reputation: 1855
If you want to have all the DataFrames you created in each loop in the variable df_titles, you can do something like this:
import pandas as pd
df_titles = pd.concat([func_convert_updates(element) for element in list_news], ignore_index=True)
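If the list comprehension is hard to read, the same idea can be written as an explicit loop: collect each DataFrame in a list and concatenate once at the end. This is only a sketch. fake_convert is a hypothetical stand-in for func_convert_updates so the example runs without any docx files, and the file names are invented.

```python
import pandas as pd

def fake_convert(filename):
    # Hypothetical stand-in for func_convert_updates: returns one row per file
    return pd.DataFrame({
        "Title": [f"title of {filename}"],
        "Sources": [f"source of {filename}"],
    })

list_news = ["a.docx", "b.docx"]  # invented file names

frames = []
for element in list_news:
    # Keep every result instead of reassigning a single variable
    frames.append(fake_convert(element))

# Concatenate once; ignore_index=True renumbers the rows from 0
df_titles = pd.concat(frames, ignore_index=True)
print(df_titles)
```

Building the list first and calling pd.concat once is generally cheaper than concatenating inside the loop, since each concat copies the data.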
Upvotes: 1