Oleg Kazanskyi
Oleg Kazanskyi

Reputation: 306

Python. Append CSV files in a folder into one big file

I am a little confused with a Pandas library and would really appreciate your help.

The task is to combine all *.csv files in a folder into one big file. CSV files don't have a header, so I just want to append all of them and add a header in the end.

Here is the code I use. The final file is "ALBERTA GENERAL", in the beginning I delete the old one before creating an updated version.

os.chdir(dataFolder)
with io.open("ALBERTA GENERAL.csv", "w+", encoding='utf8') as f:
os.remove("ALBERTA GENERAL.csv")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

combined_csv = pd.concat([pd.read_csv(f, error_bad_lines=False) for f in all_filenames], axis=0, ignore_index = True)
print(combined_csv)

with io.open('ALBERTA GENERAL.csv', "w+", encoding='utf8') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames=["Brand, Name, Strain, Genre, Product type, Date"],delimiter=";")
    writer.writeheader()

    combined_csv.to_csv(outcsv, index=False, encoding='utf-8-sig')

But I get a confusing result that I don't know how to fix. The final file doesn't append intermediate files one below another, but it adds columns for the next file. I tried to add the same headers to the intermediate files but it did not help.

Other than that the header is not split by columns and is recognized as one line.

Can anyone help me to fix my code, please?

Here is the link to the files

Upvotes: 1

Views: 266

Answers (1)

mechanical_meat
mechanical_meat

Reputation: 169284

Just to fix the irregularities of the first file:

with open('ALBERTA GENERAL.csv','r') as f_in, open('ALBERTA GENERAL_fixed.csv','w') as f_out: 
    for line in f_in: 
        line = line.replace(',',';') 
        line = line.strip().rstrip(';') 
        line = line.strip().lstrip(';') 
        f_out.write(line + '\n') 

os.remove('ALBERTA_GENERAL.csv')

We will import the first file separately because it has different requirements than the others:

df1 = pd.read_csv('ALBERTA GENERAL_fixed.csv',header=0,sep=';')

We can then do the other two:

df2 = pd.read_csv('file_ALBERTA_05.14.2020.csv',header=None,sep=';')
df3 = pd.read_csv('file_ALBERTA_05.18.2020.csv',header=None,sep=';')

df2.columns = df1.columns
df3.columns = df1.columns

Final steps:

combined = pd.concat([df1,df2,df3])
combined.to_csv('out.csv',index=False)

Upvotes: 1

Related Questions