Muhammad Adeel Ahmed
Muhammad Adeel Ahmed

Reputation: 43

Python - Merge Excel Files with missing column names

Help & valuable advise needed from the experts please.

I have 100s of csv files which i would like to merge but table columns aren't same. For example

I would like to merge the data from all of these csv files.

I thought of a logic but I am struggling to implement it in a python code. I think it requires a Dataframe with pre-defined headers. (Header1, Header2, Header3, Header4). Then i should loop through each csv file to search for that header. If the header exists then the data is appended to the dataframe else skip it.

Can someone please advise if there is a function in Python that can make it simple ? I have tried to use read_csv but the data structure should be the same when looping throw the csv files.

import pandas as pd
import os
import glob

# Location of CSV Files
data_location='C:\\Tableau Reports\\ST Database\\Files\\Downloads\\CSV Files\\*.csv'
# This will give path location to all CSV files
csv_files=glob.glob(data_location)

for csv_file in csv_files:
    # Encoding to be changed (UTF-8).
    with open(csv_file, newline='', encoding='cp1252') as csvfile:
        df_file = pd.read_csv(csvfile)

Combined_df = pd.concat(df_file)

print(Combined_df)

I tried to following advise given in this forum but getting error at line 12 df_file = pd.read_csv(csvfile).

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 114934: character maps to

Any help or advise is greatly appreciated. Thank you

Upvotes: 1

Views: 212

Answers (1)

mitoRibo
mitoRibo

Reputation: 4548

EDIT

The following code you posted seems overly complex. You don't have to use open with pd.read_csv. You can directly pass in a file name as a string

for csv_file in csv_files:
    # Encoding to be changed (UTF-8).
    with open(csv_file, newline='', encoding='cp1252') as csvfile:
        df_file = pd.read_csv(csvfile)

The following should work using a list-comprehension

df = pd.concat(pd.read_csv(csv_file) for csv_file in csv_files)

If you're not a fan of list-comprehensions then you can instead start with an empty df and keep concatenating your small dfs onto it

large_df = pd.DataFrame()
for csv_file in csv_files:
    small_df = pd.read_csv(csv_file)
    large_df = pd.concat((large_df,small_df))

ORIGINAL

pd.concat can already do this for you! I've made example data. f1_r1 means "file1, row1" just to show that it's working as expected

import pandas as pd
import io

#The io.StringIO makes these "pretend" .csv files for illustration
#All of this is just to create the example data, you won't do this
file1_name = io.StringIO("""
Header1,Header2,Header3
f1_r1,f1_r1,f1_r1
f1_r2,f1_r2,f1_r2
""")

file2_name = io.StringIO("""
Header1,Header2
f2_r1,f2_r1
f2_r2,f2_r2
""")

file3_name = io.StringIO("""
Header3,Header4
f3_r1,f3_r1
f3_r2,f3_r2
""")

file4_name = io.StringIO("""
Header1,Header3,Header4
f4_r1,f4_r1,f4_r1
f4_r2,f4_r2,f4_r2
""")

#this would be a list of your .csv file names
files = [file1_name, file2_name, file3_name, file4_name]

#pd.concat already handles this use-case for you!
combined_df = pd.concat(pd.read_csv(f_name,encoding='utf8') for f_name in files)
print(combined_df)

enter image description here

Upvotes: 4

Related Questions