Reputation: 63
I'm trying to combine CSV files in a folder to analyze them. Also, I want to append the filename of each as a column so I can figure out which data came from which file. I've looked at the similar questions and none of them have worked for me.
Here's the code I'm using. There are 24 CSV files in this folder and since combining CSV files later would be easy using cat so even a method to tell me how I could append the filename in each file would be perfect. Any help would be great.
import pandas as pd
import os
import glob
import csv
path=r'/home/videept/Downloads/A_DeviceMotion_data/A_DeviceMotion_data/dws_1/'
with open('output.csv', 'w', newline='') as f_output:
csv_output = csv.writer(f_output)
for filename in glob.glob(os.path.join(path,"*.csv")):
with open(filename, newline='') as f_input:
csv_input = csv.reader(f_input)
for row in csv_input:
row.insert(0, filename)
csv_output.writerow(row)
When I'm doing this the cell goes on an infinite loop and no new file is even created. I'm not sure how I can see the progress of what's going on so any idea on that would also be great. Thanks :)
Upvotes: 4
Views: 6703
Reputation: 1918
I would do it this way (provided your are using Python 3.4+):
import pandas as pd
from pathlib import Path
source_files = sorted(Path('path_to_source_directory').glob('*.csv'))
dataframes = []
for file in source_files:
df = pd.read_csv(file) # additional arguments up to your needs
df['source'] = file.name
dataframes.append(df)
df_all = pd.concat(dataframes)
This way, every row has a column represents its source file for easy filtering and analysis.
Upvotes: 6
Reputation: 2859
First, make sure that all the csv files have the same structure. Then make sure that you can read one csv file properly. Then you can do it iteratively:
import pandas as pd
import glob
df_all = pd.DataFrame()
for f in glob.glob("path/to/csv/files/prefix_*.csv"):
df = pd.read_csv(f) # make sure to apply correct settings (sep, parse_dates, headers, missing_values)
df["origin"] = f #add a column with a csv name
df_all = df_all.append(df) #append new df to the "master" dataframe
df_all.to_csv("merged.csv")
UPDATE: If you are afraid all the data wouldn't fit in your memory, take a look at the Dask library.
Upvotes: 1