Reputation: 113
I have a number of *.csv files I am pulling together for a data analsyis.
import csv
import glob
import os
import pandas as pd
### Tells python where to look for the *.csv files we want to combine.
mycsvdir1 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\LWIR'
mycsvdir2 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\Manta01'
mycsvdir3 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\SWIR'
mycsvdir4 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\LWIR2'
mycsvdir5 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\Manta012'
mycsvdir6 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\SWIR2'
#### Creates lists of all *.csv files to be combined
thelist = glob.glob(os.path.join(mycsvdir1,'*.csv')) + \
glob.glob(os.path.join(mycsvdir2,'*.csv')) + \
glob.glob(os.path.join(mycsvdir3,'*.csv')) + \
glob.glob(os.path.join(mycsvdir4,'*.csv')) + \
glob.glob(os.path.join(mycsvdir5,'*.csv')) + \
glob.glob(os.path.join(mycsvdir6,'*.csv'))
#### Reads each *.csv file with a standard header row for each dataframe
#### so they can be concatenated later
dataframe = []
for csvfile in thelist:
df = pd.read_csv(csvfile,names=['a','b','c','d','e',\
'f','g','h','i','j',\
'k','l','m','n','o',\
'p','q','r','s'], header=0)
dataframe.append(df)
#### Takes the individual dataframes and concatenates them into one large *.csv
combined = pd.concat(dataframe, ignore_index = True)
combined.to_csv('combined.csv', index = False)
This is working just as intended, but I need to be able to track where each row comes from. In the individual *.csv files, each row contains either 1, 2, 3, or 4 in the first column, but I would like to append an L, M, or H to the first column, depending upon which subdirectory the *.csv file came from. Thus in the combined file, each data row would have one of L1, L2, L3, L4, M1, M2, M3, M4, H1, H2, H3, or H4 in the first column.
What I have done in the past is separate the read command by subdirectory, and edited accordingly. Is there a way to do it on the fly with my combined read command or is separating the read commands the best strategy?
EDIT:
Here is what I have now based on the first answer:
import csv
import glob
import os
import pandas as pd
### Tells python where to look for the *.csv files we want to combine.
mycsvdir1 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\LWIR'
mycsvdir2 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\Manta01'
mycsvdir3 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\SWIR'
mycsvdir4 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\LWIR2'
mycsvdir5 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\Manta012'
mycsvdir6 = 'C:\\Users\\RDEL1LCH\Documents\QuadcamROI\SWIR2'
alldirs = pd.DataFrame({
'letter': ['L', 'M', 'H','L', 'M', 'H'], # duplicates are OK
'csv': [glob.glob(os.path.join(d, '*.csv')) for d in [mycsvdir1, \
mycsvdir2, mycsvdir3, mycsvdir4, mycsvdir5, mycsvdir6]]
})
# build the list of letters and CSV files
letters = np.repeat(alldirs['letter'], alldirs['csv'].apply(len))
thelist = np.concatenate(alldirs['csv'])
### Reads each *.csv file with a standard header row for each dataframe
### so they can be concatenated later
dataframe = []
for letter, csvfile in pd.Series(thelist,letters).iteritems():
df = pd.read_csv(csvfile,names=['a','b','c','d','e',\
'f','g','h','i','j',\
'k','l','m','n','o',\
'p','q','r','s'], header=0)
dataframe.append(df)
### Concatenates dataframes into one large *.csv
combined = pd.concat(dataframe, ignore_index = True)
combined.to_csv('combined.csv', index = False)
But the output hasn't changed. The first column in each row still shows either a 1,2,3, or 4. I think the problem is in my pd.read_csv call, but I'm not sure how to fix it.
Upvotes: 1
Views: 84
Reputation: 93161
You use the DataFrame itself to perform the letter-CSV file mapping:
alldirs = pd.DataFrame({
'letter': ['L', 'M', 'L'], # duplicates are OK
'csv': [glob.glob(os.path.join(d, '*.csv')) for d in [mycsvdir1, mycsvdir2, mycsvdir3]]
})
# build the list of letters and CSV files
letters = np.repeat(alldirs['letter'], alldirs['csv'].apply(len))
thelist = np.concatenate(alldirs['csv'])
# read each CSV file
for letter, csvfile in pd.Series(thelist, letters).iteritems():
df = pd.read_csv(...)
df['a'] = letter + df['a').str
# if pandas report an error: cannot add string and int together, use
# df['a'] = letter + df['a'].astype(str).str
Upvotes: 1