Reputation: 131

Resampling Multiple CSV Files and Automatically Saving Resampled Files with New Names

I've really tried hard now for over a week to solve this problem and I cannot seem to find a solution. Some coders have been excellent in helping but unfortunately no one is yet to provide suggestions that have worked for me. I'm going to try and ask the same question as simply as possible.

I have many (over 100) csv files. All csv files have "Datetime" as their first column. The "Datetime" format is in "YYYY-MM-DD HH:MM:SS". Each file provides rows of data every 15 minutes over an entire month (a lot of rows of data). All the csv files are located across three separate folders, each with the following paths:

"C:\Users\Documents\SummaryData\24Hour"

"C:\Users\Documents\SummaryData\Daytime"

"C:\Users\Documents\SummaryData\Nighttime"

The csv files in the 24 Hour folder span 24 hour time frames. The csv files in the Daytime folder span 06:00 - 18:00 for MM:SS. The csv files in the Nightime folder span 18:00 - 06:00 for MM:SS.

For example, a csv file exists for the month of August in 2015. For this month, in the 24 Hour folder, we have a csv file that provides non-stop 15-minute interval data for the entire month of August in 2015.

For the same month and year, another csv file exists, in the Daytime folder, that provides data just for the times from 06:00 - 18:00. For example, see below a snippet of the file. I chose at random to provide data starting on the 12th day of August.

And further into the month for example:

The same files exist for Nighttime as well but span time throughout the night.

Please note that a lot more columns than the ones shown in the above images exist.

While keeping these original 15-minute interval files, I need to resample all csv files so each one has its own Hourly, Daily and Monthly file. The tricky part, is that I want some of the columns to be summed over the resampling time frame, while other columns need to be averaged over the time frame.

So if I'm resampling the data for the day, I need some columns to average its data over the day, while others sum its data over the day. But nonetheless, I need one daily csv file that has been created from these original 15-minute interval csv files. Across all files though, columns with the same header name need the same resampling (so if column["windspeed"] needs to be averaged over the day, then that will be the same for column["windspeed"] in another csv file).

Another tricky part, is that I also need these files to be exported as csv files (to any output location, say "C:\Users\cp_vm\Documents\Output") and renamed automatically to signify how they have been resampled.

So taking the example of the csv file for August in 2015 that is currently named:

"2015August.csv",

if I resample this file to be hourly, daily and monthly, I want all these new resampled csv files to be saved as:

"2015AugustHourly.csv" and;

"2015AugustDaily.csv" and;

"2015AugustMonthly.csv" respectively.

I know I need to use some form of 'for loop' and I really have tried. But I cannot figure this one out. Any help would be greatly appreciated! And thanks to all those who have already provided advice.

Output example below showing the values being averaged over the hours:

The below example shows some of the additional columns (SR_Gen and SR_All) which are results of summing the 15-minute data over the hours.

Upvotes: 0

Answers (2)

Monte Cristo

Reputation: 131

@NickBraunagel sincere thanks for the time you've put in to this question. I apologise for my tardiness in replying. I was also on vacation and I only just returned. Your code looks very good and is potentially more efficient than my own. I will run it as soon as work quietens down to see if this is the case. However, while I was waiting for a response, I managed to solve the issue myself. I have uploaded the code below.

To avoid writing out each column name and whether to 'mean' or 'sum' the data over the re-sampling time period, I have manually created another excel document that lists the column headers in row 1 and lists "mean" or "sum" below the header (n*columns x 2 rows), then I convert this csv to a dictionary and refer to it in the re-sampling code. See Below.

Also, I import the data already being 24Hour, Daytime and Nightime files, and then re-sample.

import pandas as pd
import glob

#project specific paths - comment (#) all paths not relevant

#read in manually created re-sampling csv file to reference later as a dictionary in the re-sampling code
#the file below consists of n*columns x 2 rows, where row 1 is the column headers and row 2 specifies whether that column is to be averaged ('mean') or summed ('sum') over the re-sampling time period
f =pd.read_csv('C:/Users/cp_vm/Documents/ResampleData/AllData.csv')

#convert manually created resampling csv to dictionary ({'columnname': resample,'columnname2': resample2)}
recordcol = list(f.columns.values)
recordrow = f.iloc[0,:]
how_map=dict(zip(recordcol,recordrow))
what_to_do = dict(zip(f.columns, [how_map[x] for x in recordcol]))

#this is not very efficient, but for the time being, comment (#) all paths not relevant
#meaning run the script multiple times, each time changing the in' and outpaths
#read in datafiles via their specific paths: order - AllData 24Hour, AllData DayTime, AllData NightTime
inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/24Hour/'
outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/24Hour/{0}_{1}_{2}_AllData_24Hour.csv'

#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Daytime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Daytime/{0}_{1}_{2}_AllData_Daytime.csv'

#inpath = r'C:/Users/cp_vm/Documents/Data/Input/AllData/Nighttime/'
#outpath = 'C:/Users/cp_vm/Documents/Data/Output/AllData/Nighttime/{0}_{1}_{2}_AllData_Nighttime.csv'

allFiles = glob.glob(inpath + "/*.csv")

#resample all incoming files to be hourly-h, daily-d, or monthly-m and export with automatic naming of files
for files_ in allFiles:
    #read in all files_
    df = pd.read_csv(files_,index_col = None, parse_dates = ['Datetime'])
    df.index = pd.to_datetime(df.Datetime)
    #change Datetime column to be numeric, so it can be resampled without being removed
    df['Datetime'] = pd.to_numeric(df['Datetime'])
    #specify year and month for automatic naming of files
    year = df.index.year[1]
    month = df.index.month[1]
    #comment (#) irrelevant resamplping, so run it three times, changing h, d and m
    resample = "h"
    #resample = "d"
    #resample = "m"
    #resample df based on the dictionary defined by what_to_do and resample - please note that 'Datetime' has the resampling 'min' associated to it in the manually created re-sampling csv file
    df = df.resample(resample).agg(what_to_do)
    #drop rows where all column values are non existent
    df = df.dropna(how='all')
    #change Datetime column back to datetime.datetime format
    df.Datetime = pd.to_datetime(df.Datetime)
    #make datetime column the index
    df.index = df.Datetime
    #move datetime column to the front of dataframe
    cols = list(df.columns.values)
    cols.pop(cols.index('Datetime'))
    df = df[['Datetime'] + cols]
    #export all files automating their names dependent on their datetime
    #if the dataframe has any rows, then export it
    if df.shape[0] > 0:
        df.to_csv(outpath.format(year,month,resample), index=False)

Upvotes: 0

NickBraunagel

Reputation: 1599

I think you can re-use the code from our previous work (here). Using the original code, when the NIGHT and DAY dataframes are created, you can then resample them on an hourly, daily, and monthly basis and save the new (resampled) dataframes as .csv files wherever you like.

I am going to use a sample dataframe (first 3 rows shown here):

dates               PRp         PRe         Norm_Eff    SR_Gen      SR_All
2016-01-01 00:00:00 0.269389    0.517720    0.858603    8123.746453 8770.560467
2016-01-01 00:15:00 0.283316    0.553203    0.862253    7868.675481 8130.974409
2016-01-01 00:30:00 0.286590    0.693997    0.948463    8106.217144 8314.584848

Full Code

import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
from random import randint
import random
import calendar

# I defined a sample dataframe with dummy data
start = datetime.datetime(2016,1,1,0,0)
r = range(0,10000)

dates = [start + relativedelta(minutes=15*i) for i in r]
PRp = [random.uniform(.2, .3) for i in r]
PRe = [random.uniform(0.5, .7) for i in r]
Norm_Eff = [random.uniform(0.7, 1) for i in r]
SR_Gen = [random.uniform(7500, 8500) for i in r]
SR_All = [random.uniform(8000, 9500) for i in r]

DF = pd.DataFrame({
        'dates': dates,
        'PRp': PRp,
        'PRe': PRe,
        'Norm_Eff': Norm_Eff,
        'SR_Gen': SR_Gen,
        'SR_All': SR_All,
    })



# define when day starts and ends (MUST USE 24 CLOCK)
day = {
        'start': datetime.time(6,0),  # start at 6am (6:00)
        'end': datetime.time(18,0)  # ends at 6pm (18:00)
      }


# capture years that appear in dataframe
min_year = DF.dates.min().year
max_year = DF.dates.max().year

if min_year == max_year:
    yearRange = [min_year]
else:
    yearRange = range(min_year, max_year+1)

# iterate over each year and each month within each year
for year in yearRange:
    for month in range(1,13):

        # filter to show NIGHT and DAY dataframe for given month within given year
        NIGHT = DF[(DF.dates >= datetime.datetime(year, month, 1)) & 
           (DF.dates <= datetime.datetime(year, month, 1) + relativedelta(months=1) - relativedelta(days=1)) & 
           ((DF.dates.apply(lambda x: x.time()) <= day['start']) | (DF.dates.apply(lambda x: x.time()) >= day['end']))]

        DAY = DF[(DF.dates >= datetime.datetime(year, month, 1)) & 
           (DF.dates <= datetime.datetime(year, month, 1) + relativedelta(months=1) - relativedelta(days=1)) & 
           ((DF.dates.apply(lambda x: x.time()) > day['start']) & (DF.dates.apply(lambda x: x.time()) < day['end']))]

        # Create resampled dataframes on Hourly, Daily, Monthly basis
        for resample_freq, freq_tag in zip(['H','D','M'], ['Hourly','Daily','Monthly']):

            NIGHT.index = NIGHT.dates                           # resampled column must be placed in index
            NIGHT_R = pd.DataFrame(data={
                    'PRp': NIGHT.PRp.resample(rule=resample_freq).mean(),            # averaging data
                    'PRe': NIGHT.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': NIGHT.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': NIGHT.SR_Gen.resample(rule=resample_freq).sum(),        # summing data
                    'SR_All': NIGHT.SR_All.resample(rule=resample_freq).sum()  
                })
            NIGHT_R.dropna(inplace=True)  # removes the times during 'day' (which show as NA)

            DAY.index = DAY.dates
            DAY_R = pd.DataFrame(data={
                    'PRp': DAY.PRp.resample(rule=resample_freq).mean(),
                    'PRe': DAY.PRe.resample(rule=resample_freq).mean(),
                    'Norm_Eff': DAY.Norm_Eff.resample(rule=resample_freq).mean(),
                    'SR_Gen': DAY.SR_Gen.resample(rule=resample_freq).sum(),        
                    'SR_All': DAY.SR_All.resample(rule=resample_freq).sum()  
                })
            DAY_R.dropna(inplace=True)  # removes the times during 'night' (which show as NA)

            # save to .csv with date and time in file name
            # specify the save path of your choice
            path_night = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_NIGHT_{2}.csv'.format(year, calendar.month_name[month], freq_tag)
            path_day = 'C:\\Users\\nickb\\Desktop\\stackoverflow\\{0}{1}_DAY_{2}.csv'.format(year, calendar.month_name[month], freq_tag)

            # some of the above NIGHT_R / DAY_R filtering will return no rows.
            # Check for this, and only save if the dataframe contains rows
            if NIGHT_R.shape[0] > 0:
                NIGHT_R.to_csv(path_night, index=True)
            if DAY_R.shape[0] > 0:
                DAY_R.to_csv(path_day, index=True)

The above will result in a total of SIX .csv files per month:

Hourly basis for daytime
Daily basis for daytime
Monthly basis for daytime
Hourly basis for nighttime
Daily basis for nighttime
Monthly basis for nighttime

Each file will have a file name as follows: (Year)(Month_Name)(Day/Night)(frequency). For example: 2016August_NIGHT_Daily

Let me know if the above achieves the goal or not.

Also, here is a list of available resample frequencies you can choose from: pandas resample documentation

Upvotes: 1

Resampling Multiple CSV Files and Automatically Saving Resampled Files with New Names

Answers (2)

Full Code

Related Questions