Hamish McBrearty
Hamish McBrearty

Reputation: 179

Search multiple CSV files for specific row and save specific columns along with date information

I have a directory full of CSV file which contains information about the resources used by virtual machines. These files are named 'VM Resources - 2018-06-01 070001.csv' etc and contain ~400 lines.

I'm currently only interested in extracting the entry for one server (Let's call it MAILSERVER01) and the columns VMName (index 0), Total Storage (GB) (index 8) and the date from the file name.

I can use something like:

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))

But this doesn't give me the date field that I need. I'm hoping to eventually be able to plug this information into a time series to predict when a server (all 400) is likely to require a disk extension but in the interim need to create the time series data.

I am trying to get an output CSV looking like:

VMName,Total Storage (GB),Date
MAILSERVER01,600,2018-06-01
MAILSERVER01,600,2018-06-02
MAILSERVER01,610,2018-06-03

Upvotes: 0

Views: 45

Answers (1)

Martin Evans
Martin Evans

Reputation: 46779

This could be done using the built in CSV library as follows:

from operator import itemgetter
import csv
import glob                        

req_fields = itemgetter(0, 8)       # Used to extract the two required fields from a list
header = ["VMName", "Total Storage (GB)", "Date"]

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)

    for filename in glob.glob(r'VM Resources*.csv'):
        date = filename.split()[3]  # split on spaces and take the 4th part

        with open(filename, newline='') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)    # skip the header

            for row in csv_input:
                output_row = list(req_fields(row)) + [date]
                csv_output.writerow(output_row)

Giving you an output.csv file that would look something like:

VMName,Total Storage (GB),Date
MAILSERVER01,600,2018-06-01
MAILSERVER01,600,2018-06-02
MAILSERVER01,610,2018-06-03

Tested on Python 3.6.7

Upvotes: 1

Related Questions