user987654321
user987654321

Reputation: 303

CSV to List of Dictionaries - Better Way?

I'm working on a function that takes in a filename of a CSV and converts each line to a dictionary and then returns a list of the dictionaries created (to be able to iterate through and organize in later functions. I've gotten it to do what I want by doing the following but feel there's got to be a better way. Any suggestions for improvement?

import re

def import_incidents(filename):
    """Imports CSV and returns list of dictionaries for each incident"""
    with open(filename, 'r') as file:
        data = file.read()
        data = data.split('\n')
        list_of_data = []
        headers = True
        for line in data:
            line = line.split('","')
            if headers == True:
                #Skip header and set to false
                headers = False
            elif len(line) == 1 or line[3] == '':
                #File always has a 1 lenth final line, skip it.
                #Events can leave blank policies, skip those too.
                pass
            else:
                temp_dict = {}
                temp_dict['id'] = re.sub('"', '', line[0])
                temp_dict['time'] = re.sub('GMT-0600','',line[1])
                temp_dict['source'] = line[2]
                temp_dict['policy'] = line[3]
                temp_dict['destination'] = line[5]
                temp_dict['status'] = line[10]
                list_of_data.append(temp_dict)

return list_of_data

print(import_incidents('Incidents (Yesterday Only).csv'))

Sample of CSV contents:

"ID","Incident Time","Source","Policies","Channel","Destination","Severity","Action","Maximum Matches","Transaction Size","Status",
"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Doe, John","Encrypted files","HTTPS","blah.blah.com","Medium","Permitted","0","47.7 KB","Closed - Authorized",
"1848446","29 Dec. 2015, 08:23:36 AM GMT-0600","Smith, Joe","","HTTP","google.com","Low","Permitted","0","775 B","Closed"

Upvotes: 1

Views: 4394

Answers (2)

Mike Müller
Mike Müller

Reputation: 85622

Original task: List of dicts

You can use pandas. It is fast and can do it in a few lines;

import pandas as pd
df = pd.read_csv('incidents.csv')
df['Incident Time'] = df['Incident Time'].str.replace('GMT-0600', '')
list_of_data = df.dropna(subset=['Policies']).to_dict(orient='records')

Now list_of_data contains:

[{'Action': 'Permitted',
  'Channel': 'HTTPS',
  'Destination': 'blah.blah.com',
  'ID': 9511564,
  'Incident Time': '29 Dec. 2015, 08:33:59 AM ',
  'Maximum Matches': 0,
  'Policies': 'Encrypted files',
  'Severity': 'Medium',
  'Source': 'Doe, John',
  'Status': 'Closed - Authorized',
  'Transaction Size': '47.7 KB',
  'Unnamed: 11': nan}]

The .dropna(subset='Policies') removes the lines that have NaNs in the column Policies, i.e. missing values.

Pandas-only solution:

If you don't want the list of dicts, keep the dataframe:

df = pd.read_csv('incidents.csv', parse_dates=[1]).dropna(subset=['Policies'])

This reads the Incident Time as very convenient datetime64[ns]object. The dataframe looks like this:

enter image description here

Upvotes: 1

Martijn Pieters
Martijn Pieters

Reputation: 1125398

You have reinvented the csv.DictReader() class, I'm afraid:

import csv

def import_incidents(filename):
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            if not row or not row['Policies']:
                continue
            row['Incident Time'] = re.sub('GMT-0600', '', row['Incident Time'])
            yield row

This relies on the header row for the dictionary keys. You can define your own dictionary keys with the fieldnames argument to DictReader() (the fieldnames field is matched, in order, to the columns in the file), but then the first row in the file is still read like any other row. You can use the next() function to skip rows (see Skip the headers when editing a csv file using Python).

Upvotes: 7

Related Questions