Reputation: 303
I'm working on a function that takes the filename of a CSV, converts each line to a dictionary, and returns a list of those dictionaries (so later functions can iterate through and organize them). I've gotten it to do what I want with the following, but I feel there has to be a better way. Any suggestions for improvement?
import re

def import_incidents(filename):
    """Imports CSV and returns a list of dictionaries, one per incident."""
    with open(filename, 'r') as file:
        data = file.read()
        data = data.split('\n')
        list_of_data = []
        headers = True
        for line in data:
            line = line.split('","')
            if headers == True:
                # Skip header and set to False
                headers = False
            elif len(line) == 1 or line[3] == '':
                # File always has a length-1 final line; skip it.
                # Events can have blank policies; skip those too.
                pass
            else:
                temp_dict = {}
                temp_dict['id'] = re.sub('"', '', line[0])
                temp_dict['time'] = re.sub('GMT-0600', '', line[1])
                temp_dict['source'] = line[2]
                temp_dict['policy'] = line[3]
                temp_dict['destination'] = line[5]
                temp_dict['status'] = line[10]
                list_of_data.append(temp_dict)
        return list_of_data

print(import_incidents('Incidents (Yesterday Only).csv'))
Sample of CSV contents:
"ID","Incident Time","Source","Policies","Channel","Destination","Severity","Action","Maximum Matches","Transaction Size","Status",
"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Doe, John","Encrypted files","HTTPS","blah.blah.com","Medium","Permitted","0","47.7 KB","Closed - Authorized",
"1848446","29 Dec. 2015, 08:23:36 AM GMT-0600","Smith, Joe","","HTTP","google.com","Low","Permitted","0","775 B","Closed"
Upvotes: 1
Views: 4394
Reputation: 85622
You can use pandas. It is fast and can do it in a few lines:
import pandas as pd
df = pd.read_csv('incidents.csv')
df['Incident Time'] = df['Incident Time'].str.replace('GMT-0600', '')
list_of_data = df.dropna(subset=['Policies']).to_dict(orient='records')
Now list_of_data contains:
[{'Action': 'Permitted',
'Channel': 'HTTPS',
'Destination': 'blah.blah.com',
'ID': 9511564,
'Incident Time': '29 Dec. 2015, 08:33:59 AM ',
'Maximum Matches': 0,
'Policies': 'Encrypted files',
'Severity': 'Medium',
'Source': 'Doe, John',
'Status': 'Closed - Authorized',
'Transaction Size': '47.7 KB',
'Unnamed: 11': nan}]
The .dropna(subset=['Policies']) removes the lines that have NaNs in the column Policies, i.e. missing values.
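The same behavior can be seen in isolation — a minimal sketch with made-up values standing in for the loaded CSV:

```python
import numpy as np
import pandas as pd

# Two hypothetical incidents; the second has no policy, which read_csv
# would load as NaN just like the empty "" field in the sample CSV.
df = pd.DataFrame({'ID': [9511564, 1848446],
                   'Policies': ['Encrypted files', np.nan]})

# dropna(subset=['Policies']) keeps only rows with a non-missing policy.
kept = df.dropna(subset=['Policies'])
```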
If you don't want the list of dicts, keep the dataframe:
df = pd.read_csv('incidents.csv', parse_dates=[1]).dropna(subset=['Policies'])
This reads the Incident Time as a very convenient datetime64[ns] object.
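Keeping the dataframe also makes time-based work straightforward. A sketch of that idea, using an inline copy of a few columns from the sample CSV (the strptime format string is an assumption about the timestamp layout):

```python
import io
import pandas as pd

# Inline stand-in for the CSV file, trimmed to three columns.
csv_text = (
    '"ID","Incident Time","Policies"\n'
    '"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Encrypted files"\n'
    '"1848446","29 Dec. 2015, 08:23:36 AM GMT-0600",""\n'
)

df = pd.read_csv(io.StringIO(csv_text)).dropna(subset=['Policies'])

# Strip the fixed timezone suffix, then parse into datetime64[ns].
df['Incident Time'] = pd.to_datetime(
    df['Incident Time'].str.replace('GMT-0600', '', regex=False).str.strip(),
    format='%d %b. %Y, %I:%M:%S %p',
)

# Datetime columns support comparisons and the .dt accessor directly.
morning = df[df['Incident Time'].dt.hour < 12]
```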
Upvotes: 1
Reputation: 1125398
You have reinvented the csv.DictReader() class, I'm afraid:
import csv
import re

def import_incidents(filename):
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            if not row or not row['Policies']:
                continue
            row['Incident Time'] = re.sub('GMT-0600', '', row['Incident Time'])
            yield row
This relies on the header row for the dictionary keys. You can define your own dictionary keys with the fieldnames argument to DictReader() (the fieldnames sequence is matched, in order, to the columns in the file), but then the first row in the file is still read like any other row. You can use the next() function to skip rows (see Skip the headers when editing a csv file using Python).
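That pattern looks like this — a short sketch where fieldnames supplies custom keys and next() consumes the header row (the inline sample and key names are made up for illustration):

```python
import csv
import io

# Inline stand-in for the CSV file, trimmed to three columns.
sample = (
    '"ID","Incident Time","Source"\n'
    '"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Doe, John"\n'
)

file = io.StringIO(sample)
reader = csv.DictReader(file, fieldnames=['id', 'time', 'source'])
next(reader)  # skip the header row; it would otherwise be read as data
rows = list(reader)
# Each row is now keyed by our own names instead of the header text.
```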
Upvotes: 7