daphshez

Reputation: 9638

Reading both raw lines and dictionaries from a CSV file in Python

My scenario: I am reading a csv file. I want to have access to both a dictionary of the fields generated by each line, and the raw, un-parsed line.

The goal is ultimately to do some processing on the fields, use the result to decide which lines I am interested in, and write those lines only into an output file.

An easy solution, which involves reading the file twice, looks something like this:

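from csv import DictReader

def dict_and_row(filename):
    # First pass: parse every record into a dict.
    with open(filename) as f:
        tmp = [row for row in DictReader(f)]

    # Second pass: read the raw lines and pair each one with its dict.
    with open(filename) as f:
        next(f)    # skip header
        for i, line in enumerate(f):
            if len(line.strip()) > 0:
                yield line.strip(), tmp[i]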

Any better suggestions?

Edit: to be more specific about the usage scenario, I intend to index the lines by some of the data in the dict, and then use this index to find the lines I am interested in. Something like:

d = {}
for raw, parsed in dict_and_row(somefile):
    d[(parsed["SOMEFIELD"], parsed["ANOTHERFIELD"])] = raw

and then, later on:

for pair in some_other_source_of_pairs:
    if pair in d:
        output.write(d[pair])

Upvotes: 4

Views: 2781

Answers (3)

daphshez

Reputation: 9638

I ended up wrapping the file in an object that saves the last line read, and then handing this object to the DictReader.

class FileWrapper:
  def __init__(self, f):
    self.f = f
    self.last_line = None

  def __iter__(self):
    return self

  def __next__(self):
    self.last_line = next(self.f)
    return self.last_line

This can then be used like this:

import csv

f = FileWrapper(file_object)   # file_object is an already opened file
for row in csv.DictReader(f):
    print(row)           # that's the dict
    print(f.last_line)   # that's the raw line

Alternatively, dict_and_row can be implemented on top of it:

from csv import DictReader

def dict_and_row(filename):
    with open(filename) as f:
        wrapper = FileWrapper(f)
        reader = DictReader(wrapper)
        for row in reader:
            yield row, wrapper.last_line

The wrapper can also be extended to expose other information, such as the number of characters read so far.
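
As a rough sketch of that idea, the wrapper could keep running totals alongside the last line. The class name `CountingFileWrapper` and the attributes `chars_read` and `lines_read` are made up for illustration; nothing here is part of the `csv` module, and an in-memory `io.StringIO` stands in for a real file.

```python
import csv
import io


class CountingFileWrapper:
    """Hypothetical variant of FileWrapper that also tracks totals."""

    def __init__(self, f):
        self.f = f
        self.last_line = None
        self.chars_read = 0   # characters handed to the reader so far
        self.lines_read = 0   # physical lines handed to the reader so far

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.f)   # StopIteration propagates at end of file
        self.last_line = line
        self.chars_read += len(line)
        self.lines_read += 1
        return line


sample = io.StringIO("a,b\n1,2\n3,4\n")   # stand-in for a real file
wrapper = CountingFileWrapper(sample)
rows = list(csv.DictReader(wrapper))
print(wrapper.chars_read, wrapper.lines_read)
```

One caveat: when a quoted field contains a line break, the csv reader pulls several physical lines for a single record, so `last_line` only holds the final one of them.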

I'm not sure that's the best solution, but it does have the advantage of retaining access to the strings exactly as they were read from the file.
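
For the indexing scenario from the question, the pieces might fit together roughly as follows. This is only a sketch: `index_raw_lines` is a hypothetical helper, the sample data is invented, and `io.StringIO` objects stand in for the real input and output files. The wrapper class is repeated so the snippet runs on its own.

```python
import csv
import io


class FileWrapper:
    """Same wrapper as above, repeated so the sketch is self-contained."""

    def __init__(self, f):
        self.f = f
        self.last_line = None

    def __iter__(self):
        return self

    def __next__(self):
        self.last_line = next(self.f)
        return self.last_line


def index_raw_lines(f, key_fields):
    """Map a tuple of field values to the raw line it was parsed from."""
    wrapper = FileWrapper(f)
    index = {}
    for row in csv.DictReader(wrapper):
        key = tuple(row[name] for name in key_fields)
        index[key] = wrapper.last_line   # raw text, newline included
    return index


source = io.StringIO(
    "SOMEFIELD,ANOTHERFIELD,VALUE\n"
    "x,1,first\n"
    "y,2,second\n"
)
index = index_raw_lines(source, ["SOMEFIELD", "ANOTHERFIELD"])

output = io.StringIO()
for pair in [("y", "2"), ("z", "9")]:   # values are strings, as csv returns them
    if pair in index:
        output.write(index[pair])
```

Because the raw lines keep their trailing newline, they can be written to the output unchanged.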

Upvotes: 8

Deacon

Reputation: 3803

This is similar to something I had to do at one point: I needed to put rows of properly formatted CSV data into a list, manipulate them, and then save them. I used io.StringIO() as an in-memory buffer for the csv writer, then split the buffer's contents into a list of lines and passed that back. Without your data I can't be 100% certain, but this should work. Note that, rather than reading the file twice, I read it once and then write the relevant rows back out in CSV format.

import csv
from io import StringIO

def dict_and_row(filename):
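    # Must list every column in the input; DictWriter raises ValueError on unknown keys.
    field_names = ['a', 'b']  # Your field names here.
    # newline='' lets the csv module handle line endings itself.
    output = StringIO(newline='')
    with open(filename, 'r', newline='') as f: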
        writer = csv.DictWriter(output, fieldnames=field_names)
        reader = csv.DictReader(f)

        writer.writeheader()  # If you want to return the header.
        for line in reader:
            if True:  # Do your processing here...
                writer.writerow(line)

    data = [line.strip() for line in output.getvalue().splitlines()]

    for line in data:
        yield line
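
To pair each dict with its own line, the same buffer trick can be applied one row at a time. The helper below is a minimal sketch with a made-up name (`row_to_csv_line`) and invented sample data. It also shows a limitation of this approach: the line is regenerated by the csv writer, so it need not match the original text character for character.

```python
import csv
from io import StringIO


def row_to_csv_line(row, field_names):
    """Serialize one dict to a single CSV line (hypothetical helper)."""
    buffer = StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator='\n')
    writer.writerow(row)
    return buffer.getvalue()


raw = 'a,b\n"1",two words\n'   # the first field is quoted in the input
reader = csv.DictReader(StringIO(raw))
lines = [row_to_csv_line(row, reader.fieldnames) for row in reader]
print(lines)
```

Here the unnecessary quotes around the first field are dropped on the way back out, because the writer only quotes fields that need it by default.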

Upvotes: 1

ComputerFellow

Reputation: 12108

You could use pandas, which is an excellent library for this kind of processing:

import pandas as pd

# read the csv file
data = pd.read_csv('data.csv')

# do some calculation on a column and store it in another column
data['column2'] = data['column1'] * 2

# If you decide that you need only a particular set of rows
# that match some condition of yours
data = data[data['column2'] > 100]

# store only particular columns back
# (index=False leaves out the row-index column that to_csv adds by default)
cols = ['column1', 'column2', 'column3']
data[cols].to_csv('data_edited.csv', index=False)

Upvotes: 4
