Mike
Mike

Reputation: 507

manipulating TSV file

I have a .tsv file of text data, linked here: world bank data

I need to read the data into a data structure (any data structure), keep only rows with the date "7/1/2000" or 7/1/2010", and keep only 7 of the original 19 columns. I'm pretty new to programming, but I thought this would be a fairly straightforward task. I have code here that is incomplete:

import math
import csv
import re


fhand=open("world_bank_indicators.txt", "rU")
reader=csv.reader(fhand, dialect="excel", delimiter="\t",skipinitialspace=True)

data=dict()
mylist=list()

#doesn't print headers, however
for row in reader:
    if row[1]=="7/1/2000" or row[1]=="7/1/2010":
        print row[0], row[1], row[9], row[4], row[6], row[5], row[19]

UPDATE, version using dictreader:

import math
import csv
import re

fhand=open("world_bank_indicators.txt", "rU")
reader=csv.reader(fhand, dialect="excel", delimiter="\t", lineterminator="\n")

reader=csv.DictReader(fhand, delimiter="\t")
myset=set()
mydict=dict()
mylist=list()

for row in reader:
    mydict["Date"]=row["Date"]

print mydict

however, this only seems to write one entry {date:7/1/2010} to my dict

so, my obvious issues are:

1) this does print the rows and columns of data I want, however, it doesn't print the headers of each column 2) obviously, nothing has been added to any data structure yet in order for me to perform any calculations with the data

what (painfully obvious, I'm sure) thing am I missing here? How can I accomplish these two simple tasks?

Upvotes: 1

Views: 5238

Answers (1)

abarnert
abarnert

Reputation: 365767

First, to print the headers, all you have to do is print the first row's columns out, the same way you do for all the 7/1/2000 row's columns. For example:

headers = next(reader)
print row[0], row[1], row[9], row[4], row[6], row[5], row[19]

for row in reader:
    if row[1]=="7/1/2000" or row[1]=="7/1/2010":
        print row[0], row[1], row[9], row[4], row[6], row[5], row[19]

This isn't actually generating a TSV, however; you're just separating the columns by a single space. Plus, of course, you're just printing them to stdout, not writing them to a file. The easiest solution is to use a csv.writer the same way you used the csv.reader:

writer=csv.writer(outfile, dialect="excel", delimiter="\t",skipinitialspace=True)
# …
writer.writerow(row[0], row[1], row[9], row[4], row[6], row[5], row[19])

To make this a little cleaner, and avoid repeating that column list twice, you may want to use operator.itemgetter, like this:

columns = 0, 1, 9, 4, 6, 5, 19
getter = operator.itemgetter(*columns)
# ...
writer.writerow(getter(row))

If you want to store the values rather than write them out, just pass them to the append method of a list object, instead of the writerow method of a csv.writer object. (You probably also want to keep headers around as well.)

headers = getter(next(reader))

data = []
for row in reader:
    if row[1]=="7/1/2000" or row[1]=="7/1/2010":
        data.append(getter(row))

You can make this more concise with a comprehension:

data = [getter(row) for row in reader if row[1]=="7/1/2000" or row[1]=="7/1/2010"]

If you want to use a DictReader instead, you won't get the columns by index anymore, you'll get them by name. Also, a DictReader automatically consumes the header row, to figure out those names. Here's a simplified example. Imagine this was your data:

name,email,rank
Joe,[email protected],7
Jim,[email protected],12
Jen,[email protected],2

And let's say we just wanted the email of people in the top 10. So, here it is with reader:

reader = csv.reader(f)
headers = next(reader)
data = [row[1] for row in reader if int(row[2]) < 10]

And here it is with DictReader:

reader = csv.DictReader(f)
data = [row['email'] for row in reader if int(row['rank']) < 10]

It may not be quite as concise, but it's probably more readable.

Upvotes: 1

Related Questions