Reputation: 507
I have a .tsv file of text data, linked here: world bank data
I need to read the data into a data structure (any data structure), keep only rows with the date "7/1/2000" or 7/1/2010", and keep only 7 of the original 19 columns. I'm pretty new to programming, but I thought this would be a fairly straightforward task. I have code here that is incomplete:
import math
import csv
import re
fhand=open("world_bank_indicators.txt", "rU")
reader=csv.reader(fhand, dialect="excel", delimiter="\t",skipinitialspace=True)
data=dict()
mylist=list()
#doesn't print headers, however
for row in reader:
if row[1]=="7/1/2000" or row[1]=="7/1/2010":
print row[0], row[1], row[9], row[4], row[6], row[5], row[19]
UPDATE, version using dictreader:
import math
import csv
import re
fhand=open("world_bank_indicators.txt", "rU")
reader=csv.reader(fhand, dialect="excel", delimiter="\t", lineterminator="\n")
reader=csv.DictReader(fhand, delimiter="\t")
myset=set()
mydict=dict()
mylist=list()
for row in reader:
mydict["Date"]=row["Date"]
print mydict
however, this only seems to write one entry {date:7/1/2010} to my dict
so, my obvious issues are:
1) this does print the rows and columns of data I want, however, it doesn't print the headers of each column 2) obviously, nothing has been added to any data structure yet in order for me to perform any calculations with the data
what (painfully obvious, I'm sure) thing am I missing here? How can I accomplish these two simple tasks?
Upvotes: 1
Views: 5238
Reputation: 365767
First, to print the headers, all you have to do is print the first row's columns out, the same way you do for all the 7/1/2000
row's columns. For example:
headers = next(reader)
print row[0], row[1], row[9], row[4], row[6], row[5], row[19]
for row in reader:
if row[1]=="7/1/2000" or row[1]=="7/1/2010":
print row[0], row[1], row[9], row[4], row[6], row[5], row[19]
This isn't actually generating a TSV, however; you're just separating the columns by a single space. Plus, of course, you're just print
ing them to stdout, not writing them to a file. The easiest solution is to use a csv.writer
the same way you used the csv.reader
:
writer=csv.writer(outfile, dialect="excel", delimiter="\t",skipinitialspace=True)
# …
writer.writerow(row[0], row[1], row[9], row[4], row[6], row[5], row[19])
To make this a little cleaner, and avoid repeating that column list twice, you may want to use operator.itemgetter
, like this:
columns = 0, 1, 9, 4, 6, 5, 19
getter = operator.itemgetter(*columns)
# ...
writer.writerow(getter(row))
If you want to store the values rather than write them out, just pass them to the append
method of a list
object, instead of the writerow
method of a csv.writer
object. (You probably also want to keep headers
around as well.)
headers = getter(next(reader))
data = []
for row in reader:
if row[1]=="7/1/2000" or row[1]=="7/1/2010":
data.append(getter(row))
You can make this more concise with a comprehension:
data = [getter(row) for row in reader if row[1]=="7/1/2000" or row[1]=="7/1/2010"]
If you want to use a DictReader
instead, you won't get the columns by index anymore, you'll get them by name. Also, a DictReader
automatically consumes the header row, to figure out those names. Here's a simplified example. Imagine this was your data:
name,email,rank
Joe,[email protected],7
Jim,[email protected],12
Jen,[email protected],2
And let's say we just wanted the email of people in the top 10. So, here it is with reader
:
reader = csv.reader(f)
headers = next(reader)
data = [row[1] for row in reader if int(row[2]) < 10]
And here it is with DictReader
:
reader = csv.DictReader(f)
data = [row['email'] for row in reader if int(row['rank']) < 10]
It may not be quite as concise, but it's probably more readable.
Upvotes: 1