Reputation: 45
I promise I searched and read several pages of google before I came to make this post. Due diligence has been done I swear.
I am trying to open a CSV file in python, read the file, make changes to it, and then write out a new file.
I got this far:
import csv
def water_data ():
with open('aquastat.csv', 'r') as csv_file:
csv_reader = csv.reader(csv_file)
final_file_name = "final_water.data.csv"
final_file = open(final_file_name,'w')
csv_writer = csv.writer(final_file,delimiter="\t")
for row in csv_reader:
csv_writer.writerow(row)
But I'm struggling to get any further. I want to remove certain columns, but I cannot comprehend how python will know the difference between a row and a column. For example, the columns are Area, Area ID, Year, Value
, etc. I only want Area, Year, Value
. I tried
for row in final_file:
final_file.writerow(row[0] + row[2] + row[4] + row[5])
but I kept getting the following error: IndexError: list index out of range
[I would also like to replace blank cells with a *, but the column thing is the priority]
Note that I cannot use Pandas
If possible I would really appreciate if someone could not just tell me the code but explain it to me so I can figure it out further myself.
TLDR: How can I remove empty rows from the CVS file and write only certain columns into the new file?
INPUT:
"Area","Area Id","Variable Name","Variable Id","Year","Value","Symbol","Md"
"Afghanistan",2,"Total area of the country",4100,1977,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1982,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1987,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1992,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1997,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,2002,65286.0,"E","",""
Upvotes: 3
Views: 1587
Reputation: 23743
You can use a DictReader
and DictWriter
to selectively modify and write specific columns using their header/column names.
I'll use io.StringIO
to simulate the files
s = '''"Area","Area Id","Variable Name","Variable Id","Year","Value","Symbol","Md"
"Afghanistan",2,"Total area of the country",4100,1977,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1982,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1987,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1992,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1997,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,2002,65286.0,"E","",""'''
f = io.StringIO(s)
g = io.StringIO()
reader = csv.DictReader(f)
writer = csv.DictWriter(g, fieldnames=["Area","Variable Id","Value"], extrasaction='ignore')
for row in reader:
#process row values?
row['Value'] = float(row['Value']) / 1000
writer.writerow(row)
Note that the DictWriter extrasaction
parameter needs to be set to 'ignore'
because there are extra keys/fields in the original.
If the csv file does not have a header row, you have to specify the field names for the DictWriter.
>>> g.seek(0)
0
>>> print(g.read())
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Afghanistan,4100,65.286
Upvotes: 0
Reputation: 12438
I have tried to provide you an answer as close as possible than what you have done so far.
Prototype:
import csv
with open('aquastat.csv', 'r') as csv_file:
csv_reader = csv.reader(csv_file)
final_file_name = "final_water.data.csv"
final_file = open(final_file_name,'w')
csv_writer = csv.writer(final_file,delimiter="\t")
for row in csv_reader:
if len(row) >= 6:
row = [row[0], row[4], row[5]]
csv_writer.writerow(row)
final_file.close()
explanations:
csv_writer.writerow(row)
where you output the row in the output csv file. I have added the line row = [row[0], row[4], row[5]]
where I overwrite the content of the array row
by an array containing only 3 cells, those cells are respectively taken from the Area
, Year
, Value
columnsif len(row) >= 6:
to check that you have at least enough elements in your row to extract the columns until Value
.input:
"Area","Area Id","Variable Name","Variable Id","Year","Value","Symbol","Md"
"Afghanistan",2,"Total area of the country",4100,1977,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1982,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1987,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1992,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,1997,65286.0,"E","",""
"Afghanistan",2,"Total area of the country",4100,2002,65286.0,"E","",""
output:
Area Year Value
Afghanistan 1977 65286.0
Afghanistan 1982 65286.0
Afghanistan 1987 65286.0
Afghanistan 1992 65286.0
Afghanistan 1997 65286.0
Afghanistan 2002 65286.0
Upvotes: 3
Reputation: 3265
This line will not IndexError
and will write the line ignoring non existing values:
final_file.writerow((row[i] for i in (0,2,5) if i<len(row)))
This line will not IndexError
and will write the line replacing empty values with a star:
final_file.writerow((row[i] if i<len(row) else "*" for i in (0,2,5)))
This line will not IndexError
either but won't write the line:
if len(row)>5: final_file.writerow((row[i] for i in (0,2,5)))
This line will not IndexError
either but won't write any line at all:
pass
Upvotes: 0