Reputation: 1
I am writing code that reads a very large CSV file with readlines(). The function stores the lines in a global variable, and I access that variable to search for specific words and count the number of times each one comes up in the file.
def init(filename):
    global lines
    with open(filename) as file:
        lines = file.readlines()

def total():
    males = 0
    females = 0
    for i in range(0, len(lines)):
        current_line = lines[i].split(",")
        if current_line[5] == 'M\n':
            males += 1
        elif current_line[5] == 'F\n':
            females += 1
    total_dict = {"Gender": {"M": males, "F": females}}
    return total_dict
For some reason this code works with smaller files, but I can't seem to get it to work with a super large one.
Upvotes: 0
Views: 49
Reputation:
If by "super large" you mean something that does not fit in RAM, then that's expected: you read the whole file into RAM and only then deal with one row at a time. Why not read the file line by line instead? You could write for line in file: ...
def total(name):
    males = females = 0
    with open(name, "rt") as f:
        for line in f:
            current = line.rstrip("\r\n").split(",")
            if current[5] == "M":
                males += 1
            elif current[5] == "F":
                females += 1
    return {"Gender": {"M": males, "F": females}}
Or with a Counter (it's like a dict, but you don't have to initialize zero values; entries are added automatically when you do gender[...] += 1):
from collections import Counter

def total(name):
    gender = Counter()
    with open(name, "rt") as f:
        for line in f:
            # rstrip (not rsplit) removes the trailing line ending
            current = line.rstrip("\r\n").split(",")
            gender[current[5]] += 1
    return {"Gender": gender}
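To see the Counter behaviour in isolation, here is a minimal sketch. The values list is made up for illustration and stands in for column 5 of the file:

```python
from collections import Counter

# Hypothetical sample values, not data from the question.
values = ["M", "F", "M", "M", "F"]

gender = Counter()
for value in values:
    # No need to initialise keys: a missing key counts as 0.
    gender[value] += 1

print(gender["M"], gender["F"])  # 3 2
print(gender["X"])               # 0, for a key that was never seen
```

Note that this version counts every distinct value it meets in that column, so a header row or a typo would show up as its own entry.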
Note also that to read a CSV file, you could use the csv module.
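import csv
from collections import Counter

def total(name):
    gender = Counter()
    # newline="" is what the csv documentation recommends when opening the file
    with open(name, "rt", newline="") as f:
        for current in csv.reader(f):
            gender[current[5]] += 1
    return {"Gender": gender}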
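One reason to prefer the csv module over split(",") is quoted fields. A small sketch, using a made-up sample row (an assumption, since the question does not show the data), where one field contains a comma:

```python
import csv
import io

# Made-up sample row: the second field is quoted and contains a comma.
sample = 'id,"Doe, Jane",F\n'

# Naive splitting cuts the quoted field in two.
naive = sample.rstrip("\r\n").split(",")

# csv.reader understands the quoting and keeps the field whole.
parsed = next(csv.reader(io.StringIO(sample)))

print(naive)   # ['id', '"Doe', ' Jane"', 'F']
print(parsed)  # ['id', 'Doe, Jane', 'F']
```

With the naive split, the column indices shift by one for such rows, so current[5] would silently point at the wrong field.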
Another piece of coding advice, not directly related to your current problem: avoid global variables unless there is a very good reason to use one. Here you could simply return the list, if you insist on reading the whole file in init. And when looping over a list, don't loop over a range as in for i in range(len(a)): but write for x in a: instead, unless you really need the index for some reason. And if you do need the index, it's often better to write for i, x in enumerate(a):
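A rough sketch of those two points applied to the code from the question. The names read_lines and unexpected_rows are hypothetical, introduced only for illustration, and column 5 is assumed to hold the gender as in the question:

```python
def read_lines(filename):
    # Return the list to the caller instead of assigning to a global.
    with open(filename) as file:
        return file.readlines()


def total(lines):
    males = females = 0
    for line in lines:  # iterate over the items directly
        current = line.rstrip("\r\n").split(",")
        if current[5] == "M":
            males += 1
        elif current[5] == "F":
            females += 1
    return {"Gender": {"M": males, "F": females}}


def unexpected_rows(lines):
    # enumerate() supplies the index when it is really needed,
    # here to report which rows hold neither "M" nor "F".
    bad = []
    for i, line in enumerate(lines):
        current = line.rstrip("\r\n").split(",")
        if current[5] not in ("M", "F"):
            bad.append(i)
    return bad
```

You would call it as total(read_lines(filename)). Keep in mind that this still loads the whole file into memory, so it only illustrates the style points and does not help with a file that is too large for RAM.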
Upvotes: 1