Reputation: 25
I am working on a problem in which I have a very large dataset in the form of a csv file. This csv file has various columns, one of which is state code. The dataset is significantly larger than I need it to be- I just need the values from one state.
A solution I was thinking of using would be to read in the csv file using Python and then write to a new file with just the rows that I need- removing 49 of the 50 states (U.S).
The csv file has 3million+ rows. I am new to Python and I am not sure how I can effectively do this, what are the best ways to complete this task?
Thank you for your help and I apologize if this seems like a simple question- I am new to Python.
Upvotes: 1
Views: 1413
Reputation: 4275
Deleting all the rows is different from extracting just the one state. Deleting lines from files is dealt with sufficiently here. Assuming what you want is to extract just the one state, here's one way you could do it.
with open("file.csv", "r") as file:
for line in file:
sline = line.split(",")
if sline[n] == statecode:
# Where n is the position of the statecode column zero indexed
# and statecode is your target state code
pass # this is where you do something with the data
With this pattern you implement pretty much any solution that involves incrementing through the csv line by line. This includes writing to a new csv, working with the data in memory, pandas
or dask
solutions, and so on.
Your system very may well be able to handle just directly importing to pandas
or dask
if that's what you want. These are popular data munging python tools with built-in read_csv
methods.
Upvotes: 2