paul
paul

Reputation: 25

Delete all rows with specific values from csv file

I am working on a problem in which I have a very large dataset in the form of a csv file. This csv file has various columns, one of which is state code. The dataset is significantly larger than I need it to be- I just need the values from one state.

A solution I was thinking of using would be to read in the csv file using Python and then write to a new file with just the rows that I need- removing 49 of the 50 states (U.S).

The csv file has 3million+ rows. I am new to Python and I am not sure how I can effectively do this, what are the best ways to complete this task?

Thank you for your help and I apologize if this seems like a simple question- I am new to Python.

Upvotes: 1

Views: 1413

Answers (2)

Charles Landau
Charles Landau

Reputation: 4275

Deleting all the rows is different from extracting just the one state. Deleting lines from files is dealt with sufficiently here. Assuming what you want is to extract just the one state, here's one way you could do it.

with open("file.csv", "r") as file:
    for line in file:
        sline = line.split(",")
        if sline[n] == statecode:
            # Where n is the position of the statecode column zero indexed
            # and statecode is your target state code
            pass # this is where you do something with the data

With this pattern you implement pretty much any solution that involves incrementing through the csv line by line. This includes writing to a new csv, working with the data in memory, pandas or dask solutions, and so on.

Your system very may well be able to handle just directly importing to pandas or dask if that's what you want. These are popular data munging python tools with built-in read_csv methods.

Upvotes: 2

Andreas
Andreas

Reputation: 2521

Regarding processing a large amount of data in Python, one of the most recommended library is Pandas. It is a Python library, specialized for data analysis, and is able to parse through a large amount of data using DataFrame object.

Upvotes: 1

Related Questions