Reputation: 7575
The CSV file does not have any headers and has several columns, as such:
john, 1, orange, chicken
mary, 7, purple, hamburger
joey, 2, yellow, chicken
Upvotes: 2
Views: 576
Reputation: 17500
While you could use the Python csv module to instantiate your own reader, process each row into a tuple, then iterate over those tuples to select all items where a given column matches your criteria ...
... you could also do this more succinctly by using the Pandas library, which is built over NumPy, which is, in turn, part of the SciPy ecosystem.
It might seem like this is a lot of code to install and import for such a simple operation. But that raises the question: are you just trying to do something very specific on a small scale, or are you writing something which might have to scale over much larger data sets and eventually support much more complex analyses or queries?
Learning Pandas and NumPy is well worth the investment. Installing them is fairly easy and you can then use them any time you're dealing with data in arrays, series, or "frames" (spreadsheet-like tables, labeled and indexed, etc.).
Here's how you could do this with Pandas:
import pandas as pd
data_file = './mydata.csv' ## <-- Change this to match yours
data = pd.read_csv(data_file, header=None)  ## header=None: the file has no header row
rows = data[data.loc[:, 3] == 'chicken']
## Use rows here
Nota Bene: you do have to ensure that your CSV data is "clean", or you have to make sure that you specify values as they actually exist in your CSV file. For example, when I cut and pasted your sample data into a local file, I failed to account for the leading spaces you had after your commas.
Of course you could select ==' chicken' (note the leading space) as your match criteria or you could use a Pandas DataFrame .map() to select the data thus:
...
data[data.loc[:,3].map(lambda x: 'chicken' in x)]
... this will match any row of data where column 3 contains the string 'chicken'. But that mapping is harder to read and probably less efficient.
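If the only problem is whitespace after the delimiters, you can also have read_csv strip it at parse time with its skipinitialspace option; a minimal sketch, assuming the same header-less mydata.csv:
import pandas as pd
# skipinitialspace=True drops the space that follows each comma while parsing
data = pd.read_csv('./mydata.csv', header=None, skipinitialspace=True)
rows = data[data.loc[:, 3] == 'chicken']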
These examples may seem a bit hard to read, especially if you're reading other examples using Pandas. That's because one conventionally is dealing with labeled DataFrames. You could easily add your own labels to the data, in memory, by simply assigning a list of names thus:
import pandas as pd
data_file = './mydata.csv' ## <-- Change this to match yours
data = pd.read_csv(data_file, header=None)
# Add this:
data.columns=['name', 'num', 'color', 'food']
# Select with this:
rows = data[data['food']=='chicken']
... which is considerably more readable.
One advantage of using Pandas for something like this is that it can efficiently handle fairly large data sets; tens, even hundreds, of thousands of rows should present no problems on a modern system with reasonable available memory.
To process that indexing expression (data['food']=='chicken'), Pandas broadcasts the comparison across the data structure, yielding a boolean Series which is then used to select all the matching rows. Most of the operations can be rendered into a low-level machine form which is much more efficient than iterating over the data row by row, column by column, up in the interpreter.
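To make that concrete, here's a small sketch (using the labeled DataFrame from above) that shows the intermediate boolean mask explicitly:
mask = data['food'] == 'chicken'   # a boolean Series, one True/False per row
print(mask.head())                 # dtype: bool
rows = data[mask]                  # equivalent to data[data['food'] == 'chicken']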
Another, completely different, approach would be to read the CSV data and marshal it into your own SQL database. Just as Python includes a csv module, it also includes, in the standard library, a complete SQLite3 subsystem.
This may seem considerably more involved than just processing the data as you iterate over it, or using Pandas to read the data into a DataFrame and use its features. On the other hand, it also allows you to save the data in a form that's considerably more robust and efficient. Once you've inserted or updated the data from the CSV source into your SQLite database file, you can re-use that data for all manner of queries and other operations.
Here's an example:
import csv
import sqlite3

data_file = './mydata.csv'  ## <-- Change this to match yours
database = ':memory:'       ## <-- replace this with a filename to make the data persistent!

conn = sqlite3.connect(database)
db = conn.cursor()
results = db.execute('CREATE TABLE IF NOT EXISTS data (name VARCHAR, num INTEGER, color VARCHAR, food VARCHAR)')

## Could use data.to_sql('data', conn) if you've already created
## the data DataFrame as in the previous example
stmt_ins = 'INSERT INTO data (name, num, color, food) VALUES (?, ?, ?, ?)'
## Could use the less robust: stmt_ins = "INSERT INTO data VALUES (?, ?, ?, ?)"

with open(data_file) as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) != 4:
            # Skip errors
            continue
        # Strip stray whitespace (such as the spaces after the commas in the sample data)
        row = [field.strip() for field in row]
        try:
            row[1] = int(row[1])
        except ValueError:
            # Skip errors
            continue
        results = None
        try:
            results = db.execute(stmt_ins, row)
        except sqlite3.Error:
            # Skip errors
            continue
        if results.rowcount != 1:
            # Ignore the error
            pass

rows = db.execute("SELECT * FROM data WHERE food = 'chicken'").fetchall()
# conn.commit(); db.close(); conn.close()
You might find that intimidating at first. The with statement is just setting up the file reading context, so our file will automatically be closed on exit from the subsequent suite (block) of code.
Within that context we set up a simple CSV reader then iterate over each row. That's the skeleton for handling any data with the CSV module.
Within our loop we simply skip any incomplete records, strip stray whitespace from each field, try to convert the second field to an integer, skip any record where that fails, try to execute our SQL INSERT statement, skip any errors, and check to see if we inserted a row (and ignore any error there).
Of the roughly two dozen lines of code, almost half are just there to skip over errors in our data. I've included them here so you know how and where you could add more useful error reporting or handling.
Even the explicit conversion from string to integer is optional with SQLite3 (given my database connection and schema here). However, I'm showing the work in my example as it would be necessary in most cases when using other databases or even other settings in SQLite.
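To illustrate (a throwaway demonstration against the table above; the row values here are made up), SQLite's type affinity will coerce a well-formed integer string into integer storage for a column declared INTEGER:
db.execute(stmt_ins, ('test', '7', 'blue', 'fish'))  # num passed as the string '7'
print(db.execute("SELECT num, typeof(num) FROM data WHERE name = 'test'").fetchone())
# -> (7, 'integer'): the column's INTEGER affinity converted the text for us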
As noted in the comments, this code would create and manipulate the data table in memory. Just replace :memory: with an arbitrary filename and you'd be inserting your data into a file; uncomment the last comment line and you'll be cleanly saving your database and closing your connection.
There are a number of more subtle nuances to this code example. These are the sorts of things that are often ignored in beginner level tutorials and have to be learned later, usually from bitter experiences with errors and messy, real world data sets.
The "IF NOT EXISTS" in our ''CREATE TABLE'' command allows this code to work cleanly even if we have saved data in our database file from a previous session and we're adding more data to it. The stmt_ins explicitly lists the columns so it would work correctly even if you've added other columns, such as an auto-incrementing primary key, to your table. (I recommend that robust code should always provide explicit column designators for SQL SELECT, INSERT and UPDATE statements so that you code can continue to work with a wide range of database schema changes).
This code snippet can be used as a skeleton for creating an SQLite database from any CSV file. Of course you might create a more elaborate database schema, with common integrity constraints such as PRIMARY KEY, NOT NULL, and UNIQUE, foreign key REFERENCES, as well as automated handling of primary keys, composite keys, triggers, custom CHECK constraints, and so on. You could add indexes and, of course, your resulting database could then efficiently support much more elaborate queries and operations.
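As an illustrative sketch only (the constraints and index below are not part of the original example), a richer schema in that spirit could replace the simple CREATE TABLE above:
db.execute('''
    CREATE TABLE IF NOT EXISTS data (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,   -- surrogate key, filled in automatically
        name  VARCHAR NOT NULL,
        num   INTEGER NOT NULL CHECK (num >= 0),
        color VARCHAR,
        food  VARCHAR NOT NULL
    )
''')
db.execute('CREATE INDEX IF NOT EXISTS idx_data_food ON data (food)')
Because stmt_ins names its columns explicitly, the same INSERT statements would still work unchanged against this schema.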
Just as Pandas/NumPy can parse fancy indexing and other operations into lower-level code which is broadcast over arrays ("vectorized"), SQLite3 will parse SQL queries and other correctly coded statements using its "query planner." And just as some array operations have to be expressed in forms amenable to NumPy broadcasting (often using ufuncs), one frequently has to craft SQL for use with the .executemany() method.
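For example, instead of one .execute() per row, the validated rows could be collected and passed to .executemany() in a single call; a minimal sketch, reusing stmt_ins, db, and conn from the example above:
clean_rows = [
    ('john', 1, 'orange', 'chicken'),
    ('mary', 7, 'purple', 'hamburger'),
    ('joey', 2, 'yellow', 'chicken'),
]
db.executemany(stmt_ins, clean_rows)   # one statement, many parameter tuples
conn.commit()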
With SQLite3, if you use data files, you'll be able to process data sets far larger than what you can fit in your system's RAM. This code example would handle INSERT processing (creation of the database) for arbitrarily sized data sets (limited by your disk space). However, if the number of "chicken" rows were too large, you'd have to change the call to .fetchall() into a loop over .fetchmany(xxx) ... where xxx is the number of rows you can handle at each iteration.
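A sketch of that pattern (the batch size of 1000 is arbitrary):
cursor = db.execute("SELECT * FROM data WHERE food = 'chicken'")
while True:
    batch = cursor.fetchmany(1000)   # at most 1000 rows per iteration
    if not batch:
        break                        # no rows left
    for row in batch:
        pass                         # process each row here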
Again, that's the sort of consideration that isn't covered by introductory tutorials, but is annoying to have to learn later after you've already written code that's hitting these sort of limits.
Of course you could also combine these bits of advice. If you process your CSV data into an SQL database, then you could read the data back into Pandas using data = pd.read_sql('SELECT * FROM data', conn) (where "conn" is the established connection to any Python DB-API compatible database, or an SQLAlchemy "selectable" object, and "data" is the name of our table or view). As noted in one of the comments, you could also have exported the data from Pandas into your SQL database with a single method call; Pandas will create a reasonable schema for you based on the DataFrame's metadata. Of course Pandas is for handling the data in memory; for much larger data sets, either use batch processing on data in files, or look at Apache PySpark or other means of distributing your Python computations across a cluster (IPython Parallel, for example).
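Roughly, combining Pandas with the SQLite example looks like this (a sketch; conn is the SQLite connection and data the labeled DataFrame from the earlier examples, and the if_exists/index arguments are just reasonable choices for this case):
# DataFrame -> SQL: Pandas creates or appends to the table for you
data.to_sql('data', conn, if_exists='append', index=False)

# SQL -> DataFrame: run any query and get the result back as a DataFrame
chicken = pd.read_sql("SELECT * FROM data WHERE food = 'chicken'", conn)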
I realize I'm going way beyond what you asked. But one point of using Python is to leverage the tools provided by the ecosystem. A simpler answer to your question wouldn't provide you, and others whose searches bring them to this question, with the means to discover just how rich and deep that ecosystem is.
Upvotes: 1
Reputation: 109546
You don't need csv for this, just simple Python. This solution is very efficient with regard to memory, as it just operates row by row.
with open('temp.csv', 'r') as fin, open('temp2.csv', 'w') as fout:
    for row in fin:
        if row.split(',')[3].strip() == 'chicken':
            fout.write(row)
This is much more memory efficient than a list comprehension, which requires reading and storing in memory all the data that would be written out to the file.
Timings
# Create sample data file of 30k rows, 2/3rds with 'chicken'.
with open('temp.csv', 'w') as f:
    for _ in range(10000):
        f.write("john,1,orange,chicken\n")
        f.write("mary,7,purple,hamburger\n")
        f.write("joey,2,yellow,chicken\n")
%%timeit
with open('temp.csv', 'r') as fin, open('temp2.csv', 'w') as fout:
    for row in fin:
        if row.split(',')[3].strip() == 'chicken':
            fout.write(row)
# 10 loops, best of 3: 41 ms per loop
%%timeit
with open('temp.csv') as old_file, open('temp3.csv', 'w') as new_file:
    cr = csv.reader(old_file)
    cw = csv.writer(new_file)
    cw.writerows([r for r in cr if r[3] == 'chicken'])
# 10 loops, best of 3: 58.7 ms per loop
# Turn previous list comprehension into a generator.
%%timeit
with open('temp.csv') as old_file, open('temp3.csv', 'w') as new_file:
    cr = csv.reader(old_file)
    cw = csv.writer(new_file)
    cw.writerows((r for r in cr if r[3] == 'chicken'))
# 10 loops, best of 3: 66.5 ms per loop
Note: the snippet below, which passes newline='' to open(), fails on Python 2, where open() does not accept that keyword argument:
with open('temp.csv') as input_file, open('temp4.csv', 'w', newline='') as output_file:
    reader = csv.reader(input_file, delimiter=',')
    writer = csv.writer(output_file, delimiter=',')
    writer.writerows(filter(lambda x: x[3].strip() != 'chcicken', reader))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-41-072d37a3ff92> in <module>()
----> 1 with open('temp.csv') as input_file, open('temp4.csv', 'w', newline='') as output_file:
      2     reader = csv.reader(input_file, delimiter=',')
      3     writer = csv.writer(output_file, delimiter=',')
      4     writer.writerows(filter(lambda x: x[3].strip() != 'chcicken', reader))
TypeError: 'newline' is an invalid keyword argument for this function
Upvotes: 2
Reputation: 135
import csv
with open('input.csv') as input_file, open('output.csv', 'w', newline='') as output_file:
    reader = csv.reader(input_file, delimiter=',')
    writer = csv.writer(output_file, delimiter=',')
    writer.writerows(filter(lambda x: x[3].strip() != 'chicken', reader))
First, I'm using a context manager to properly close the files. Then there are two objects for reading and writing CSV. The last line writes the rows out, filtered on the value of the fourth column, without loading the whole file into memory.
Upvotes: 1
Reputation: 30453
You could use csv from the standard library:
import csv
with open('old.csv') as old_file, open('new.csv', 'w') as new_file:
    cr = csv.reader(old_file)
    cw = csv.writer(new_file)
    cw.writerows([r for r in cr if r[3] == 'chicken'])
Upvotes: 1