Theo F

Reputation: 1365

Delete columns that contain a certain string from multiple CSVs

I have a Windows directory containing hundreds of CSVs. Each CSV is between 500 MB and 1 GB in size.

Every CSV has hundreds of columns, and every CSV shares the same column names for the first n columns and the last n columns. Every column in between these starts with the characters "R20" (e.g. R20150311). The exact names of these "R20..." columns vary between CSVs.

Example of headers:

id,name,location,R20150311,R20150312,R20150313,R20150314,...,date,user

Using Python I want to cycle through every CSV in the directory, deleting columns where the column name contains the text "R20".

I don't want new output CSVs to be created; rather, I'd like to overwrite the originals to save on disk space.

Upvotes: 0

Views: 660

Answers (3)

Jakob Guldberg Aaes

Reputation: 844

It's good practice to show what you have tried and what resources you have found so far, instead of asking before even trying. Now that that's out of the way, we can look at a possible solution.


As you said, we have to look in each file and find the columns we would like to drop. The files can easily be found with the glob module via glob('*.csv').

We can then proceed to load the data with pandas, since the data has headers, via pd.read_csv(fname). The unwanted columns are filtered out with the DataFrame.filter method, which keeps only the columns that do not start with the unwanted prefix: df.filter(regex='^(?!R20)'). Here the ^ anchors the match to the beginning of the column name, and (?!...) defines a negative lookahead, so names starting with "R20" are excluded. Using the built-in filtering instead of manually iterating also makes sure you access the columns in an efficient way, compared to extracting the columns and iterating through them yourself.
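As a quick illustration of the lookahead (the column names here are made up):

import pandas as pd

df = pd.DataFrame(columns=['id', 'name', 'R20150311', 'R20160102', 'date'])

# Keeps only the columns whose names do not start with "R20"
print(df.filter(regex='^(?!R20)').columns.tolist())
# ['id', 'name', 'date']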

Finally, save the output back to the same name as the input with df.to_csv(fname, index=False); index=False keeps pandas from adding an extra index column to the file.

Putting it all together we get the following

import pandas as pd
from glob import glob

# All CSVs in the current directory
fnames = glob('*.csv')

for fname in fnames:
    df = pd.read_csv(fname)
    # Drop every column whose name starts with "R20"
    df = df.filter(regex='^(?!R20)')
    # Overwrite the original file; index=False avoids adding an index column
    df.to_csv(fname, index=False)
    del df

As a final word, I would recommend looking at vaex, which is an incredible new tool within data mining for Python. If you convert your data to HDF5, vaex can manage it without loading it into memory until you actually need to, which will speed up your workflow tremendously, as data does not have to be passed around between calls.
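A minimal sketch of that workflow (assuming vaex's from_csv/open API; the file name is illustrative):

import vaex

# One-off conversion: convert=True writes an HDF5 copy next to the CSV
df = vaex.from_csv('big_file.csv', convert=True)

# Later opens memory-map the HDF5 file instead of loading it into RAM
df = vaex.open('big_file.csv.hdf5')
print(df.column_names)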

Upvotes: 3

Ji Wei

Reputation: 881

Let files be the list of CSV files in the directory, filtering os.listdir() on the extension so non-CSV files are skipped:

import os
import pandas as pd

files = [f for f in os.listdir() if f.endswith('.csv')]

for file in files:
    df = pd.read_csv(file)
    cols = df.columns.tolist()
    for col in cols:
        if col.startswith('R20'):
            del df[col]
    # Overwrite the original; index=False avoids writing an extra index column
    df.to_csv(file, index=False)
    del df   # drop the reference so the dataframe can be freed before the next file
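Since the files in the question run to 500 MB-1 GB each, a chunked variant of the same idea can keep memory flat (a sketch; the chunksize value and helper name are illustrative). The filtered chunks go to a temporary file, which then replaces the original:

import os
import pandas as pd

def strip_r20_columns(fname, chunksize=100_000):
    tmp = fname + '.tmp'
    reader = pd.read_csv(fname, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Same column filter as above, applied chunk by chunk
        chunk = chunk.filter(regex='^(?!R20)')
        # Write the header only with the first chunk, then append
        chunk.to_csv(tmp, mode='w' if i == 0 else 'a', header=(i == 0), index=False)
    os.replace(tmp, fname)  # swap in the rewritten file only once it is complete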

Upvotes: 1

Umar.H

Reputation: 23099

IIUC, something like this. I'd be wary of writing over your original files though, so I've added an "_edited" suffix to the output file names; you can change this after you're done.

import pandas as pd
from pathlib import Path

p = Path(path_to_your_files)

log = {}

files = [f for f in p.glob('*.csv')]
for file in files:
    # Read only the header row to work out which columns to keep
    df_cols = pd.read_csv(file, nrows=1)
    cols = df_cols.filter(regex='^(?!R20)').columns
    # Read only the wanted columns, then write to a new "_edited" file
    df = pd.read_csv(file, usecols=cols)
    df.to_csv(file.parent.joinpath(f"{file.stem}_edited.csv"), index=False)
    log[file] = 'updated'

Upvotes: 1
