Reputation: 861

How to remove a column from a Pandas DataFrame using Python?

I have the following code (Python 2.7):

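import math
from concurrent.futures import ThreadPoolExecutor, as_completed
from csv import QUOTE_NONNUMERIC
from datetime import datetime
import pandas as pd

df = pd.DataFrame()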
pages = [i for i in range(1, int(math.ceil(reports.get_reports_count()/page_size)+1))]
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
    futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
    for data in as_completed(futh):
        df = df.append(data.result(), ignore_index=True)
current_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = current_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)

This creates a compressed CSV file from the data stream. Now I want to make sure that the columns in the file are the ones I expect (order does not matter). That is, if for any reason the data stream contains more columns than expected, the extra columns should be removed. Note that I add a column of my own, called timestamp, to the data stream.

The allowed columns are:

cols_list = ['order_id', 'customer_id', 'date', 'price']

I'm aware that there is a del df['column_name'] option, but this doesn't work for me, as I have no idea what the redundant column's name will be.

I'm looking for something like:

if col_name not in cols_list:
    del df[???]  # delete the column and its data
    print [???]  # print the name of the redundant column for the log

I think there are two approaches here:

1. Don't add the redundant column to the df in the first place.
2. Remove the redundant column after the df.append is finished.

I prefer the 1st option, as it should perform better (?)

One of my attempts was:

for i, data in enumerate(df):
        for col_name in cols_list:
            if col_name not in data.keys():
               del df[col_name ]

But it doesn't work:

    if col_name not in data.keys():
AttributeError: 'str' object has no attribute 'keys'

I'm not sure I'm actually enumerating over the df itself.

Upvotes: 0

Answers (4)

Joe

Reputation: 12417

If you want to make your for-loop attempt work, try this:

for col_name in df.columns:
    if col_name not in cols_list:
        del df[col_name]
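
The AttributeError in the question arises because iterating over a DataFrame yields its column labels (plain strings here), not rows or dicts. Below is a minimal sketch, written for Python 3, that uses this to collect the redundant names, log them, and drop them in one call. The sample data and the `coupon` column are made up for illustration.

```python
import pandas as pd

cols_list = ["order_id", "customer_id", "date", "price"]

# Hypothetical sample data; "coupon" stands in for an unexpected column.
df = pd.DataFrame({"order_id": [1, 2], "price": [9.5, 3.0], "coupon": ["a", "b"]})

# Iterating a DataFrame yields column labels (strings), not rows.
redundant = [col_name for col_name in df if col_name not in cols_list]

# Log each redundant column name before removing it.
for col_name in redundant:
    print("Removing redundant column: {0}".format(col_name))

# Drop every redundant column in one call.
df = df.drop(columns=redundant)
```

Collecting the names first keeps the logging separate from the removal.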

Upvotes: 1

jezrael

Reputation: 863351

I think you need the intersection with the list of column names, and then select that subset with []:

cols_list = ['order_id', 'customer_id', 'date', 'price']
cols = df.columns.intersection(cols_list)
df = df[cols]
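
Note that filtering against cols_list alone would also drop the timestamp column that the question adds. Below is a sketch of one way around that, assuming timestamp should be kept: extend the allowed list before intersecting, and use `Index.difference` to get the dropped names for logging. The sample data and the `coupon` column are hypothetical.

```python
import pandas as pd

cols_list = ["order_id", "customer_id", "date", "price"]

# Hypothetical sample data; "coupon" is the unexpected column.
df = pd.DataFrame({
    "order_id": [1],
    "price": [9.5],
    "coupon": ["a"],
    "timestamp": ["2018-01-01 00:00:00"],
})

# Assumption: the added timestamp column should survive the filter.
allowed = cols_list + ["timestamp"]

cols = df.columns.intersection(allowed)   # allowed columns that are present
dropped = df.columns.difference(allowed)  # present columns that are not allowed

print("Dropping columns: {0}".format(list(dropped)))
df = df[cols]
```

Because the selection uses the intersection, allowed columns that are absent from the data (customer_id and date here) do not cause an error.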

Upvotes: 0

Paula Thomas

Reputation: 1190

According to the Pandas documentation for the function read_csv at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html there is a parameter 'usecols', which is described as follows:

usecols : list-like or callable, default None

Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.

This is the answer to your problem.
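
This only applies if the data is read from a CSV source; the question builds its DataFrame from a data stream. Under that assumption, a minimal sketch of the callable form of usecols follows. The in-memory CSV text is made up for illustration.

```python
import io

import pandas as pd

cols_list = ["order_id", "customer_id", "date", "price"]

# Hypothetical CSV content with one unexpected column, "coupon".
csv_text = "order_id,customer_id,date,price,coupon\n1,7,2018-01-01,9.5,a\n"

# The callable receives each column name; only names in cols_list are parsed.
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name in cols_list)
```

A callable is used rather than the list form so that allowed names missing from the file do not have to be handled separately.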

Upvotes: 0

Thijs van Ede

Reputation: 917

Removing the redundant column after the df.append is finished is quite simple:

df = df[cols_list]

As for the first suggestion, you could apply the statement above to each result before appending it to the df. However, note that this requires a pandas DataFrame object, so you would probably need to convert data.result() to a pandas DataFrame first.
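
Below is a sketch of that first approach, filtering each page before it is combined. `keep_allowed` is a hypothetical helper, and the list of dicts stands in for whatever data.result() actually returns. It uses pd.concat because DataFrame.append was removed in pandas 2.0, and an intersection so that pages lacking some allowed columns are tolerated.

```python
import pandas as pd

cols_list = ["order_id", "customer_id", "date", "price"]


def keep_allowed(page_data, allowed):
    """Hypothetical helper: build a DataFrame from one page, keep allowed columns only."""
    frame = pd.DataFrame(page_data)
    return frame[frame.columns.intersection(allowed)]


# Stand-ins for the data.result() value of each completed future.
pages = [
    [{"order_id": 1, "price": 9.5, "coupon": "a"}],
    [{"order_id": 2, "price": 3.0}],
]

# Filter every page first, then combine them once.
frames = [keep_allowed(page, cols_list) for page in pages]
df = pd.concat(frames, ignore_index=True)
```

Concatenating once at the end also avoids copying the growing DataFrame on every iteration.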

Upvotes: 0
