Reputation: 861
I have the following code (Python 2.7):
import math
from concurrent.futures import ThreadPoolExecutor, as_completed
from csv import QUOTE_NONNUMERIC
from datetime import datetime
import pandas as pd

df = pd.DataFrame()
pages = [i for i in range(1, int(math.ceil(reports.get_reports_count()/page_size)+1))]
# Fetch every page in its own thread and append each result as it completes.
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
    futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
    for data in as_completed(futh):
        df = df.append(data.result(), ignore_index=True)
# Add my own timestamp column and normalize the column names.
current_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = current_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)
This creates a compressed csv file from the data stream.
Now I want to make sure that the columns in the file are the ones I expect (order does not matter). That is, if for any reason the data stream contains more columns than expected, the extra columns should be removed. Note that I add a column of my own, called timestamp, to the data stream.
The allowed columns are:
cols_list = ['order_id', 'customer_id', 'date', 'price']
I'm aware of the del df['column_name'] option, but it doesn't work for me because I don't know in advance what the redundant column names will be.
I'm looking for something like:
if col_name not in cols_list:
    del df[???]  # delete the column and its data
    print [???]  # print the name of the redundant column for the log
I think there are two approaches here:
1. Don't append the redundant columns to df in the first place.
2. Remove the redundant columns after df.append is finished.
I prefer the 1st option, as it should perform better (?)
One of my attempts was:
for i, data in enumerate(df):
    for col_name in cols_list:
        if col_name not in data.keys():
            del df[col_name]
but it doesn't work:
if col_name not in data.keys(): AttributeError: 'str' object has no attribute 'keys'
I'm not sure I'm actually enumerating over df itself.
Upvotes: 0
Views: 717
Reputation: 12417
If you want to make your for-loop attempt work, try this:
for col_name in df.columns:
    if col_name not in cols_list:
        del df[col_name]
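The original attempt fails because iterating over a DataFrame yields its column labels (plain strings), not rows, hence the 'str' object has no attribute 'keys' error. As a minimal sketch, assuming made-up sample data and a hypothetical extra column named coupon, the redundant names can be collected first, printed for the log, and then dropped in one call. Keeping timestamp in the allowed set is an assumption based on the question; without it, that column would be removed as well.

```python
import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']

# Made-up sample data; 'coupon' is a hypothetical unexpected column.
df = pd.DataFrame(
    [[1, 10, '2018-01-01', 9.5, 'X1', '2018-01-02 00:00:00']],
    columns=['order_id', 'customer_id', 'date', 'price', 'coupon', 'timestamp'],
)

# Expected columns plus the timestamp column added by the question's code
# (assumption: the timestamp column should survive the cleanup).
allowed = set(cols_list) | {'timestamp'}

# df.columns holds the column labels, so this finds the extras.
redundant = [col for col in df.columns if col not in allowed]
for col in redundant:
    print('dropping redundant column: {0}'.format(col))

# Drop all redundant columns at once.
df = df.drop(columns=redundant)
```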
Upvotes: 1
Reputation: 863351
I think you need the intersection with the list of column names, and then to filter by that subset with []:
cols_list = ['order_id', 'customer_id', 'date', 'price']
cols = df.columns.intersection(cols_list)
df = df[cols]
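Here is a small sketch of this on made-up data. The coupon column is hypothetical, and the sample deliberately lacks date to show that the intersection does not raise a KeyError when an expected column is missing. The difference call is only there to report what gets dropped.

```python
import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']

# Made-up sample: one unexpected column ('coupon'), one expected column missing ('date').
df = pd.DataFrame(
    [[1, 10, 9.5, 'X1']],
    columns=['order_id', 'customer_id', 'price', 'coupon'],
)

# Labels present in both the frame and the allowed list.
cols = df.columns.intersection(cols_list)

# Labels present in the frame but not allowed, useful for logging.
dropped = df.columns.difference(cols_list)
print(list(dropped))

df = df[cols]
```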
Upvotes: 0
Reputation: 1190
According to the pandas documentation for the read_csv function at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html, there is a usecols parameter, which is described as follows:
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
This is the answer to your problem.
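A minimal sketch of the callable form, assuming the data is available as CSV text (the inline sample and its coupon column are made up for illustration). It only applies when the frame is built with read_csv.

```python
import io
import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']

# Made-up CSV text; 'coupon' is a hypothetical unexpected column.
csv_text = u"order_id,customer_id,date,price,coupon\n1,10,2018-01-01,9.5,X1\n"

# The callable receives each column name and returns True for columns to keep.
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name in cols_list)
print(list(df.columns))
```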
Upvotes: 0
Reputation: 917
Removing the redundant columns after df.append is finished is quite simple:
df = df[cols_list]
As for the first suggestion, you could apply the statement above to each result before appending it to df. However, note that this requires a pandas DataFrame object, so you would probably need to convert data.result() to a pandas DataFrame first.
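A sketch of the first approach, assuming each data.result() can be passed to the DataFrame constructor. The pages list below is made-up stand-in data, and keep_allowed is a hypothetical helper. It uses pd.concat, since DataFrame.append was removed in pandas 2.0, and an intersection so that a page lacking one of the columns does not raise a KeyError.

```python
import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']


def keep_allowed(result, allowed):
    """Convert one page result to a DataFrame and keep only the allowed columns."""
    frame = pd.DataFrame(result)
    return frame[frame.columns.intersection(allowed)]


# Stand-ins for the values returned by data.result() in the question.
pages = [
    [{'order_id': 1, 'customer_id': 10, 'date': '2018-01-01', 'price': 9.5, 'coupon': 'X1'}],
    [{'order_id': 2, 'customer_id': 11, 'date': '2018-01-02', 'price': 3.0}],
]

# Filter every page before combining, so redundant columns never reach df.
df = pd.concat([keep_allowed(page, cols_list) for page in pages], ignore_index=True)
```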
Upvotes: 0