Reputation: 1231
I have two lists of files that I'm pulling from an FTP folder using:
sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
This results with two lists looking something like:
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']
With my real data-set, something like...
for sales, orders in zip(sorted(sFiles),sorted(oFiles)):
df = pd.concat(...)
Gets my desired result, but there are going to be times where something goes wrong and both files do not make it into the proper FTP folder, so I'd like some code that will create an iterable object where I can extract the matched orders and sales file name based on date.
The following works... I'm not sure what "pythonic" score I'd give it. Poor readability, but it is a comprehension, so I'd imagine there are performance gains?
[(sales, orders) for sales in sFiles for orders in oFiles if re.search(r'\d+',sales).group(0) == re.search(r'\d+',orders).group(0)]
Upvotes: 4
Views: 1023
Reputation: 71451
You can use a dictionary:
import collections
d = collections.defaultdict(dict)
sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
for sale, order in zip(sFiles, oFiles):
a, b = sale.split("_")
a1, b2 = order.split("_")
d[a]["sales"] = sale
d[a1]["orders"] = order
print(dict(d))
Now, your data is structured in the format: {"date":{"sales":"sales filename", "orders":"orders filename"}}
Output:
{'20170828': {'sales': '20170828_sales.csv'}, '20170822': {'sales': '20170822_sales.csv', 'orders': '20170822_orders.csv'}, '20170823': {'orders': '20170823_orders.csv'}, '20170824': {'sales': '20170824_sales.csv', 'orders': '20170824_orders.csv'}, '20170825': {'sales': '20170825_sales.csv', 'orders': '20170825_orders.csv'}, '20170826': {'sales': '20170826_sales.csv', 'orders': '20170826_orders.csv'}, '20170827': {'sales': '20170827_sales.csv', 'orders': '20170827_orders.csv'}}
Edit:
with dictionary comprehension and building off of your proposed list-comprehension solution, you can try this:
import re
final_data = [{"sold":sold, "order":order} for sold in sFiles for order in oFiles if re.findall("\d+", sold)[0] == re.findall("\d+", order)[0]]
Output:
[{'sold': '20170822_sales.csv', 'order': '20170822_orders.csv'}, {'sold': '20170824_sales.csv', 'order': '20170824_orders.csv'}, {'sold': '20170825_sales.csv', 'order': '20170825_orders.csv'}, {'sold': '20170826_sales.csv', 'order': '20170826_orders.csv'}, {'sold': '20170827_sales.csv', 'order': '20170827_orders.csv'}]
Upvotes: 2
Reputation: 177685
This creates a generator that returns matched pairs in date order:
from collections import defaultdict
def match(sales,orders):
# When a key is referenced for the first time, the value
# will default to the result of the lambda.
d = collections.defaultdict(lambda:[None,None])
# set sales files on the first entry in the value.
for sale in sFiles:
d[sale[:8]][0] = sale
# set orders files on the second entry.
for order in oFiles:
d[order[:8]][1] = order
for k in sorted(d):
# Both values need to exist.
# If you want the singles remove the if.
if all(v for v in d[k]):
yield d[k]
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']
for s,o in match(sFiles,oFiles):
print(s,o)
Output:
20170822_sales.csv 20170822_orders.csv
20170824_sales.csv 20170824_orders.csv
20170825_sales.csv 20170825_orders.csv
20170826_sales.csv 20170826_orders.csv
20170827_sales.csv 20170827_orders.csv
Upvotes: 1
Reputation: 27915
Just because comprehensions exist doesn't mean you should use them for everything. This works fine:
date = re.compile(r'\d+')
for sales in sFiles:
salesDate = date.search(sales).group(0)
for orders in oFiles:
orderDate = date.search(orders).group(0)
if salesDate == orderDate:
print sales, orders
Is it possible to make it faster? Yes. But you don't need to force it into a list comprehension just because you can. Sometimes writing more code is better, just because it spreads the complexity out a little.
Here's an incremental improvement that makes the algorithm O(n):
date = re.compile(r'\d+')
orders_dict = dict((date.search(file).group(0), file) for file in oFiles)
for sales in sFiles:
salesDate = date.search(sales).group(0)
if salesDate in orders_dict:
orders = orders_dict[salesDate]
print sales, orders
else:
# what do you do if it doesn't exist? You can't put handling code
# here if you try to write this as a comprehension.
Upvotes: 1
Reputation: 476
Taking advantage of the index of the pandas DataFrame:
import pandas as pd
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']
s_dates = [pd.Timestamp.strptime(file[:8], '%Y%m%d') for file in sFiles]
s_df = pd.DataFrame({'sFiles': sFiles}, index=s_dates)
o_dates = [pd.Timestamp.strptime(file[:8], '%Y%m%d') for file in oFiles]
o_df = pd.DataFrame({'oFiles': oFiles}, index=o_dates)
df = s_df.join(o_df, how='outer')
and so:
>>> print(df)
sFiles oFiles
2017-08-22 20170822_sales.csv 20170822_orders.csv
2017-08-23 NaN 20170823_orders.csv
2017-08-24 20170824_sales.csv 20170824_orders.csv
2017-08-25 20170825_sales.csv 20170825_orders.csv
2017-08-26 20170826_sales.csv 20170826_orders.csv
2017-08-27 20170827_sales.csv 20170827_orders.csv
2017-08-28 20170828_sales.csv NaN
Upvotes: 3