Reputation: 171
I'm trying to write a WHERE clause that will filter data differently for respective entities, based on time frames that are unique to each entity, like so:
import pandas as pd
product_dates = [
{'name': 'product_1', 'start_date': '2024-01-01', 'end_date': '2024-05-01'},
{'name': 'product_2', 'start_date': '2024-03-05', 'end_date': '2024-04-26'},
{'name': 'product_3', 'start_date': '2024-02-09', 'end_date': '2024-11-08'}
]
df_product = pd.DataFrame(product_dates)
What I'd like to see for the output is the number of sales from each product that are between the given dates, like so:
output_data = [
{'name': 'product_1', 'time_frame': '2024-01-01 to 2024-05-01', 'sales': 204},
{'name': 'product_2', 'time_frame': '2024-03-05 to 2024-04-26', 'sales': 1890},
{'name': 'product_3', 'time_frame': '2024-02-09 to 2024-11-08', 'sales': 766}
]
df_output = pd.DataFrame(output_data)
And to get that, there's a table of product sales structured like this:
sales_data = [
{'name': 'product_1', 'date': '2024-01-01', 'sales': 1},
{'name': 'product_2', 'date': '2024-01-01', 'sales': 3},
{'name': 'product_3', 'date': '2024-01-01', 'sales': 4},
...
]
df_sales = pd.DataFrame(sales_data)
Theoretically the SQL clause would be:
SELECT name, time_frame, SUM(quantity) AS sales
FROM sales_data
WHERE date BETWEEN start_date AND end_date
GROUP BY ALL
But I'm not sure how to have the WHERE clause reflect the fact that the dates are variable by product.
Any tips? Thank you!
Upvotes: 1
Views: 64
Reputation: 3505
I am basing the logic on @Dale's comment,I am not too sure whether expectation is to have answer in pandas or sql as your expected output is in pandas but I guess there is already an answer for SQL, so I added for pandas.
df_product = pd.DataFrame(product_dates)
df_sales = pd.DataFrame(sales_data)
df_product['start_date'] = pd.to_datetime(df_product['start_date'])
df_product['end_date'] = pd.to_datetime(df_product['end_date'])
df_sales['date'] = pd.to_datetime(df_sales['date'])
def filter_sales(row):
product_name = row['name']
start_date = row['start_date']
end_date = row['end_date']
filtered_sales = df_sales[(df_sales['name'] == product_name) &
(df_sales['date'] >= start_date) &
(df_sales['date'] <= end_date)]
total_sales = filtered_sales['sales'].sum()
return total_sales
df_product['sales'] = df_product.apply(filter_sales, axis=1)
df_product['time_frame'] = df_product.apply(lambda row: f"{row['start_date'].date()} to {row['end_date'].date()}", axis=1)
df_output = df_product[['name', 'time_frame', 'sales']]
print(df_output)
which for the sample data
product_dates = [
{'name': 'product_1', 'start_date': '2024-01-01', 'end_date': '2024-05-01'},
{'name': 'product_2', 'start_date': '2024-03-05', 'end_date': '2024-04-26'},
{'name': 'product_3', 'start_date': '2024-02-09', 'end_date': '2024-11-08'}
]
sales_data = [
{'name': 'product_1', 'date': '2024-01-01', 'sales': 1},
{'name': 'product_1', 'date': '2024-02-01', 'sales': 3},
{'name': 'product_1', 'date': '2024-06-01', 'sales': 4},
{'name': 'product_2', 'date': '2024-03-06', 'sales': 2},
{'name': 'product_2', 'date': '2024-04-25', 'sales': 3},
{'name': 'product_3', 'date': '2024-02-10', 'sales': 5},
{'name': 'product_3', 'date': '2024-11-09', 'sales': 6}
]
generates
name time_frame sales
0 product_1 2024-01-01 to 2024-05-01 4
1 product_2 2024-03-05 to 2024-04-26 5
2 product_3 2024-02-09 to 2024-11-08 5
Upvotes: 1
Reputation: 302
As per Dale K.s comment, In SQL you can use a JOIN for the sales_data and product_dates, this should create a link that you can use to filter the sales to a start date and end date then use SUM to add up the amount sold in that date range.
It could look something like this:
SELECT
s.name,
CONCAT(p.start_date, ' to ', p.end_date) AS time_frame,
SUM(s.sales) AS sales
FROM sales_data s
JOIN product_dates p ON s.name = p.name
WHERE s.date BETWEEN p.start_date AND p.end_date
GROUP BY s.name, p.start_date, p.end_date;
Upvotes: 5