Reputation: 3379
I am importing data into a pandas dataframe from Google BigQuery and I'd like to sort the results by date. My code is as follows:
import sys, getopt
import pandas as pd
from datetime import datetime
# set your BigQuery service account private private key
pkey ='#REMOVED#'
destination_table = 'test.test_table_2'
project_id = '#REMOVED#'
# write your query
query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
"""
data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey, parse_dates=True, index_col='date')
date = data.sort_index()
data.info()
data.describe()
print(data.head())
My output is shown below, as you can see dates are not sorted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null object
Visits 3 non-null int32
dtypes: int32(1), object(1)
memory usage: 116.0+ bytes
date Visits
0 20180312 207440
1 20180310 178155
2 20180311 207452
I have read several questions and so far tried the below, which resulted in no change to my output:
index_col='date'
and adding date = data.sort_values(by='date')
headers = ['Date', 'Visits']
) and dypes (dtypes = [datetime, int]
) to my read_gbq line (parse_dates=True, names=headers
)What am I missing?
Upvotes: 1
Views: 6850
Reputation: 3379
I managed to solve this by transforming my date field into a datetime object, I assumed this would be done automatically by parse_date=True
but it seems that will only parse a existing datetime object.
I added the following after my query to create a new datetime column from my date string, then I was able to use data.sort_index()
and it worked as expected:
time_format = '%Y-%m-%d'
data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey)
data['n_date'] = pd.to_datetime(data['date'], format=time_format)
data.index = data['n_date']
del data['date']
del data['n_date']
data.index.names = ['Date']
data = data.sort_index()
Upvotes: 2
Reputation: 210982
As most of the work is done on the Google BigQuery side, I'd do sorting there as well:
query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
ORDER BY Date
"""
Upvotes: 1