matt
matt

Reputation: 97

Set Timestamp Column in CSV as Index and Parse Dates Using Python and Pandas

I have a Python script using pandas that takes web-scraped data on COVID-19 from CSVs compressed in ZIP files. This is original data source of web-scraped data: https://github.com/statistikat/coronaDAT

I am having trouble with the Timestamp column that I load from the CSV files. The data appears to load properly into the DataFrame with all five columns from the original CSV files. The fifth column is the Timestamp of the data. When I use print(df_master.columns) I get the correct five columns, including the Timestamp.

Here is what I get from

print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Bezirk           903 non-null    object
 1   Anzahl           903 non-null    int64
 2   Anzahl_Inzidenz  903 non-null    object
 3   GKZ              859 non-null    float64
 4   Timestamp        859 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
         Bezirk  Anzahl   Anzahl_Inzidenz    GKZ            Timestamp
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T06:00:00
87  Wien(Stadt)    2264  119,315453933642  900.0  2020-04-22T19:00:00
87  Wien(Stadt)    2243  118,208729316766  900.0  2020-04-22T12:00:00
87  Wien(Stadt)    2254   118,78844221132  900.0  2020-04-22T16:00:00
87  Wien(Stadt)    2242  118,156028144534  900.0  2020-04-22T09:00:00
87  Wien(Stadt)    2266  119,420856278106  900.0  2020-04-22T23:00:00
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T02:00:00
87  Wien(Stadt)    2256  118,893844555784  900.0  2020-04-22T18:00:00
87  Wien(Stadt)    2237  117,892522283373  900.0  2020-04-22T07:00:00
87  Wien(Stadt)    2244  118,261430488998  900.0  2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful

However, when I try to set the DataFrame index to the Timestamp column (index_col=['Timestamp']), or parse the dates of the Timestamp column (parse_dates=['Timestamp']), I the following error message:

ValueError: Index Timestamp invalid

I tried specifying the exact columns in the CSV, but that didn't make a difference. Some of the CSV files being read may have no value or strings with no value in the Timestamp column. I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN, which would remove all rows with no value in the Timestamp column. I also tried setting the data type for the Timestamp column to datetime.

Set empty strings in TimeStamp column to NaN and drop rows:

#replace empty strings in Timestamp column with NaN values
                df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                df['Timestamp'].replace('  ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values 
                df.dropna(subset=['Timestamp'], inplace=True)

Set data type to datetime:

pd.to_datetime(df['Timestamp'],errors='ignore')

When I do either of these two things, I get the error message:

KeyError: 'Timestamp'

Any ideas why I can't do anything to the Timestamp column, like set as index, parse dates, or do anything to values in that column?

Here is the full code:

import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile


#set root path
rootPath = r"/Users/matt/test/"

#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'

#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'

#initialize variables
df_master = pd.DataFrame()
flag = False


#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern_ext):
        #create complete file name of ZIP file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            #if the filename starts with variable file_name
            if text_file.filename.startswith(pattern_filename):
                df = pd.read_csv(zip_file.open(text_file.filename), 
                    delimiter = ';', 
                    header = 0, 
                    #index_col = 'Timestamp', 
                    #parse_dates = 'Timestamp'
                    )

                #set data type of Timestamp column to datetime
                #pd.to_datetime(df['Timestamp'],errors='ignore') 

                #replace empty strings in Timestamp column with NaN values
                #df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                #df['Timestamp'].replace('  ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values 
                #df.dropna(subset=['Timestamp'], inplace=True)


                #filter for Bezirk values that equal variable set_bezirk
                df_vienna = df[df['Bezirk'] == set_bezirk]

                ##filter for Timestamp values that equal variable set_time
                #df_vienna = df[df['Timestamp'] != 0]

                #insert filtered values for variable set_bezirk to dataframe df
                df = df_vienna
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])

#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')

#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)


#prepare date to export to csv
frame = df_master

#export to csv
try:
    frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")


#verify if the dataset is present
    #if not present, download data set from GitHub
    #if present, verfify with GitHUb if dataset is updated
        #update dataset

Upvotes: 0

Views: 1096

Answers (1)

Hamza Zubair
Hamza Zubair

Reputation: 1410

Use

df2 = pd.to_datetime(df_master['Timestamp'], format="%Y-%m-%dT%H:%M:%S")

to convert to a timestamp column, then do your processing

Upvotes: 1

Related Questions