Reputation: 97
I have a Python script using pandas that takes web-scraped data on COVID-19 from CSVs compressed in ZIP files. This is original data source of web-scraped data: https://github.com/statistikat/coronaDAT
I am having trouble with the Timestamp column that I load from the CSV files. The data appears to load properly into the DataFrame with all five columns from the original CSV files. The fifth column is the Timestamp of the data. When I use print(df_master.columns)
I get the correct five columns, including the Timestamp.
Here is what I get from
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bezirk 903 non-null object
1 Anzahl 903 non-null int64
2 Anzahl_Inzidenz 903 non-null object
3 GKZ 859 non-null float64
4 Timestamp 859 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
Bezirk Anzahl Anzahl_Inzidenz GKZ Timestamp
87 Wien(Stadt) 2231 117,57631524998 900.0 2020-04-22T06:00:00
87 Wien(Stadt) 2264 119,315453933642 900.0 2020-04-22T19:00:00
87 Wien(Stadt) 2243 118,208729316766 900.0 2020-04-22T12:00:00
87 Wien(Stadt) 2254 118,78844221132 900.0 2020-04-22T16:00:00
87 Wien(Stadt) 2242 118,156028144534 900.0 2020-04-22T09:00:00
87 Wien(Stadt) 2266 119,420856278106 900.0 2020-04-22T23:00:00
87 Wien(Stadt) 2231 117,57631524998 900.0 2020-04-22T02:00:00
87 Wien(Stadt) 2256 118,893844555784 900.0 2020-04-22T18:00:00
87 Wien(Stadt) 2237 117,892522283373 900.0 2020-04-22T07:00:00
87 Wien(Stadt) 2244 118,261430488998 900.0 2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful
However, when I try to set the DataFrame index to the Timestamp column (index_col=['Timestamp']
), or parse the dates of the Timestamp column (parse_dates=['Timestamp']
), I the following error message:
ValueError: Index Timestamp invalid
I tried specifying the exact columns in the CSV, but that didn't make a difference. Some of the CSV files being read may have no value or strings with no value in the Timestamp column. I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN, which would remove all rows with no value in the Timestamp column. I also tried setting the data type for the Timestamp column to datetime.
Set empty strings in TimeStamp column to NaN and drop rows:
#replace empty strings in Timestamp column with NaN values
df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
df['Timestamp'].replace(' ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
df.dropna(subset=['Timestamp'], inplace=True)
Set data type to datetime:
pd.to_datetime(df['Timestamp'],errors='ignore')
When I do either of these two things, I get the error message:
KeyError: 'Timestamp'
Any ideas why I can't do anything to the Timestamp column, like set as index, parse dates, or do anything to values in that column?
Here is the full code:
import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile
#set root path
rootPath = r"/Users/matt/test/"
#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'
#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'
#initialize variables
df_master = pd.DataFrame()
flag = False
#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
#filter files that match pattern of .zip
for filename in fnmatch.filter(files, pattern_ext):
#create complete file name of ZIP file
zip_file = ZipFile(os.path.join(root, filename))
for text_file in zip_file.infolist():
#if the filename starts with variable file_name
if text_file.filename.startswith(pattern_filename):
df = pd.read_csv(zip_file.open(text_file.filename),
delimiter = ';',
header = 0,
#index_col = 'Timestamp',
#parse_dates = 'Timestamp'
)
#set data type of Timestamp column to datetime
#pd.to_datetime(df['Timestamp'],errors='ignore')
#replace empty strings in Timestamp column with NaN values
#df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
#df['Timestamp'].replace(' ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
#df.dropna(subset=['Timestamp'], inplace=True)
#filter for Bezirk values that equal variable set_bezirk
df_vienna = df[df['Bezirk'] == set_bezirk]
##filter for Timestamp values that equal variable set_time
#df_vienna = df[df['Timestamp'] != 0]
#insert filtered values for variable set_bezirk to dataframe df
df = df_vienna
if not flag:
df_master = df
flag = True
else:
df_master = pd.concat([df_master, df])
#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')
#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
#prepare date to export to csv
frame = df_master
#export to csv
try:
frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
print("Export to CSV Successful")
except:
print("Export to CSV Failed")
#verify if the dataset is present
#if not present, download data set from GitHub
#if present, verfify with GitHUb if dataset is updated
#update dataset
Upvotes: 0
Views: 1096
Reputation: 1410
Use
df2 = pd.to_datetime(df_master['Timestamp'], format="%Y-%m-%dT%H:%M:%S")
to convert to a timestamp column, then do your processing
Upvotes: 1