Reputation: 149
I have several CSV files that I need to read through inside of python3 using Pandas. I am trying to find which column (or index) of each CSV contains a column with dates. Two caveats:
I have already began iterating through them using Pandas like so, and I can pretty quickly identify if a column contains dates using the datetime parse function. Im having trouble going column by column (through each header to check for words like "date") as well as through each row in that column to see if I can parse it as a date. Ideas?
Example CSV:
Table
DATE ,TOPIC
Aug 21 ,"Some text here"
Aug 23 ,Some other text here, it isn't always in quotes
Current code for iterating through the rows:
import pandas as pd
df = pd.read_csv(filePath)
for i, row in enumerate(df.itertuples(), 0):
print(i, row.Index)
Edit: Here is my current code for iterating through each row in the Index column of a file. This works when the dates are in the index, but I need to loop through each column (not just the index) to see which one contains the dates.
for i, row in enumerate(df.itertuples(), 0):
for keyword in dateKeywords:
if keyword.lower() in row.Index.lower():
foundColumn = True
if foundColumn:
for days in dayAndMonthList:
if days in row.Index:
canParse = is_date(row.Index)
print(canParse)
The above code loops through the Dataframe, and looks for keywords like "date" or "week" to see a potential dates column. If it finds one, it then goes through to see if the rows in that column all contain the name (or abbreviation) of a month). This works to parse dates in the index column, but how can I make it shift over and test each column for the keyword, and then loop through the rows in that column to check if they're parseable?
Upvotes: 2
Views: 388
Reputation: 2015
Retrieving index/column pairs that refer to cells containing dates in a DataFrame:
# random setup
import pandas as pd
from dateutil.parser import parse
df = pd.DataFrame({
"DATE": ['15/05/1965', '02/15/1994', 'Aug 23'],
"MIX": ['ABC', 'DEF', '02/05/1995'],
"TRICKERY": ['some chars', 'obfuscating 12/12/2000 the date', 'more chars'],
"NAT": ['not', 'at', 'all']
})
# solution
def get_date(s):
try:
parse(s, fuzzy=True)
return True
except ValueError:
return False
# get a boolean dataframe with True values indicating a cell is a date
df = df.applymap(get_date)
# get a list of tuples containing an index and a column name
parsable = df[df].stack().index.tolist()
>>> df
DATE MIX NAT TRICKERY
0 True False False False
1 True False False True
2 True True False False
>>> parsable
[(0, 'DATE'), (1, 'DATE'), (1, 'TRICKERY'), (2, 'DATE'), (2, 'MIX')]
Upvotes: 1