Reputation: 335
I'm new to Python and Pandas, and I've pulled in a database table that contains 15+ different datetime columns. My task is to sort these columns generally by earliest to latest value in the rows. However, the data is not clean; sometimes, where Column A's date would come before Column B's date in Row 0, A would come after B in Row 1.
I wrote a few functions (redacted here for simplicity) that compare two columns by calculating the percentage of times dates in A come before and after B, and then sorting the columns based on that percentage:
def get_percentage(df, df_subset):
return len(df_subset)/float(len(df))
def duration_report(df, earlier_column, later_column):
results = {}
td = df[later_column] - df[earlier_column]
results["Before"] = get_percentage(df, df.loc[td >= pd.Timedelta(0)])
results["After"] = get_percentage(df, df.loc[td <= pd.Timedelta(0)])
ind = "%s vs %s" % (earlier_column, later_column)
return pd.DataFrame(data=results, index=[ind])
def order_date_columns(df, col1, col2):
before = duration_report(df, col1, col2).Before.values[0]
after = duration_report(df, col1, col2).After.values[0]
if before >= after:
return [col1, col2]
else:
return [col2, col1]
My goal with the above code is to programmatically implement the following:
If Col A dates come before Col B dates 50+% of the time, Col A should come before Col B in the list of earliest to latest datetime columns.
The order_date_columns()
function successfully sorts two columns into the correct order, but how do I apply this sorting to the 15+ columns at once? I've looked into df.apply()
, lambda
, and map()
, but haven't been able to crack this problem.
Any help (with code clarity/efficiency, too) would be appreciated!
Upvotes: 2
Views: 1039
Reputation: 2872
Since you're using Python 2.7, you can use the cmp
keyword argument to sorted
. To get the column names in the order that you're looking for, I would do something like:
# Returns -1 if first_column[i] > second_column[i] more often.
# Returns 1 if vice versa.
# Returns 0 if equal.
# Assumes df[first_column] and df[second_column] are the same length.
def compare_two(first_column, second_column):
c1_greater_count = 0
c2_greater_count = 0
# Iterate over the two columns in the dataframe. df must be in accessible scope.
for i in range(len(df[first_column])):
if df[first_column].iloc(i) > df[second_column].iloc[i]:
c1_greater_count += 1
elif df[second_column].iloc[i] > df[first_column].iloc[i]:
c2_greater_count += 1
if c1_greater_count > c2_greater_count:
return -1
if c2_greater_count > c1_greater_count:
return 1
return 0
df = get_dataframe_from_somewhere()
relevant_column_names = get_relevant_column_names(df) # e.g., get all the dates.
sorted_column_names = sorted(relevant_column_names, cmp=compare_two)
# sorted_column_names holds the names of the relevant columns,
# sorted according to the given ordering.
I'm sure there's a more Pythonic way to do it, but this should work. Note that for Python 3, you can use the cmp_to_key
utility.
Upvotes: 1
Reputation: 142
If you don't mind taking a bit of a shortcut and using the median of each date column, this should work:
def order_date_columns(df, date_columns_to_sort):
x = [(col, df[col].astype(np.int64).median()) for col in date_columns_to_sort]
return [x[0] for x in sorted(x, key=lambda x: x[1])]
Upvotes: 2