Reputation: 1035
I am trying to implement the Single Pass algorithm for the following problem. Please see my code below:
def sp_algorithm(self, dataframe, col_dict):
    # for each index value, get all columns that contain it
    for ind in dataframe.index:
        # vectorised membership test across all columns
        cols_list = dataframe.columns[dataframe.isin([ind]).any()]

        # for each key in col_dict, intersect its values with cols_list;
        # if the current cell is null or empty, do nothing
        for key_col in col_dict.keys():
            column_val = dataframe.loc[ind, key_col]
            # NaN != NaN, so this comparison skips missing values
            if (column_val == column_val) and (column_val != ''):
                col_dict[key_col] = list(set(col_dict[key_col]).intersection(cols_list))
The dictionary containing column names looks like this:
col_dict = {'col_A': ['col_B', 'col_C'], 'col_B': ['col_A', 'col_C'], 'col_C': ['col_A', 'col_B']}
As you can see, my code is currently of O(n^2) time complexity, as there are two nested for loops in there.
Currently, each iteration (covering both loops) takes around 0.8 seconds, which is probably not a problem for a small dataset. However, the dataset I am processing has 300k rows and over 80 columns.
My problem is: how do I implement single pass for the dictionary intersection step, so that there is only one for loop instead of two?
EDIT The dataframe will contain a sorted index and values in ascending order, as below:

       col_A  col_B  col_C
index
0        nan      0      0
1          1    nan      1
2          2    nan      2
3        nan      3      3
So, my current function loops through each index ind, gets the column names via cols_list = dataframe.columns[dataframe.isin([ind]).any()], and intersects these with the dictionary values.
1st iteration:
cols_list = ['col_B', 'col_C']
Then it looks up the values for each key in col_dict (only if the column has a value, so nan cells are skipped), intersects them with cols_list, and updates the dictionary:
col_dict = {'col_A': ['col_B', 'col_C'], 'col_B': ['col_C'], 'col_C': ['col_B']}
2nd iteration:
col_B is skipped when checking the dictionary value, as its cell is nan, so its dictionary value stays the same.
cols_list = ['col_A', 'col_C']
col_dict = {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
3rd iteration:
col_B is skipped when checking the dictionary value, as its cell is nan, so its dictionary value stays the same.
cols_list = ['col_A', 'col_C']
col_dict = {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
4th iteration:
col_A is skipped when checking the dictionary value, as its cell is nan, so its dictionary value stays the same.
cols_list = ['col_B', 'col_C']
col_dict = {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
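Putting the walkthrough above into runnable form (a sketch using the sample frame from the EDIT; pd.notna stands in for the NaN self-comparison in my function):

```python
import pandas as pd

# sample frame from the EDIT above
df = pd.DataFrame(
    {"col_A": [float("nan"), 1, 2, float("nan")],
     "col_B": [0, float("nan"), float("nan"), 3],
     "col_C": [0, 1, 2, 3]},
    index=[0, 1, 2, 3],
)
col_dict = {"col_A": ["col_B", "col_C"],
            "col_B": ["col_A", "col_C"],
            "col_C": ["col_A", "col_B"]}

for ind in df.index:
    # columns whose values contain the current index value
    cols_list = df.columns[df.isin([ind]).any()]
    for key_col in col_dict:
        # skip keys whose cell at this row is NaN
        if pd.notna(df.loc[ind, key_col]):
            col_dict[key_col] = list(set(col_dict[key_col]).intersection(cols_list))

# after the 4th iteration:
# col_dict == {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
```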
Upvotes: 2
Views: 112
Reputation: 15310
The strength of numpy/scipy/pandas lies in using C code to do most of the work for you. Given that, in your words, there are 300k rows and 80 columns, I suggest that you first make every effort to ensure that you are processing the rows in C, not Python. Thus, your first loop should be eliminated: don't process 300k elements using Python.
The way I read your requirements, you have an index (row labels) that has values of some type that may appear in the individual cells in your other columns. Something like this:
Index    A    B    C    D
1        0    0    3    0
2        2    0    1   -1
3        0    0    0    0
192      0    0    1   -1
You want to know, for each Index, if that value appears in any of column A, column B, column C, etc. If any Index value appears in a column, that column is "ALIVE".
At the end of the process, columns are either ALIVE or not, and you then want to filter your other dictionary to exclude columns that are not ALIVE.
In my example above, column A is considered ALIVE because of { 2 }, and column C is considered ALIVE because of { 3, 1 }, but columns B and D are not ALIVE because they don't contain any values that are present in Index. Is this right?
Try using isin to determine if the values in the columns are present in the index. Then use any to collapse the boolean results to a single boolean value that determines if the column is alive:
row_labels = df.index
col_is_alive = df.isin(row_labels).any()  # NB: axis=0 is the default
(NOTE: I am not in a place where I can run this code. It may contain syntax or other errors.)
Now you have a series of 80 boolean values, telling you which columns are alive. You can do your processing however you like.
alive_col_names = { name for name in df.columns if col_is_alive[name] } # Set comprehension
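For instance, on the small frame from my table above (a sketch; not run against your real data):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 2, 0, 0],
     "B": [0, 0, 0, 0],
     "C": [3, 1, 0, 1],
     "D": [0, -1, 0, -1]},
    index=[1, 2, 3, 192],
)
col_is_alive = df.isin(df.index).any()  # one boolean per column
alive_col_names = {name for name in df.columns if col_is_alive[name]}
# alive_col_names == {'A', 'C'}: A via {2}, C via {3, 1}
```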
However, your initial problem statement makes it sound like you are doing this one time (as opposed to iteratively updating the groups of column names). In that case, rather than intersect the dictionary values (lists of every column name but the key) I would suggest simply computing the values directly. That is, compute the key->value pairs directly rather than trying to "intersect" the value lists with that list of all column names.
col_dict = { key:alive_col_names - {key} for key in alive_col_names}
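Assuming the alive columns came out as {'A', 'C'}, as in my example above, that comprehension yields:

```python
alive_col_names = {"A", "C"}  # assumed result of the isin/any step
# each key maps to every other alive column
col_dict = {key: alive_col_names - {key} for key in alive_col_names}
# col_dict == {'A': {'C'}, 'C': {'A'}}
```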
On the other hand, if you are somehow iteratively updating these values, then I would suggest you make your second data structure a dictionary of string -> set instead of string -> list, since that will give you access to the standard set operations and behavior.
new_col_dict = {}
for key, other_cols in col_dict.items():
    if key not in alive_col_names:
        continue
    # keep only the columns that are still alive
    new_col_dict[key] = other_cols.intersection(alive_col_names)
col_dict = new_col_dict
(Which can be collapsed using a dict comprehension but perhaps should not be, in the interest of readability.)
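For reference, the collapsed form would be a single comprehension; this sketch assumes the string -> set shape suggested above, with hypothetical values:

```python
alive_col_names = {"A", "C"}  # hypothetical alive set
col_dict = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}

# drop dead keys and intersect each value set with the alive columns
col_dict = {key: other_cols & alive_col_names
            for key, other_cols in col_dict.items()
            if key in alive_col_names}
# col_dict == {'A': {'C'}, 'C': {'A'}}
```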
Upvotes: 1