Reputation: 195
I have to process a CSV file with millions of "value, datetime" lines: I need to find whether the value shifted (dropped or rose) by GAP_VALUE (first parameter) within a given time period TIME_GAP (second parameter).
I am not able to find an efficient algorithm to perform this analysis.
I tried to analyze the CSV file line by line:
Choose the (TIME_GAP, GAP_VALUE) parameters for the analysis.
For the current line, create the data subset of lines within TIME_GAP and check it for a GAP_VALUE shift.
Do the same for the next line in the CSV file.
If the (TIME_GAP, GAP_VALUE) parameters change, redo the whole test.
My algorithm is highly resource-consuming. Could you help me find a more efficient one, please?
Thank you, BR.
Upvotes: 0
Views: 74
Reputation: 3833
Why do you describe this as a 2-step process?
You have to read all the lines (if the lines were sorted by time and of fixed length you could try a binary search, since a few seek calls would be dramatically faster than reading everything; but usually lines in a CSV file vary in length, so the only possibility is reading them all).
But while you read a line you can check whether it is in the requested time range, and if so update min_value and max_value, plus the times they happened. As soon as one of the two values changes, check whether the new delta exceeds gap_value: if it does, you can exit and return the two events; otherwise you continue reading.
This is definitely O(n) in time, and O(1) in space.
Another thing: if the file is always the same and keeps growing, while reading you could store some "milestone" positions elsewhere, e.g. the offset of 00:00 of each day, or whatever suits you. Then at the next run you would not need to read from the beginning of the file, but only from the appropriate milestone.
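A minimal sketch of that milestone idea, assuming the file is append-only and the timestamp is an ISO-8601 string in the first field (the helper name and index file are hypothetical, not part of the answer):

import json

def build_milestones(filename, index_file="milestones.json"):
    # Record the byte offset of the first line of each day, so a later run
    # can seek() straight to that day instead of rereading the whole file.
    milestones = {}                           # "YYYY-MM-DD" -> byte offset
    with open(filename, newline='') as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            day = line.split(',', 1)[0][:10]  # assuming an ISO timestamp comes first
            milestones.setdefault(day, offset)
    with open(index_file, 'w') as out:
        json.dump(milestones, out)
    return milestones

On the next run you could open the CSV again, call f.seek(milestones["2024-01-02"]) (a hypothetical day key) and start reading from there.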
EDIT: Seeing the comments by @Ext3h, I think my answer is not clear enough, so here is a sample Python implementation:
import csv
from datetime import datetime

def find_shifts(upper_time_bound, time_gap, gap_value, filename):
    # assume the file contains just the time point and the value, in that order
    # assume the file is not sorted
    # upper_time_bound is a datetime, time_gap a timedelta, gap_value a number
    lower_time_bound = upper_time_bound - time_gap
    min_val = None
    with open(filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            row_time = datetime.fromisoformat(row[0])  # assuming ISO-8601 timestamps
            value = float(row[1])
            if lower_time_bound <= row_time <= upper_time_bound:
                if min_val is None:  # we still need to init the bookkeeping vars
                    min_val = max_val = value
                    min_val_time = max_val_time = row_time
                else:
                    must_check = False
                    if value < min_val:
                        must_check = True
                        min_val = value
                        min_val_time = row_time
                    elif value > max_val:
                        must_check = True
                        max_val = value
                        max_val_time = row_time
                    if must_check:
                        if max_val - min_val > gap_value:
                            return min_val, min_val_time, max_val, max_val_time
    return False  # no shifts detected
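For reference, a hypothetical call (the parameter values below are made up purely for illustration) could look like:

from datetime import datetime, timedelta

result = find_shifts(
    upper_time_bound=datetime(2024, 1, 2, 12, 0),  # hypothetical window end
    time_gap=timedelta(minutes=30),
    gap_value=5.0,
    filename="data.csv",
)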
Upvotes: 0
Reputation: 6391
You need a sorted list of counting buckets for the values within the current sliding window (e.g. a sorted_map<value, counter> buckets).
I assume you already know how to walk a sliding window over an already sorted set of entry lines in O(n), so I'm only going to describe the actions you need to take on the corresponding window updates.
after_increment_window:
    buckets.emplace_if_missing(window.newest.value, 0)
    ++buckets[window.newest.value]
    if(buckets.last.key - buckets.first.key > threshold)
        // You got a match.

before_decrement_window:
    --buckets[window.oldest.value]
    if(buckets[window.oldest.value] == 0)
        // This is where buckets.last and buckets.first can change.
        buckets.remove(window.oldest.value)
The requirement for a sorted list of buckets gives a runtime of O(n * log(m)) and a space complexity of O(m), with n being the total number of entry lines and m being the maximum number of unique values within the window.
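A rough Python sketch of this bookkeeping, assuming the rows are already sorted by time and using the third-party sortedcontainers package as the sorted map (both assumptions are mine, not part of the answer):

from sortedcontainers import SortedDict   # pip install sortedcontainers

def find_shift(rows, time_gap, gap_value):
    # rows: list of (timestamp, value) pairs, already sorted by timestamp
    buckets = SortedDict()                 # value -> count inside the current window
    left = 0                               # index of the oldest row still in the window
    for right, (t, v) in enumerate(rows):
        # before_decrement_window: drop rows older than t - time_gap
        while rows[left][0] < t - time_gap:
            old_v = rows[left][1]
            buckets[old_v] -= 1
            if buckets[old_v] == 0:        # buckets.first / buckets.last may change here
                del buckets[old_v]
            left += 1
        # after_increment_window: add the newest row and test the spread
        buckets[v] = buckets.get(v, 0) + 1
        if buckets.peekitem(-1)[0] - buckets.peekitem(0)[0] > gap_value:
            return rows[left:right + 1]    # the window that contains the shift
    return None                            # no shift found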
Upvotes: 0