Reputation: 195
I have to process a CSV file with millions of "value, datetime" lines: I need to find whether the value shifted (dropped or rose) by GAP_VALUE (first parameter) within a given time period TIME_GAP (second parameter).
I am not able to find an efficient algorithm to perform this analysis.
I tried to analyze the CSV file line by line:
Choose the (TIME_GAP, GAP_VALUE) parameters for the analysis.
For the current line, create the data subset of lines within TIME_GAP and check it for a GAP_VALUE shift.
Do the same for the next line in the CSV file.
If the (TIME_GAP, GAP_VALUE) parameters change, redo the whole test.
My algorithm is highly resource-consuming. Could you help me find a more efficient one, please?
Thank you, BR.
Upvotes: 0
Views: 74
Reputation: 3833
Why do you describe this as a 2-step process?
You have to read all the lines (if the lines were sorted by time and of fixed length you could try a binary search, since a few seek calls would be dramatically faster than reading everything; but usually lines in a CSV file vary in length, so the only possibility is reading them all).
But while you read a line you can check whether it is in the requested time range, and if so update min_value and max_value, plus the times they happened. As soon as one of the two values changes, check whether the new delta exceeds gap_value: if it does, you can exit and return the two events; otherwise you continue reading.
This is definitely O(n) in time, and O(1) in space.
Another thing: if the file is always the same and keeps growing, while reading you could store some "milestone" positions elsewhere, e.g. the offset of 00:00 of each day, or whatever suits you. Then at the next run you would not need to read from the beginning of the file, but only from the appropriate milestone.
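A minimal sketch of that milestone idea, assuming the file is append-only and the timestamp is an ISO-8601 string in the first field (the helper name and index file are hypothetical, not part of the answer):

import json

def build_milestones(filename, index_file="milestones.json"):
    # Record the byte offset of the first line of each day, so a later run
    # can seek() straight to that day instead of rereading the whole file.
    milestones = {}                           # "YYYY-MM-DD" -> byte offset
    with open(filename, newline='') as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            day = line.split(',', 1)[0][:10]  # assuming an ISO timestamp comes first
            milestones.setdefault(day, offset)
    with open(index_file, 'w') as out:
        json.dump(milestones, out)
    return milestones

On the next run you could open the CSV again, call f.seek(milestones["2024-01-02"]) (a hypothetical day key) and start reading from there.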
EDIT: Seeing the comments by @Ext3h, I think my answer is not clear enough, so here is a sample Python implementation:
import csv
from datetime import datetime

def find_shifts(upper_time_bound, time_gap, gap_value, filename):
    # assume the file contains just the time point and the value, in that order
    # assume the file is not sorted
    # upper_time_bound is a datetime, time_gap a timedelta, gap_value a number
    lower_time_bound = upper_time_bound - time_gap
    min_val = None
    with open(filename, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            row_time = datetime.fromisoformat(row[0])  # assuming ISO-8601 timestamps
            value = float(row[1])
            if lower_time_bound <= row_time <= upper_time_bound:
                if min_val is None:  # we still need to init the bookkeeping vars
                    min_val = max_val = value
                    min_val_time = max_val_time = row_time
                else:
                    must_check = False
                    if value < min_val:
                        must_check = True
                        min_val = value
                        min_val_time = row_time
                    elif value > max_val:
                        must_check = True
                        max_val = value
                        max_val_time = row_time
                    if must_check:
                        if max_val - min_val > gap_value:
                            return min_val, min_val_time, max_val, max_val_time
    return False  # no shifts detected
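For reference, a hypothetical call (the parameter values below are made up purely for illustration) could look like:

from datetime import datetime, timedelta

result = find_shifts(
    upper_time_bound=datetime(2024, 1, 2, 12, 0),  # hypothetical window end
    time_gap=timedelta(minutes=30),
    gap_value=5.0,
    filename="data.csv",
)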
Upvotes: 0
Reputation: 6391
You need a sorted list of counting buckets for the values within the current sliding window (e.g. a sorted_map<value, counter> buckets).
I assume you already know how to walk a sliding window over an already sorted set of entry lines in O(n), so I'm only going to describe the actions you need to take on the corresponding window updates.
after_increment_window:
    buckets.emplace_if_missing(window.newest.value, 0)
    ++buckets[window.newest.value]
    if(buckets.last.key - buckets.first.key > threshold)
        // You got a match.

before_decrement_window:
    --buckets[window.oldest.value]
    if(buckets[window.oldest.value] == 0)
        // This is where buckets.last and buckets.first can change.
        buckets.remove(window.oldest.value)
The requirement for a sorted list of buckets gives a runtime of O(n * log(m)) and a space complexity of O(m), with n being the total number of entry lines and m being the maximum number of unique values within the window.
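A rough Python sketch of this bookkeeping, assuming the rows are already sorted by time and using the third-party sortedcontainers package as the sorted map (both assumptions are mine, not part of the answer):

from sortedcontainers import SortedDict   # pip install sortedcontainers

def find_shift(rows, time_gap, gap_value):
    # rows: list of (timestamp, value) pairs, already sorted by timestamp
    buckets = SortedDict()                 # value -> count inside the current window
    left = 0                               # index of the oldest row still in the window
    for right, (t, v) in enumerate(rows):
        # before_decrement_window: drop rows older than t - time_gap
        while rows[left][0] < t - time_gap:
            old_v = rows[left][1]
            buckets[old_v] -= 1
            if buckets[old_v] == 0:        # buckets.first / buckets.last may change here
                del buckets[old_v]
            left += 1
        # after_increment_window: add the newest row and test the spread
        buckets[v] = buckets.get(v, 0) + 1
        if buckets.peekitem(-1)[0] - buckets.peekitem(0)[0] > gap_value:
            return rows[left:right + 1]    # the window that contains the shift
    return None                            # no shift found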
Upvotes: 0