Vishal Kalathil

Reputation: 1

How do I generate a diff between a file and an incomplete patch of it?

I'm working on a project that requires me to build an interface between an LLM and the user. The LLM generates code fixes, and I need to output the diff between the generated code and the contents of a given C file. The problem arises when the generated code contains ellipses (...) that stand in for a block of unchanged code; these break the diff.

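So far I have tried using Levenshtein distance to find similar lines and then diffing each matched pair. However, this approach doesn't work when new lines are added, because an inserted line has no counterpart in the original: it is either dropped for falling below the threshold or paired with an unrelated line. This is how I implemented it: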

import difflib
from rapidfuzz import fuzz


def compare_code_with_diffs(
    original_code, fixed_code, similarity_threshold=0.7
):
    def string_similarity(s1, s2):
        # Remove spaces and convert to lowercase for a more robust comparison
        s1 = s1.replace(" ", "").lower()
        s2 = s2.replace(" ", "").lower()
        return (
            fuzz.ratio(s1, s2) / 100.0
        )  # Convert percentage to a float between 0 and 1

    def find_matching_indices(A, B):
        result = []
        for i, a_item in enumerate(A):
            best_match = None
            best_similarity = 0
            for j, b_item in enumerate(B):
                similarity = string_similarity(a_item, b_item)
                if similarity > best_similarity:
                    best_similarity = similarity
                    best_match = (j, b_item, similarity)

            if best_match and best_similarity >= similarity_threshold:
                result.append(
                    (i, best_match[0], a_item, best_match[1], best_similarity)
                )

        return result

    def generate_diff(old_line, new_line):
        differ = difflib.Differ()
        diff = list(differ.compare([old_line], [new_line]))
        return "\n".join(diff)

    A = fixed_code.splitlines()
    B = original_code.splitlines()
    matches = find_matching_indices(A, B)

    diffs = []
    for match in matches:
        if similarity_threshold < match[4] < 1.0:
            diff = generate_diff(match[3], match[2])
            diffs.append(
                {
                    "fixed_index": match[0],
                    "original_index": match[1],
                    "similarity": match[4],
                    "diff": diff,
                }
            )

    return diffs

For example, if I had the file

#include <stdio.h>
 
int main()
{
    int arr_int[5];
    return 0;
}

and the code suggestion is

...
int main()
{
    int arr_int[5];
    char arr_char[5];
    ...
}

The output should show only the insertion of char arr_char[5];
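
One possible direction, as a rough sketch under assumptions not stated in the question: treat every line consisting only of ... as a wildcard, split the suggestion into chunks at those lines, anchor each chunk in the original with difflib.SequenceMatcher, and diff only the anchored region. The names split_on_ellipsis and partial_diff are hypothetical. The sketch assumes that chunks appear in the same order as in the original file and that unchanged lines match exactly, whitespace included.

```python
import difflib


def split_on_ellipsis(lines):
    """Split lines into chunks separated by lines that are just '...'."""
    chunks, current = [], []
    for line in lines:
        if line.strip() == "...":
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)
    return chunks


def partial_diff(original_code, suggestion):
    """Return (tag, original_line_index, old_lines, new_lines) tuples.

    original_line_index is zero-based and refers to the original file.
    """
    original = original_code.splitlines()
    changes = []
    cursor = 0  # assumption: chunks appear in file order
    for chunk in split_on_ellipsis(suggestion.splitlines()):
        window = original[cursor:]
        anchor = difflib.SequenceMatcher(None, window, chunk, autojunk=False)
        blocks = [b for b in anchor.get_matching_blocks() if b.size]
        if not blocks:
            continue  # nothing to anchor this chunk to; skip it
        # The chunk is taken to cover the original from its first
        # matched line to its last matched line.
        start = blocks[0].a
        end = blocks[-1].a + blocks[-1].size
        region = window[start:end]
        # Diff only the anchored region, so text hidden behind '...'
        # is never reported as deleted.
        matcher = difflib.SequenceMatcher(None, region, chunk, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                changes.append(
                    (tag, cursor + start + i1, region[i1:i2], chunk[j1:j2])
                )
        cursor += end
    return changes
```

For the example above, partial_diff(original, suggestion) returns [("insert", 5, [], ["    char arr_char[5];"])], i.e. a single insertion before the original's zero-based line 5 (return 0;). Chunks made up only of common lines such as } can anchor to the wrong place, so a real implementation would likely need stronger anchoring, for example by requiring a minimum number of matched lines per chunk.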

Upvotes: 0

Views: 66

Answers (0)
