I'm working on a project that requires me to build an interface between an LLM and the user. The LLM generates code fixes, and I need to output the diff between the generated code and the contents of a given C file. The problem arises when the generated code contains ellipses (...) to indicate a block of unchanged code; this breaks the diff.
So far I have tried using Levenshtein distance to find similar lines and then diffing each matched pair. However, this approach doesn't seem to work when new lines are added. This is how I implemented it:
import difflib

from rapidfuzz import fuzz


def compare_code_with_diffs(
    original_code, fixed_code, similarity_threshold=0.7
):
    def string_similarity(s1, s2):
        # Remove spaces and convert to lowercase for a more robust comparison
        s1 = s1.replace(" ", "").lower()
        s2 = s2.replace(" ", "").lower()
        return (
            fuzz.ratio(s1, s2) / 100.0
        )  # Convert percentage to a float between 0 and 1

    def find_matching_indices(A, B):
        result = []
        for i, a_item in enumerate(A):
            best_match = None
            best_similarity = 0
            for j, b_item in enumerate(B):
                similarity = string_similarity(a_item, b_item)
                if similarity > best_similarity:
                    best_similarity = similarity
                    best_match = (j, b_item, similarity)
            if best_match and best_similarity >= similarity_threshold:
                result.append(
                    (i, best_match[0], a_item, best_match[1], best_similarity)
                )
        return result

    def generate_diff(old_line, new_line):
        differ = difflib.Differ()
        diff = list(differ.compare([old_line], [new_line]))
        return "\n".join(diff)

    A = fixed_code.splitlines()
    B = original_code.splitlines()
    matches = find_matching_indices(A, B)
    diffs = []
    for match in matches:
        if similarity_threshold < match[4] < 1.0:
            diff = generate_diff(match[3], match[2])
            diffs.append(
                {
                    "fixed_index": match[0],
                    "original_index": match[1],
                    "similarity": match[4],
                    "diff": diff,
                }
            )
    return diffs
For example, given the file:
#include <stdio.h>
int main()
{
    int arr_int[5];
    return 0;
}
and the code suggestion:
...
int main()
{
    int arr_int[5];
    char arr_char[5];
    ...
}
the output should show only the insertion of char arr_char[5];
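To make the desired behaviour concrete, here is a minimal sketch of one possible direction, not a tested solution: expand each ellipsis back into the original lines it stands for, then run an ordinary line diff. It uses only the standard-library difflib. The names expand_ellipses, diff_against_original and ELLIPSIS_LINE are hypothetical, and the sketch assumes that an elision always sits on a line of its own and that the code around each elision can be anchored in the original by exact (whitespace-stripped) line matches.

```python
import difflib
import re

# Assumption: an elision is a line holding only "..." (or the single
# character "…"), optionally wrapped in a C comment ("// ...", "/* ... */").
ELLIPSIS_LINE = re.compile(r"^\s*(?://|/\*)?\s*(?:\.\.\.|…)\s*(?:\*/)?\s*$")


def expand_ellipses(original, suggestion):
    """Return the suggestion's lines with each elision replaced by the
    original lines it most plausibly stands for."""
    orig = original.splitlines()
    expanded = []
    chunk = []           # consecutive non-ellipsis lines of the suggestion
    pos = 0              # next unconsumed line of the original
    pending_gap = False  # True when an ellipsis precedes the current chunk

    def flush():
        nonlocal pos, pending_gap
        if not chunk:
            return
        # Anchor the chunk in the not-yet-consumed part of the original,
        # ignoring leading/trailing whitespace on each line.
        matcher = difflib.SequenceMatcher(
            None,
            [line.strip() for line in orig[pos:]],
            [line.strip() for line in chunk],
            autojunk=False,
        )
        blocks = [b for b in matcher.get_matching_blocks() if b.size]
        if blocks:
            start = pos + blocks[0].a
            end = pos + blocks[-1].a + blocks[-1].size
            if pending_gap:
                # The ellipsis stood for these untouched original lines.
                expanded.extend(orig[pos:start])
            pos = end
            pending_gap = False
        # With no anchor at all, the chunk is kept as a pure insertion
        # placed before the elided lines, which is only a guess.
        expanded.extend(chunk)
        chunk.clear()

    for line in suggestion.splitlines():
        if ELLIPSIS_LINE.match(line):
            flush()
            pending_gap = True
        else:
            chunk.append(line)
    flush()
    if pending_gap:
        # A trailing ellipsis keeps the rest of the original file.
        expanded.extend(orig[pos:])
    return expanded


def diff_against_original(original, suggestion):
    """Unified diff (no context lines) between the original and the
    ellipsis-expanded suggestion."""
    expanded = expand_ellipses(original, suggestion)
    return list(
        difflib.unified_diff(
            original.splitlines(),
            expanded,
            fromfile="original",
            tofile="suggested",
            n=0,
            lineterm="",
        )
    )
```

Called on the file and suggestion from the example, diff_against_original should report a single added line, the char arr_char[5]; declaration, and no removals. The anchoring is the fragile part: a chunk made up only of very common lines such as a lone } can latch onto the wrong place in the original, so longer chunks or extra context around each ellipsis would make this more reliable.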