Python: Compare HTML tags in the RO folder with their corresponding tags in the EN folder and display in Output the unique tags from both files

Question

In short, I have two files: one in Romanian and its English translation. Some tags in the RO file have not been translated into EN. I want to write to an HTML output all the EN tags that have corresponding tags in RO, plus the RO tags that do not appear in EN.

I have these files:

   ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
   en_file_path = r'd:\3\en\where-do-you-see-look.html'
   Output =  d:\3\Output\where-do-you-see-look.html

TASK: Compare the 3 tags below, in both files.

(.*?)
(.*?)
(.*?)(.*?)

Requirements:

All tags are enclosed between the 'ARTICOL START' and 'ARTICOL FINAL' markers.
Count the tags in RO and count the tags in EN, and compare.
Then count the words in the tags in RO and compare with the number of words in the tags in EN.
Compare the HTML tags in RO with the HTML tags in EN, in order, and display in Output the unique tags from both files.

RO d:\3\ro\incotro-vezi-tu-privire.html

 
Stiu ca este dificil sa conduci la inceput, dar dupa 4-5 luni inveti. 
Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii. 
Sunt un bun conducator auto, dar am facut si greseli din care am invatat. 
În fond, cele scrise de mine, sunt adevarate. 
Iubesc sa conduc masina. 
Ma iubesti? 
Stiu ca este dificil sa conduci la inceput, dar dupa 4-5 luni inveti. 
Totul se repetă, chiar și ochii care nu se vad. 
BEE servesc o cafea 2 mai buna

EN d:\3\en\where-do-you-see-look.html


I like going to school and learning, especially during the week.
I'm a good driver, but I've also made mistakes that I've learned from.
Basically, what I wrote is true.
I love driving.
I know it's difficult to drive at first,  but after 4-5 months you learn.
Everything is repeated, even the eyes that can't see.

Expected OUTPUT: d:\3\Output\where-do-you-see-look.html


Stiu ca este dificil sa conduci la inceput,  dar dupa 4-5 luni inveti. 
I like going to school and learning, especially during the week.
I'm a good driver, but I've also made mistakes that I've learned from.
Basically, what I wrote is true.
Ma iubesti? 
I love driving.
I know it's difficult to drive at first,  but after 4-5 months you learn.
Everything is repeated, even the eyes that can't see.
BEE servesc o cafea 2 mai buna

The Python code must compare the HTML tags in RO with the HTML tags in EN and write to Output the unique tags from both files, taking into account that most RO tags have a corresponding translation among the EN tags. The key point is that the code must also find the RO tags that were left untranslated in EN.

Here's how I came up with the solution in Python code. I followed a simple calculation.

First method:

First, count all the tags in RO, then all the tags in EN. Record the type of each tag in RO and in EN. Then count the words in each RO tag and in each EN tag. Keep in mind that the same tag can appear twice, on different lines, as it does in RO. Finally, compute the difference: how many tags does RO have minus the tags in EN?
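A rough sketch of this counting step, assuming the tags have already been extracted into (type, text) pairs. The sample pairs and the types assigned to them are hand-typed placeholders, not the real parse of the two files, and `summarize` is a hypothetical helper:

```python
from collections import Counter

# Hypothetical stand-ins for the parsed tags: (tag type, text).
ro_tags = [
    ("text_obisnuit", "Iubesc sa conduc masina."),
    ("text_obisnuit2", "Ma iubesti?"),
    ("text_obisnuit", "Totul se repetă, chiar și ochii care nu se vad."),
]
en_tags = [
    ("text_obisnuit", "I love driving."),
    ("text_obisnuit", "Everything is repeated, even the eyes that can't see."),
]

def summarize(tags):
    """Return (number of tags per type, word count of each tag in order)."""
    per_type = Counter(tag_type for tag_type, _ in tags)
    words = [len(text.split()) for _, text in tags]
    return per_type, words

ro_types, ro_words = summarize(ro_tags)
en_types, en_words = summarize(en_tags)

# Counter subtraction keeps only positive counts,
# i.e. the tag types that RO has more of than EN.
extra_in_ro = ro_types - en_types

print(len(ro_tags) - len(en_tags))  # 1
print(dict(extra_in_ro))            # {'text_obisnuit2': 1}
print(ro_words, en_words)           # [4, 2, 10] [3, 9]
```

The counts only say how many tags of each type are extra in RO. They do not say which lines those are, which is why the tags still have to be compared in order.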

The second method, to verify the output, is to take a screenshot. Compare the entire RO part and the entire EN part separately through OCR, then go line by line to see which RO tags are extra compared to the EN tags.

PYTHON CODE:

import re
import os

def extract_tags(content):
    start = content.find('')
    end = content.find('')
    if start == -1 or end == -1:
        raise ValueError("The 'ARTICOL START' or 'ARTICOL FINAL' markers are missing.")

    section_content = content[start:end]
    pattern = re.compile(r'(?:)?.*?', re.DOTALL)
    tags = []

    for idx, match in enumerate(pattern.finditer(section_content), 1):
        tag = match.group(0)
        text = re.sub(r'<[^>]+>', '', tag).strip()

        if '' in tag or '' in tag:
            tag_type = 'span'
        elif 'class="text_obisnuit2"' in tag:
            tag_type = 'text_obisnuit2'
        else:
            tag_type = 'text_obisnuit'

        tags.append({
            'index': idx,
            'tag': tag,
            'text': text,
            'type': tag_type,
            'word_count': len(text.split())
        })
    return tags

def find_matching_pairs(ro_tags, en_tags):
    matched_indices = set()
    used_en = set()

    for i, ro_tag in enumerate(ro_tags):
        for j, en_tag in enumerate(en_tags):
            if j in used_en:
                continue

            if ro_tag['type'] == en_tag['type']:
                word_diff = abs(ro_tag['word_count'] - en_tag['word_count'])
                if word_diff <= 3:
                    matched_indices.add(i)
                    used_en.add(j)
                    break
    return matched_indices

def fix_duplicates(output_content, ro_content):
    """Fix the position of duplicate tags."""
    ro_tags = extract_tags(ro_content)
    output_tags = extract_tags(output_content)

    # Find the tags that appear in both RO and OUTPUT
    for ro_idx, ro_tag in enumerate(ro_tags):
        for out_idx, out_tag in enumerate(output_tags):
            if ro_tag['tag'] == out_tag['tag'] and ro_idx != out_idx:
                # Found a tag that appears at different positions
                # Check whether it is a duplicate that has to be moved
                ro_lines = ro_content.split('\n')
                out_lines = output_content.split('\n')

                if ro_tag['tag'] in ro_lines[ro_idx+1] and out_tag['tag'] in out_lines[out_idx+1]:
                    # Move the tag to the correct position
                    out_lines.remove(out_tag['tag'])
                    out_lines.insert(ro_idx+1, out_tag['tag'])
                    output_content = '\n'.join(out_lines)
                    break

    return output_content

def generate_output(ro_tags, en_tags, original_content):
    start = original_content.find('')
    end = original_content.find('')
    if start == -1 or end == -1:
        raise ValueError("The 'ARTICOL START' or 'ARTICOL FINAL' markers are missing.")

    output_content = original_content[:start + len('')] + "\n"
    matched_indices = find_matching_pairs(ro_tags, en_tags)
    en_index = 0

    for i, ro_tag in enumerate(ro_tags):
        if i in matched_indices:
            output_content += en_tags[en_index]['tag'] + "\n"
            en_index += 1
        else:
            output_content += ro_tag['tag'] + "\n"

    while en_index < len(en_tags):
        output_content += en_tags[en_index]['tag'] + "\n"
        en_index += 1

    output_content += original_content[end:]
    return output_content

def main():
    try:
        ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
        en_file_path = r'd:\3\en\where-do-you-see-look.html'
        output_file_path = r'd:\3\Output\where-do-you-see-look.html'

        with open(ro_file_path, 'r', encoding='utf-8') as ro_file:
            ro_content = ro_file.read()
        with open(en_file_path, 'r', encoding='utf-8') as en_file:
            en_content = en_file.read()

        ro_tags = extract_tags(ro_content)
        en_tags = extract_tags(en_content)

        # Generate the first output
        initial_output = generate_output(ro_tags, en_tags, en_content)

        # Fix the positions of duplicate tags
        final_output = fix_duplicates(initial_output, ro_content)

        with open(output_file_path, 'w', encoding='utf-8') as output_file:
            output_file.write(final_output)

        print(f"The output was written to {output_file_path}")

    except Exception as e:
        print(f"Error: {str(e)}")

if __name__ == "__main__":
    main()

My Python code is almost perfect, but not perfect. The problem occurs when I introduce other tags in RO, such as:

 
Laptopul meu este de culoare neagra.
Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii. 
Sunt un bun conducator auto, dar am facut si greseli din care am invatat. 
Stiu ca este dificil sa conduci la inceput, dar dupa 4-5 luni inveti.
În fond, cele scrise de mine, sunt adevarate. 
Iubesc sa conduc masina. 

Stiu ca este dificil sa conduci la inceput, dar dupa 4-5 luni inveti.
Totul se repetă, chiar și ochii care nu se vad.
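A small reproduction of what seems to go wrong, assuming all the paragraphs involved share the same tag type (the types and the `make` helper are hypothetical). The matching rule is the one from `find_matching_pairs` above, except that this version returns which EN index each RO index was paired with, so the pairing is visible:

```python
def find_matching_pairs(ro_tags, en_tags):
    """Greedy rule from the code above: same type, word counts within 3.

    Returns {ro_index: en_index} instead of a set, to expose the pairing.
    """
    pairs = {}
    used_en = set()
    for i, ro_tag in enumerate(ro_tags):
        for j, en_tag in enumerate(en_tags):
            if j in used_en:
                continue
            same_type = ro_tag["type"] == en_tag["type"]
            close = abs(ro_tag["word_count"] - en_tag["word_count"]) <= 3
            if same_type and close:
                pairs[i] = j
                used_en.add(j)
                break
    return pairs

def make(tag_type, text):
    """Hypothetical helper: build the dict shape that extract_tags produces."""
    return {"type": tag_type, "text": text, "word_count": len(text.split())}

# Assumption: every paragraph here is a plain "text_obisnuit" tag.
ro = [
    make("text_obisnuit", "Laptopul meu este de culoare neagra."),  # untranslated
    make("text_obisnuit", "Imi place sa merg la scoala si sa invat, "
                          "mai ales in timpul saptamanii."),
    make("text_obisnuit", "În fond, cele scrise de mine, sunt adevarate."),
]
en = [
    make("text_obisnuit", "I like going to school and learning, "
                          "especially during the week."),
    make("text_obisnuit", "Basically, what I wrote is true."),
]

pairs = find_matching_pairs(ro, en)
print(pairs)  # {0: 1, 1: 0}
```

In this sketch the untranslated first line (6 words) takes the EN tag "Basically, what I wrote is true." (6 words), which belongs to the third RO line, so the third line ends up reported as missing instead. Type plus a word-count tolerance is not enough to tell which tag is the real counterpart.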

Just Me · Accepted Answer

SECOND, and the BEST SOLUTION.

Finally I solved the problem, but not with ChatGPT or Claude. No other AI could find the solution, because it didn't know how to think about the solution.

In fact, to find the solution to this problem, you had to assign some identifiers to each tag, and do multiple searches.
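To illustrate the idea, here is a minimal sketch. It is not the code that follows. Each tag gets an identifier built from a type letter and a word-count bucket, and the two identifier sequences are then aligned with `difflib.SequenceMatcher` from the standard library, which is a swapped-in technique. The sample tags and their type letters are assumptions; the bucket thresholds (fewer than 7 words, up to 14, more than 14) are the ones used by `get_greek_identifier` in the code below.

```python
from difflib import SequenceMatcher

def identifier(tag_type, text):
    """Build an ID such as 'Cβ': type letter plus word-count bucket."""
    n = len(text.split())
    greek = "α" if n < 7 else "β" if n <= 14 else "γ"
    return tag_type + greek

# Hypothetical parsed tags: (type letter, text).
ro = [
    ("C", "Iubesc sa conduc masina."),
    ("B", "Ma iubesti?"),
    ("C", "Stiu ca este dificil sa conduci la inceput, "
          "dar dupa 4-5 luni inveti."),
]
en = [
    ("C", "I love driving."),
    ("C", "I know it's difficult to drive at first, "
          "but after 4-5 months you learn."),
]

ro_ids = [identifier(t, s) for t, s in ro]  # ['Cα', 'Bα', 'Cβ']
en_ids = [identifier(t, s) for t, s in en]  # ['Cα', 'Cβ']

# Align the two identifier sequences; autojunk is off because the
# sequences are short and repeated identifiers are meaningful here.
matcher = SequenceMatcher(a=ro_ids, b=en_ids, autojunk=False)

matched_ro = set()
for block in matcher.get_matching_blocks():
    matched_ro.update(range(block.a, block.a + block.size))

# RO tags whose identifier found no aligned partner in EN.
unmatched = [ro[i] for i in range(len(ro)) if i not in matched_ro]
print(unmatched)  # [('B', 'Ma iubesti?')]
```

An identifier match only says that two tags look alike in type and length, so the result is a list of candidates to check by eye. A translation whose word count crosses a bucket threshold will show up as a false positive.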

ChatGPT or Claude, or other AIs, will have to seriously consider this type of solution for such problems.

Here are the specifications, the way I thought about solving the problem. It's a different way of thinking about parsing.

https://pastebin.com/as2yw1UQ

Python code written by a friend of mine. I thought up the solution; he wrote the code:

from bs4 import BeautifulSoup
import re

def count_words(text):
    """Count the words in a text."""
    return len(text.strip().split())

def get_greek_identifier(word_count):
    """Pick the Greek identifier based on the word count."""
    if word_count < 7:
        return 'α'
    elif word_count <= 14:
        return 'β'
    else:
        return 'γ'

def get_tag_type(tag):
    """Determine the tag type (A, B, or C)."""
    if tag.find('span'):
        return 'A'
    elif 'text_obisnuit2' in tag.get('class', []):
        return 'B'
    return 'C'

def analyze_tags(content):
    """Analyze the tags and return information about each one."""
    soup = BeautifulSoup(content, 'html.parser')
    tags_info = []

    article_content = re.search(r'(.*?)',
                              content, re.DOTALL)

    if article_content:
        content = article_content.group(1)
        soup = BeautifulSoup(content, 'html.parser')

    for i, tag in enumerate(soup.find_all('p', recursive=False)):
        text_content = tag.get_text(strip=True)
        tag_type = get_tag_type(tag)
        word_count = count_words(text_content)
        greek_id = get_greek_identifier(word_count)

        tags_info.append({
            'number': i + 1,
            'type': tag_type,
            'greek': greek_id,
            'content': str(tag),
            'text': text_content
        })

    return tags_info

def compare_tags(ro_tags, en_tags):
    """Compare the tags and find the differences."""
    wrong_tags = []
    i = 0
    j = 0

    while i < len(ro_tags):
        ro_tag = ro_tags[i]
        if j >= len(en_tags):
            wrong_tags.append(ro_tag)
            i += 1
            continue

        en_tag = en_tags[j]

        if ro_tag['type'] != en_tag['type']:
            wrong_tags.append(ro_tag)
            i += 1
            continue

        i += 1
        j += 1

    return wrong_tags

def format_results(wrong_tags):
    """Format the results for display and saving."""
    type_counts = {'A': 0, 'B': 0, 'C': 0}
    type_content = {'A': [], 'B': [], 'C': []}

    for tag in wrong_tags:
        type_counts[tag['type']] += 1
        type_content[tag['type']].append(tag['content'])

    # Build the formatted result
    result = []

    # First line: the summary
    summary_parts = []
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            summary_parts.append(f"{type_counts[tag_type]} tags of type ({tag_type})")
    result.append("RO has the following in addition to EN: " + " and ".join(summary_parts))

    # Details for each tag type
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            result.append(f"\n{type_counts[tag_type]}({tag_type}), that is, {'these tags' if type_counts[tag_type] > 1 else 'this tag'}:")
            for content in type_content[tag_type]:
                result.append(content)
            result.append("")  # Blank line as a separator

    return "\n".join(result)

def merge_content(ro_tags, en_tags, wrong_tags):
    """Merge the RO and EN content, inserting the wrong tags at their original positions."""
    merged_tags = []

    # Build a dictionary of the wrong tags keyed by their original number
    wrong_dict = {tag['number']: tag for tag in wrong_tags}

    # Walk through the positions and decide which tag goes in each one
    current_en_idx = 0
    for i in range(max(len(ro_tags), len(en_tags))):
        position = i + 1

        # Check whether this position belongs to a wrong tag
        if position in wrong_dict:
            merged_tags.append(wrong_dict[position]['content'])
        elif current_en_idx < len(en_tags):
            merged_tags.append(en_tags[current_en_idx]['content'])
            current_en_idx += 1

    return merged_tags

def save_results(merged_content, results, output_path):
    """Save the merged content and the results to the output file."""
    final_content = '\n'
    final_content += '\n'

    # Add the merged content
    for tag in merged_content:
        final_content += tag + '\n'

    final_content += '\n'
    final_content += '\n'

    # Add the analysis results
    final_content += results

    # Save to the file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(final_content)

# Read the files
with open(r'd:/3/ro/incotro-vezi-tu-privire.html', 'r', encoding='utf-8') as file:
    ro_content = file.read()

with open(r'd:/3/en/where-do-you-see-look.html', 'r', encoding='utf-8') as file:
    en_content = file.read()

# Set the path of the output file
output_path = r'd:/3/Output/where-do-you-see-look.html'

# Analyze the tags
ro_tags = analyze_tags(ro_content)
en_tags = analyze_tags(en_content)

# Find the differences
wrong_tags = compare_tags(ro_tags, en_tags)

# Format the results
results = format_results(wrong_tags)

# Build the merged content
merged_content = merge_content(ro_tags, en_tags, wrong_tags)

# Print the results to the console
print(results)

# Save the results to the output file
save_results(merged_content, results, output_path)
