Just Me

Reputation: 1063

Python: Compare HTML tags in the RO folder with their corresponding tags in the EN folder and write the unique tags from both files to Output

In short, I have two files: one in Romanian, the other its English translation. Some tags in the RO file were never translated into EN. I want to write an HTML output that contains all the EN tags that have a corresponding RO tag, plus the RO tags that do not appear in EN.

I have these files:

   ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
   en_file_path = r'd:\3\en\where-do-you-see-look.html'
   output_file_path = r'd:\3\Output\where-do-you-see-look.html'

TASK: Compare the three tag types below in both files.

<p class="text_obisnuit">(.*?)</p>
<p class="text_obisnuit2">(.*?)</p>
<p class="text_obisnuit"><span class="text_obisnuit2">(.*?)</span>(.*?)</p>
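The three patterns overlap: a paragraph that opens with a span also matches the first pattern, so the order in which they are tried matters. A minimal sketch of one way to classify a single tag, trying the most specific pattern first (the names `TAG_PATTERNS` and `classify_tag` are illustrative and not part of the code in this post):

```python
import re

# The three patterns from the task, most specific first.
# TAG_PATTERNS and classify_tag are hypothetical names used only for this sketch.
TAG_PATTERNS = [
    ("span", re.compile(
        r'<p class="text_obisnuit"><span class="text_obisnuit2">(.*?)</span>(.*?)</p>',
        re.DOTALL)),
    ("text_obisnuit2", re.compile(r'<p class="text_obisnuit2">(.*?)</p>', re.DOTALL)),
    ("text_obisnuit", re.compile(r'<p class="text_obisnuit">(.*?)</p>', re.DOTALL)),
]

def classify_tag(tag):
    """Return the name of the first pattern that matches the whole tag, or None."""
    for name, pattern in TAG_PATTERNS:
        if pattern.fullmatch(tag.strip()):
            return name
    return None
```

Note that under this sketch a paragraph whose span is never closed, such as the "Ma iubesti?" line in the RO sample, falls through to the plain `text_obisnuit` type, because the span pattern requires a closing `</span>`.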

Requirements:

RO d:\3\ro\incotro-vezi-tu-privire.html

<!-- ARTICOL START --> 
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p> 
<p class="text_obisnuit2">Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii.</p> 
<p class="text_obisnuit">Sunt un bun conducator auto, dar am facut si greseli din care am invatat.</p> 
<p class="text_obisnuit">În fond, cele scrise de mine, sunt adevarate.</p> 
<p class="text_obisnuit">Iubesc sa conduc masina.</p> 
<p class="text_obisnuit"><span class="text_obisnuit2">Ma iubesti?</p> 
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p> 
<p class="text_obisnuit">Totul se repetă, chiar și ochii care nu se vad.</p> 
<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p> 
<!-- ARTICOL FINAL -->

   

EN d:\3\en\where-do-you-see-look.html

<!-- ARTICOL START -->
<p class="text_obisnuit2">I like going to school and learning, especially during the week.</p>
<p class="text_obisnuit">I'm a good driver, but I've also made mistakes that I've learned from.</p>
<p class="text_obisnuit">Basically, what I wrote is true.</p>
<p class="text_obisnuit">I love driving.</p>
<p class="text_obisnuit"><span class="text_obisinuit2">I know it's difficult to drive at first, </span> but after 4-5 months you learn.</p>
<p class="text_obisnuit">Everything is repeated, even the eyes that can't see.</p>
<!-- ARTICOL FINAL -->

Expected OUTPUT: d:\3\Output\where-do-you-see-look.html

<!-- ARTICOL START -->
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span> dar dupa 4-5 luni inveti.</p> 
<p class="text_obisnuit2">I like going to school and learning, especially during the week.</p>
<p class="text_obisnuit">I'm a good driver, but I've also made mistakes that I've learned from.</p>
<p class="text_obisnuit">Basically, what I wrote is true.</p>
<p class="text_obisnuit"><span class="text_obisnuit2">Ma iubesti?</p> 
<p class="text_obisnuit">I love driving.</p>
<p class="text_obisnuit"><span class="text_obisinuit2">I know it's difficult to drive at first, </span> but after 4-5 months you learn.</p>
<p class="text_obisnuit">Everything is repeated, even the eyes that can't see.</p>
<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p> 
<!-- ARTICOL FINAL -->

The Python code must compare the HTML tags in RO with the HTML tags in EN and write the unique tags from both files to Output, given that most RO tags have a corresponding translated tag in EN. The key point is that the code must also find the RO tags that were never translated into EN.

Here is how I arrived at my solution in Python. I followed a simple calculation.

First method:

First, count all the tags in RO, then all the tags in EN. Then record the type of each tag in RO and in EN. Then count the words in each RO tag and in each EN tag. Keep in mind that the same tag can appear twice on different lines, as it does in RO. Finally, compute the difference: how many tags are left when you subtract the EN tags from the RO tags?
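The counting step described above could be sketched roughly as follows. This only reports how many tags of each type RO has beyond EN; it does not say which lines they are. The names `tag_signature` and `surplus_by_type` are made up for this illustration, and treating any paragraph whose content starts with `<span` as the span type is an assumption.

```python
import re
from collections import Counter

# Matches both paragraph classes and captures the class name and the inner HTML.
P_TAG = re.compile(r'<p class="(text_obisnuit2?)">(.*?)</p>', re.DOTALL)

def tag_signature(html):
    """Return a list of (tag_type, word_count) pairs, one per matching <p> tag."""
    signature = []
    for css_class, inner in P_TAG.findall(html):
        # Assumption: a paragraph that opens with a span counts as the span type.
        tag_type = "span" if inner.lstrip().startswith("<span") else css_class
        text = re.sub(r"<[^>]+>", "", inner)  # strip any nested markup
        signature.append((tag_type, len(text.split())))
    return signature

def surplus_by_type(ro_html, en_html):
    """Count, per tag type, how many more tags RO has than EN."""
    ro_counts = Counter(tag_type for tag_type, _ in tag_signature(ro_html))
    en_counts = Counter(tag_type for tag_type, _ in tag_signature(en_html))
    # Counter subtraction drops zero and negative counts.
    return ro_counts - en_counts
```

For the RO and EN samples shown earlier, this kind of count would give the number of missing tags per type, which is a useful sanity check on whatever the merge step produces.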

The second method, used to verify the output, is to take a screenshot, run the entire RO part and the entire EN part through OCR separately, and then check line by line which RO tags are extra compared with the EN tags.

PYTHON CODE:

import re
import os

def extract_tags(content):
    start = content.find('<!-- ARTICOL START -->')
    end = content.find('<!-- ARTICOL FINAL -->')
    if start == -1 or end == -1:
        raise ValueError("The 'ARTICOL START' or 'ARTICOL FINAL' markers are missing.")

    section_content = content[start:end]
    pattern = re.compile(r'<p class="text_obisnuit(?:2)?">(?:<span class="text_obisnuit2">)?.*?</p>', re.DOTALL)
    tags = []

    for idx, match in enumerate(pattern.finditer(section_content), 1):
        tag = match.group(0)
        text = re.sub(r'<[^>]+>', '', tag).strip()

        if '<span class="text_obisnuit2">' in tag or '<span class="text_obisinuit2">' in tag:
            tag_type = 'span'
        elif 'class="text_obisnuit2"' in tag:
            tag_type = 'text_obisnuit2'
        else:
            tag_type = 'text_obisnuit'

        tags.append({
            'index': idx,
            'tag': tag,
            'text': text,
            'type': tag_type,
            'word_count': len(text.split())
        })
    return tags

def find_matching_pairs(ro_tags, en_tags):
    matched_indices = set()
    used_en = set()

    for i, ro_tag in enumerate(ro_tags):
        for j, en_tag in enumerate(en_tags):
            if j in used_en:
                continue

            if ro_tag['type'] == en_tag['type']:
                word_diff = abs(ro_tag['word_count'] - en_tag['word_count'])
                if word_diff <= 3:
                    matched_indices.add(i)
                    used_en.add(j)
                    break
    return matched_indices

def fix_duplicates(output_content, ro_content):
    """Fix the position of duplicate tags."""
    ro_tags = extract_tags(ro_content)
    output_tags = extract_tags(output_content)

    # Find the tags that appear in both RO and OUTPUT
    for ro_idx, ro_tag in enumerate(ro_tags):
        for out_idx, out_tag in enumerate(output_tags):
            if ro_tag['tag'] == out_tag['tag'] and ro_idx != out_idx:
                # Found a tag that appears at different positions
                # Check whether it is a duplicate that needs to be moved
                ro_lines = ro_content.split('\n')
                out_lines = output_content.split('\n')

                if ro_tag['tag'] in ro_lines[ro_idx+1] and out_tag['tag'] in out_lines[out_idx+1]:
                    # Move the tag to the correct position
                    out_lines.remove(out_tag['tag'])
                    out_lines.insert(ro_idx+1, out_tag['tag'])
                    output_content = '\n'.join(out_lines)
                    break

    return output_content

def generate_output(ro_tags, en_tags, original_content):
    start = original_content.find('<!-- ARTICOL START -->')
    end = original_content.find('<!-- ARTICOL FINAL -->')
    if start == -1 or end == -1:
        raise ValueError("The 'ARTICOL START' or 'ARTICOL FINAL' markers are missing.")

    output_content = original_content[:start + len('<!-- ARTICOL START -->')] + "\n"
    matched_indices = find_matching_pairs(ro_tags, en_tags)
    en_index = 0

    for i, ro_tag in enumerate(ro_tags):
        if i in matched_indices:
            output_content += en_tags[en_index]['tag'] + "\n"
            en_index += 1
        else:
            output_content += ro_tag['tag'] + "\n"

    while en_index < len(en_tags):
        output_content += en_tags[en_index]['tag'] + "\n"
        en_index += 1

    output_content += original_content[end:]
    return output_content

def main():
    try:
        ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
        en_file_path = r'd:\3\en\where-do-you-see-look.html'
        output_file_path = r'd:\3\Output\where-do-you-see-look.html'

        with open(ro_file_path, 'r', encoding='utf-8') as ro_file:
            ro_content = ro_file.read()
        with open(en_file_path, 'r', encoding='utf-8') as en_file:
            en_content = en_file.read()

        ro_tags = extract_tags(ro_content)
        en_tags = extract_tags(en_content)

        # Generate the first output
        initial_output = generate_output(ro_tags, en_tags, en_content)

        # Fix the positions of duplicate tags
        final_output = fix_duplicates(initial_output, ro_content)

        with open(output_file_path, 'w', encoding='utf-8') as output_file:
            output_file.write(final_output)

        print(f"Output was written to {output_file_path}")

    except Exception as e:
        print(f"Error: {str(e)}")

if __name__ == "__main__":
    main()

My Python code almost works, but not quite. The problem appears when I add other tags to the RO file, for example:

<!-- ARTICOL START --> 
<p class="text_obisnuit">Laptopul meu este de culoare neagra.</p>
<p class="text_obisnuit2">Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii.</p> 
<p class="text_obisnuit">Sunt un bun conducator auto, dar am facut si greseli din care am invatat.</p> 
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit">În fond, cele scrise de mine, sunt adevarate.</p> 
<p class="text_obisnuit">Iubesc sa conduc masina.</p> 

<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit">Totul se repetă, chiar și ochii care nu se vad.</p> 

<!-- ARTICOL FINAL -->

Upvotes: 0

Views: 53

Answers (2)

Just Me

Reputation: 1063

SECOND, and the BEST SOLUTION.

I finally solved the problem, though not with ChatGPT or Claude. None of the AI tools I tried found a solution, because they did not approach the problem the right way.

The key was to assign an identifier to each tag and then run several searches over those identifiers.
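To make the idea of identifiers concrete, here is a minimal sketch that labels one tag with a type letter and a word-count bucket. It assumes the same letters (A, B, C) and the same thresholds (under 7 words, up to 14, more than 14) as the code further down in this answer; the helper names `word_bucket` and `tag_identifier` and the combined `A-β` format are only illustrative.

```python
import re

def word_bucket(word_count):
    """Map a word count to a Greek letter, using the thresholds from this answer's code."""
    if word_count < 7:
        return "α"
    if word_count <= 14:
        return "β"
    return "γ"

def tag_identifier(tag_html):
    """Build an identifier such as 'A-β' for one <p> tag (hypothetical helper)."""
    # Check for a span first, because span paragraphs also contain class="text_obisnuit2".
    if "<span" in tag_html:
        tag_type = "A"
    elif 'class="text_obisnuit2"' in tag_html:
        tag_type = "B"
    else:
        tag_type = "C"
    text = re.sub(r"<[^>]+>", "", tag_html)  # drop the markup, keep the words
    return f"{tag_type}-{word_bucket(len(text.split()))}"
```

Two tags with the same identifier are candidates for being a RO/EN pair; tags whose identifiers find no partner are candidates for untranslated content.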

ChatGPT, Claude, and other AIs will have to seriously consider this type of solution for problems like this one.

Here are the specifications, that is, how I thought about solving the problem. It is a different way of thinking about parsing.

https://pastebin.com/as2yw1UQ

The Python code was written by a friend of mine. I came up with the approach; he wrote the code:

from bs4 import BeautifulSoup
import re

def count_words(text):
    """Count the words in a text."""
    return len(text.strip().split())

def get_greek_identifier(word_count):
    """Return the Greek-letter identifier for a given word count."""
    if word_count < 7:
        return 'α'
    elif word_count <= 14:
        return 'β'
    else:
        return 'γ'

def get_tag_type(tag):
    """Return the tag type (A, B, or C)."""
    if tag.find('span') is not None:
        return 'A'
    elif 'text_obisnuit2' in tag.get('class', []):
        return 'B'
    return 'C'

def analyze_tags(content):
    """Analyze the tags and return information about each one."""
    soup = BeautifulSoup(content, 'html.parser')
    tags_info = []

    article_content = re.search(r'<!-- ARTICOL START -->(.*?)<!-- ARTICOL FINAL -->',
                              content, re.DOTALL)

    if article_content:
        content = article_content.group(1)
        soup = BeautifulSoup(content, 'html.parser')

    for i, tag in enumerate(soup.find_all('p', recursive=False)):
        text_content = tag.get_text(strip=True)
        tag_type = get_tag_type(tag)
        word_count = count_words(text_content)
        greek_id = get_greek_identifier(word_count)

        tags_info.append({
            'number': i + 1,
            'type': tag_type,
            'greek': greek_id,
            'content': str(tag),
            'text': text_content
        })

    return tags_info

def compare_tags(ro_tags, en_tags):
    """Compare the tags and find the differences."""
    wrong_tags = []
    i = 0
    j = 0

    while i < len(ro_tags):
        ro_tag = ro_tags[i]
        if j >= len(en_tags):
            wrong_tags.append(ro_tag)
            i += 1
            continue

        en_tag = en_tags[j]

        if ro_tag['type'] != en_tag['type']:
            wrong_tags.append(ro_tag)
            i += 1
            continue

        i += 1
        j += 1

    return wrong_tags

def format_results(wrong_tags):
    """Format the results for display and saving."""
    type_counts = {'A': 0, 'B': 0, 'C': 0}
    type_content = {'A': [], 'B': [], 'C': []}

    for tag in wrong_tags:
        type_counts[tag['type']] += 1
        type_content[tag['type']].append(tag['content'])

    # Build the formatted result
    result = []

    # First line: the summary
    summary_parts = []
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            summary_parts.append(f"{type_counts[tag_type]} tag(s) of type ({tag_type})")
    result.append("RO contains the following extra tags compared with EN: " + " and ".join(summary_parts))

    # Details for each tag type
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            result.append(f"\n{type_counts[tag_type]}({tag_type}), that is, {'these tags' if type_counts[tag_type] > 1 else 'this tag'}:")
            for content in type_content[tag_type]:
                result.append(content)
            result.append("")  # Blank line as a separator

    return "\n".join(result)

def merge_content(ro_tags, en_tags, wrong_tags):
    """Merge the RO and EN content, inserting the wrong tags at their original positions."""
    merged_tags = []

    # Build a dictionary of the wrong tags, keyed by their original number
    wrong_dict = {tag['number']: tag for tag in wrong_tags}

    # Walk through the positions and decide which tag goes in each one
    current_en_idx = 0
    for i in range(max(len(ro_tags), len(en_tags))):
        position = i + 1

        # Check whether this position belongs to a wrong tag
        if position in wrong_dict:
            merged_tags.append(wrong_dict[position]['content'])
        elif current_en_idx < len(en_tags):
            merged_tags.append(en_tags[current_en_idx]['content'])
            current_en_idx += 1

    return merged_tags

def save_results(merged_content, results, output_path):
    """Save the merged content and the results to the output file."""
    final_content = '<!-- REZULTATE ANALIZA -->\n'
    final_content += '<!-- ARTICOL START -->\n'

    # Add the merged content
    for tag in merged_content:
        final_content += tag + '\n'

    final_content += '<!-- ARTICOL FINAL -->\n'
    final_content += '<!-- FINAL REZULTATE ANALIZA -->\n'

    # Add the analysis results
    final_content += results

    # Write to the file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(final_content)

# Read the files
with open(r'd:/3/ro/incotro-vezi-tu-privire.html', 'r', encoding='utf-8') as file:
    ro_content = file.read()

with open(r'd:/3/en/where-do-you-see-look.html', 'r', encoding='utf-8') as file:
    en_content = file.read()

# Define the path of the output file
output_path = r'd:/3/Output/where-do-you-see-look.html'

# Analyze the tags
ro_tags = analyze_tags(ro_content)
en_tags = analyze_tags(en_content)

# Find the differences
wrong_tags = compare_tags(ro_tags, en_tags)

# Format the results
results = format_results(wrong_tags)

# Generate the merged content
merged_content = merge_content(ro_tags, en_tags, wrong_tags)

# Print the results to the console
print(results)

# Save the results to the output file
save_results(merged_content, results, output_path)

Upvotes: 0

Just Me

Reputation: 1063

A friend wrote the code for me; finding a solution was quite difficult.

Keep in mind that not every combination of tags will give a perfect result. The approach relies on runs of structurally similar lines in both files, so it is a heuristic rather than an exact method. The code below works for about 90% of cases.

Otherwise, look for a Python library that parses HTML/XML and can extract the structure of the document, then compare tags at the same level; that should be simpler. Here is the code:

import re
import copy
import os


def replace_delimited_text(text, start_delimiter, end_delimiter, replacement):
    pattern = re.escape(start_delimiter) + r".*?" + re.escape(end_delimiter)
    # Pass the replacement as a function so backslashes in it are taken literally
    replaced_text = re.sub(pattern, lambda _match: replacement, text, flags=re.DOTALL)
    print("[replace_delimited_text] Text replacement completed.")
    return replaced_text


def extract_tags_from_file(file_path):
    print(f"[extract_tags_from_file] Reading file: {file_path}")
    with open(file_path, "r", encoding="utf-8") as file:
        html_content = file.read()

    combined_pattern = (
        r'<p class="text_obisnuit">.*?</p>|'
        r'<p class="text_obisnuit2">.*?</p>|'
        r'<p class="text_obisnuit"><span class="text_obisnuit2">.*?</span></p>'
    )

    all_matches = re.findall(combined_pattern, html_content, re.DOTALL)
    print(f"[extract_tags_from_file] Found {len(all_matches)} matches in {file_path}")
    return all_matches


def is_significant_length_difference(text1, text2, threshold=0.2):
    length_diff = abs(len(text1) - len(text2))
    is_different = (length_diff / max(len(text1), len(text2))) > threshold
    print(f"[is_significant_length_difference] Difference: {is_different}")
    return is_different


if __name__ == "__main__":
    ro_file_path = "d:/3/ro/incotro-vezi-tu-privire.html"
    en_file_path = "d:/3/en/where-do-you-see-look.html"

    ro_tags = extract_tags_from_file(ro_file_path)
    en_tags = extract_tags_from_file(en_file_path)

    print("[MAIN] RO tag count:", len(ro_tags), "EN tag count:", len(en_tags))

    if len(ro_tags) <= len(en_tags):
        raise Exception("There's nothing to transfer from the RO HTML article to the EN one!")

    final_en_tags = copy.deepcopy(en_tags)
    inserted_at = []
    i = 0

    while i < len(ro_tags):
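        # Check whether i is still within the bounds of final_en_tags
        if i < len(final_en_tags):
            # Check whether the tags have the same structure.
            # Note: the first check also matches span paragraphs, because they start with the same prefix.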
            if (
                (ro_tags[i].startswith('<p class="text_obisnuit">') and
                 final_en_tags[i].startswith('<p class="text_obisnuit">')) or
                (ro_tags[i].startswith('<p class="text_obisnuit2">') and
                 final_en_tags[i].startswith('<p class="text_obisnuit2">')) or
                (ro_tags[i].startswith('<p class="text_obisnuit"><span class="text_obisnuit2">') and
                 final_en_tags[i].startswith('<p class="text_obisnuit"><span class="text_obisnuit2">'))
            ):
                if is_significant_length_difference(ro_tags[i], final_en_tags[i]):
                    final_en_tags.insert(i, ro_tags[i])
                    inserted_at.append(i)
                i += 1
            else:
                final_en_tags.insert(i, ro_tags[i])
                inserted_at.append(i)
                i += 1
        else:
            # If i is past the end of final_en_tags, append the remaining RO tags
            final_en_tags.extend(ro_tags[i:])
            inserted_at.extend(range(i, len(ro_tags)))
            break

    print("[MAIN] Final RO:", len(ro_tags), "EN after insertions:", len(final_en_tags))
    print("[MAIN] Positions of inserted tags:", inserted_at)

    assert len(ro_tags) <= len(final_en_tags), "Missing paragraphs couldn't be filled out properly..."

    # Make sure the output directory exists
    output_dir = "d:/3/Output"
    os.makedirs(output_dir, exist_ok=True)

    # Read the contents of the EN file
    with open(en_file_path, "r", encoding="utf-8") as file:
        html_content = file.read()

    # Find and replace the section between the delimiters
    if final_en_tags:
        # Build the new content, including the delimiters
        new_content = "<!-- ARTICOL START -->\n" + "\n".join(final_en_tags) + "\n<!-- ARTICOL FINAL -->"
        # Replace the whole section
        res = replace_delimited_text(
            html_content,
            "<!-- ARTICOL START -->",
            "<!-- ARTICOL FINAL -->",
            new_content
        )

        # Save the result
        output_path = os.path.join(output_dir, "file.html")
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(res)
        print("[MAIN] Output saved to:", output_path)
    else:
        print("[MAIN] No changes made to save.")
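
The other suggestion in this answer, comparing the structure of the two files rather than walking them index by index, could be sketched as below. This is not the code my friend wrote: it swaps in `difflib.SequenceMatcher` from the standard library to align the two sequences of tag types, and it is not a full HTML parser. The names `tag_type` and `merge_by_structure` are made up for the sketch, and it assumes tag type alone is enough to align the files, which can still pair repeated tags of the same type ambiguously.

```python
import re
from difflib import SequenceMatcher

# Matches both paragraph classes; nested markup stays inside the match.
P_TAG = re.compile(r'<p class="text_obisnuit2?">.*?</p>', re.DOTALL)

def tag_type(tag):
    """Classify a <p> tag as 'span', 'text_obisnuit2' or 'text_obisnuit'."""
    if "<span" in tag:
        return "span"
    if tag.startswith('<p class="text_obisnuit2">'):
        return "text_obisnuit2"
    return "text_obisnuit"

def merge_by_structure(ro_html, en_html):
    """Return the EN tags plus the RO tags that have no structural counterpart in EN."""
    ro_tags = P_TAG.findall(ro_html)
    en_tags = P_TAG.findall(en_html)
    matcher = SequenceMatcher(
        a=[tag_type(t) for t in ro_tags],
        b=[tag_type(t) for t in en_tags],
        autojunk=False,
    )
    merged = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            # RO tags with no aligned EN tag: keep them untranslated.
            merged.extend(ro_tags[i1:i2])
        if op in ("equal", "insert", "replace"):
            # Aligned or EN-only tags: keep the EN version.
            # For "replace", keeping both sides is a choice made for this sketch.
            merged.extend(en_tags[j1:j2])
    return merged
```

Because the alignment looks for the longest common runs of tag types, it copes with extra RO tags at the start, middle, or end of the article, but a word-count or length check like the one above would still be needed to tell apart neighbouring tags of the same type.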

Upvotes: 0
