Patrolowaty

Reputation: 41

String encoding problem when using maps

I suspect that my string (with diacritic characters) has one encoding in my class and another in a HashMap (the same happens with other Map implementations). The string is defined in my class; I use it as a key in a map and put a value under it, but when I try to get the value back by that key, the lookup fails. The odd part: it works as expected in IntelliJ's Evaluate Expression. Some specifics:

IntelliJ IDEA 2019.3.1 (Community Edition)
Build #IC-193.5662.53, built on December 18, 2019
Runtime version: 11.0.5+10-b520.17 amd64
VM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Windows 10 10.0
GC: ParNew, ConcurrentMarkSweep
Memory: 1986M
Cores: 4
Registry:
Non-Bundled Plugins:

Project SDK: Java 1.8.0_231

To check that the problem is reproducible, I wrote this JUnit test:

@Test
public void test() {
    Map<String, String> map = new TreeMap<>();
    map.put("język", "polski");
    String res = map.get("język");
    System.out.println(res);
}

When the word "język" is put into the map it is converted to "j?zyk", but when I get it from the map it is also converted to "j?zyk", so everything looks fine. In my production code it is more complicated. I build the map from a list of strings using this code:

private Map<String, String> getBookDetails(HtmlElement from) {
    HtmlElement bookDetails = BOOK_DETAILS.getFirst(from);
    return Arrays.stream(bookDetails.asText().split(BOOK_DETAILS_SEPARATOR))
            .collect(MappingErrors.collector());
}

bookDetails.asXml:

<div class="collapse d-xs-none" id="book-details">
  <dl>
    <dt>
      
                            Tytuł oryginału:
                        
    </dt>
    <dd>
      
                            Wat?                        
    </dd>
    <dt>
      
                            Data wydania:
                        
    </dt>
    <dd>
      
                            2016-05-16                        
    </dd>
    <dt data-toggle="tooltip" title="Data pierwszego wydania polskiego">
      
                            Data 1. wyd. pol.:
                        
    </dt>
    <dd>
      
                            2016-05-16                        
    </dd>
    <dt>
      
                            Liczba stron:
                        
    </dt>
    <dd>
      
                            20                        
    </dd>
    <dt>
      
                            Język:
                        
    </dt>
    <dd>
      
                            polski                        
    </dd>
    <dt>
      
                            ISBN:
                        
    </dt>
    <dd>
      
                            9788374206600                        
    </dd>
    <dt>
      
                            Tłumacz:
                        
    </dt>
    <dd>
      <a href="https://lubimyczytac.pl/tlumacz/10593/ryszard-turczyn">
        Ryszard Turczyn
      </a>
    </dd>
    <dt class="d-lg-none">
      
                            Wydawnictwo:
                        
    </dt>
    <dd class="d-lg-none">
      <a href="https://lubimyczytac.pl/wydawnictwo/13832/wydawnictwo-adamada/ksiazki">
        Wydawnictwo Adamada
      </a>
    </dd>
  </dl>
</div>

The missing variables:

private String BOOK_DETAILS_SEPARATOR = "\r\n";

static final DefinedHtmlElement BOOK_DETAILS =
            new DefinedHtmlElement("div", "id", "book-details");
    

DefinedHtmlElement inner class:

static class DefinedHtmlElement {
        String elementName;
        String attributeName;
        String attributeValue;

        DefinedHtmlElement (String elementName, String attributeName, String attributeValue) {
            this.attributeName = attributeName;
            this.elementName = elementName;
            this.attributeValue = attributeValue;
        }

        public String getAttributeName() {
            return attributeName;
        }

        public String getAttributeValue() {
            return attributeValue;
        }

        public String getElementName() {
            return elementName;
        }

        public HtmlElement getFirst(HtmlElement element) {
            return element
                    .getElementsByAttribute(elementName, attributeName, attributeValue)
                    .stream().findFirst().orElse(null);
        }
    }

And collector:

private static final class MappingErrors {

        // Per-instance counter: a static one would carry its parity over between collections.
        private int counter = 1;

        private Map<String, String> map = new TreeMap<>();

        private String first;
        private String second;

        public void accept(String str) {
            first = second;
            second = str;
            if (first != null && counter % 2 == 0) {
                map.put(first.trim(), second.trim());
            }
            counter++;
        }

        public MappingErrors combine(MappingErrors other) {
            throw new UnsupportedOperationException("Parallel Stream not supported");
        }

        public Map<String, String> finish() {
            return map;
        }

        public static Collector<String, ?, Map<String, String>> collector() {
            return Collector.of(MappingErrors::new, MappingErrors::accept, MappingErrors::combine, 
             MappingErrors::finish);
        }

    }

The odd thing is that when the key/value pair is put into the map, the key is not converted to the question-mark version; it is written as it should be. But when I try to get the value by key, the key string is converted, no matching key is found, and the code does not work. I use the word "Język:" as the key and end up with "JÄ™zyk:" (Ä followed by the trade mark sign). Again, in a normal run or in debug I can't find the value by key, but in Evaluate Expression it works as expected.

I have no idea where to look for the root cause. I checked that all files have the same encoding (UTF-8 and windows-1252 behave the same way in this case) and that the whole project is set to the same encoding. There are no input files, only scraping from a web page and reading the String through com.gargoylesoftware.htmlunit.html.HtmlElement, in case that matters. Does anyone have an idea where to look for the root cause? Is encoding the right clue, or is it something totally different? Of course I can create a workaround that replaces all diacritic characters with plain ones, but I want to understand what is happening.
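One way to narrow this down is to print the code points of both strings instead of relying on how the console or debugger renders them. The following is a minimal sketch; the class name CodePointDump and the method dump are hypothetical helpers, not part of the code above. The sample strings are written with Unicode escapes so the sketch itself does not depend on the source file encoding.

```java
import java.util.stream.Collectors;

final class CodePointDump {
    private CodePointDump() {}

    /** Renders every code point of s as U+XXXX, separated by single spaces. */
    public static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The expected label: J, e-ogonek (U+0119), z, y, k, colon.
        String expected = "J\u0119zyk:";
        // The garbled variant: A-diaeresis (U+00C4) and trade mark sign (U+2122)
        // in place of the single e-ogonek.
        String garbled = "J\u00C4\u2122zyk:";
        System.out.println(dump(expected));
        System.out.println(dump(garbled));
    }
}
```

Calling dump on the scraped label and on the string literal used as the key shows directly whether the two differ, and in which characters.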

UPDATE: I found out that the data coming from gargoylesoftware (HtmlUnit) differs from my string literal. The problem is not in how the map is filled and has nothing to do with the map itself (the map is just the first place where it becomes visible). I modified the code a little:

private Map<String, String> getBookDetails(HtmlElement from) {
        HtmlElement bookDetails = BOOK_DETAILS.getFirst(from);
        String[] split = bookDetails.asText().split(BOOK_DETAILS_SEPARATOR);
        Map<String, String> mapa = new HashMap<>();
        for (int i=0;i<split.length-1;i+=2) {
            mapa.put(split[i].trim(), split[i+1].trim());
            if (split[i].trim().compareTo("Język:") == 0) {
                System.out.println("test");
            }
        }
        mapa.put("Język:","TEST");
        return mapa;
}

The condition in the if is never true. It is true only in Evaluate Expression; at run time the println line is never reached. The mapa object looks like this:

"Data 1. wyd. pol.:" -> "2016-05-16"
"Liczba stron:" -> "20"
"Data wydania:" -> "2016-05-16"
"Tłumacz:" -> "Ryszard Turczyn"
"JÄ™zyk:" -> "TEST"
"Tytuł oryginału:" -> "Wat?"
"Język:" -> "polski"
"Wydawnictwo:" -> "Wydawnictwo Adamada"
"ISBN:" -> "9788374206600"

So the manually added entry was somehow changed to the "™" version. That by itself is fine, because the same change takes place when getting a value from this map, so the lookup returns the right value. But why is there a difference between the manually entered string and the one from gargoylesoftware?
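The "Ä™" pattern is what the UTF-8 bytes of "ę" (0xC4 0x99) look like when they are decoded as windows-1252. One explanation consistent with that, though not confirmed here, is that the source file is saved as UTF-8 while the compiler reads it with the Windows default encoding, so the literal is already garbled in the compiled class. The sketch below only reproduces the character mix-up; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    private MojibakeDemo() {}

    /** Encodes s as UTF-8 and decodes the same bytes as windows-1252. */
    public static String misreadUtf8AsCp1252(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Written with a Unicode escape so this file is pure ASCII.
        String label = "J\u0119zyk:";
        String garbled = misreadUtf8AsCp1252(label);
        // The single e-ogonek becomes two characters: U+00C4 and U+2122.
        System.out.println(garbled.length() - label.length());
        System.out.println(garbled.equals("J\u00C4\u2122zyk:"));
    }
}
```

If this matches what the debugger shows for the literal, the scraped text is the correct one and the literal in the source is the string that was altered.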

Upvotes: 2

Views: 1007

Answers (1)

Patrolowaty

Reputation: 41

I found it! It comes down to a complicated interplay of encodings between IntelliJ, Windows and the web page. The data in HtmlElement is UTF-8, a Java String is UTF-16, Windows has its own encoding, and IntelliJ uses some combination of all of those. I played a little with the String constructor and found the right combination:

new String(labelFromHtmlElement.getBytes("UTF-8"), "windows-1252");

Programming with diacritic characters can be complicated :)
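A possible alternative to re-encoding every scraped label, assuming the string literals in the source are the part being misread, is to write the keys with Unicode escapes so they no longer depend on the encoding the compiler uses (passing -encoding UTF-8 to javac is another option). This is a sketch under that assumption; EscapedKeyLookup, LANGUAGE_LABEL, language and scrapedLabel are hypothetical names, and scrapedLabel only simulates text that was decoded correctly from UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class EscapedKeyLookup {
    private EscapedKeyLookup() {}

    // The Polish label with e-ogonek written as a Unicode escape,
    // so the compiled constant is the same under any source encoding.
    static final String LANGUAGE_LABEL = "J\u0119zyk:";

    /** Looks up the language entry in a map of scraped book details. */
    public static String language(Map<String, String> details) {
        return details.get(LANGUAGE_LABEL);
    }

    /** Simulates a label decoded correctly from UTF-8 bytes, as scraped text would be. */
    public static String scrapedLabel() {
        byte[] utf8 = {0x4A, (byte) 0xC4, (byte) 0x99, 0x7A, 0x79, 0x6B, 0x3A};
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> details = new HashMap<>();
        details.put(scrapedLabel(), "polski");
        System.out.println(language(details));
    }
}
```

With this approach the scraped strings stay untouched, and only the constants in the source change.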

Upvotes: 1
