Reputation: 122142

Given an offset dictionary and a string, create the BIO tags

Given (i) a string text and (ii) a dictionary made of offset and the substring y:

text = "\u5723\u4f9d\u534e\u5802'\"\uff08\uff09\u662f\u4e00\u5ea7\u4f4d\u4e8e\u610f\u5927\u5229\u7f57\u9a6c\u7684\u7f57\u9a6c\u5929\u4e3b\u6559\u6559\u5802\u3002\u8fd9\u5ea7\u6559\u5802\u88ab\u8ba4\u4e3a\u662f\u5df4\u6d1b\u514b\u6559\u5802\u5efa\u7b51\u7684\u6770\u4f5c\uff0c\u7531\u5efa\u7b51\u5e08\u5f17\u6717\u5207\u65af\u79d1\u00b7\u535a\u7f57\u7c73\u5c3c\u4fee\u5efa\u4e8e1642\u81f31660\u5e74\u3002\n\u5386\u53f2.\n\u8fd9\u5ea7\u6559\u5802\u59cb\u5efa\u4e8e14\u4e16\u7eaa\uff0c\u5f53\u65f6\u662f\u4f5c\u4e3a\u7f57\u9a6c\u4e0a\u667a\u5927\u5b66\u5bab\u7684\u5c0f\u6559\u5802\u3002\u8fd9\u6240\u5927\u5b66\u88ab\u79f0\u4e3a\u4e0a\u667a\uff08\"La Sapienza\"\uff09\uff0c\u800c\u8fd9\u5ea7\u6559\u5802\u4f9b\u5949\u7684\u662f\u5723\u4f9d\u534e\uff08Saint Yves\uff0c\u6cd5\u5b66\u5bb6\u7684\u4e3b\u4fdd\u5723\u4eba\uff09\uff0c\u56e0\u800c\u5f97\u540d\u3002\u6ce2\u7f57\u7c73\u5c3c\u88ab\u8feb\u4f7f\u4ed6\u7684\u8bbe\u8ba1\u9002\u5e94\u73b0\u6709\u7684\u5bab\u6bbf\u3002\u4ed6\u9009\u62e9\u4e00\u4e2a\u7c7b\u4f3c\u4e8e\u5927\u536b\u4e4b\u661f\u7684\u89c4\u5212\uff0c\u5c06\u6559\u5802\u7684\u7acb\u9762\u4e0e\u5bab\u6bbf\u7684\u5ead\u9662\u5408\u5e76\u3002\u5176\u7a79\u9876\u6709\u4e00\u4e2a\u975e\u5e38\u65b0\u9896\u7684\u87ba\u65cb\u5f62\u9876\u5854\u3002\u5185\u90e8\u4ee4\u4eba\u773c\u82b1\u7f2d\u4e71\u7684\u51e0\u4f55\u4f53\u6784\u6210\u590d\u6742\u7684\u97f5\u5f8b\u3002\u8fd9\u662f\u4e00\u5ea7\u7406\u6027\u7684\u5efa\u7b51- \u770b\u8d77\u6765\u5f88\u7eb7\u4e71\uff0c\u4f46\u662f\u5176\u5e73\u9762\u91cd\u53e0\u7740\u4e00\u5708\u5708\u4e24\u4e24\u53e0\u52a0\u7684\u7b49\u8fb9\u4e09\u89d2\u5f62\uff0c\u5728\u4e00\u5ea7\u96c6\u4e2d\u7684\u6559\u5802\u5185\uff0c\u6784\u6210\u4e00\u5217\u516d\u89d2\u5f62\u5c0f\u6559\u5802\u548c\u796d\u53f0\u7684\u57fa\u7840\u3002\u5185\u90e8\u7684\u51f9\u51f8\u8d77\u4f0f\uff0c\u5f62\u6210\u4e0d\u548c\u8c10\u7136\u800c\u4f7f\u4eba\u7729\u6655\u7684\u9b45\u529b\u3002\u88c5\u9970\u662f\u516d\u7ffc\u5929\u4f7f\u5934\u50cf\u548c\u51e0\u4f55\u4f53\uff08\u661f\u661f\uff09\u7684\u7ec4\u5408\uff0c\u6bd4\u8d77\u540c\u65f6\u4ee3\u7684\u6d4e\u5b89\u00b7\u8d1d\u5c3c\u5c3c\u8fc7\u5ea6\u4f7f\u7528\u9540\u91d1\u548c\u77f3\u818f\uff0c\u8f83\u4e3a\u67cf\u62c9\u56fe\u3002\u6cbf\u4e09\u6839\u5706\u9876\u652f\u67f1\u57fa\u7840\u5411\u4e0a\uff0c\u662f\u6559\u5b97\u4e9e\u6b77\u5c71\u5927\u4e03\u4e16\u7684\u57fa\u5409\u5bb6\u65cf\u6807\u5fd7\u201c\u4e00\u9897\u661f\u661f\u4e0b\u65b9\u7684\u516d\u5ea7\u5c71\u201d\u3002"

y = {16: '罗马', 19: '罗马天主教', 24: '教堂', 35: '巴洛克', 
     50: '弗朗切斯科·博罗米尼', 96: '罗马上智大学', 142: '圣依华', 
     199: '大卫之星', 376: '济安·贝尼尼', 413: '亞歷山大七世', 420: '基吉家族'}

To see the raw text : https://gist.githubusercontent.com/alvations/4c3460cf2dda5e19d828a8bace532216/raw/28190258c2be8ed7123033e77edefbf9201e1845/text

The goal is to iterate through all the characters in the text and then print

O if the character is Outside/not in the substring dictionary
B if the character is the Beginning/start of a substring in the dictionary
I if the character is Inside/part of a substring in the dictionary

I've tried this which produces the expected output:

bio = []
bi_end = 0

for i, ch in enumerate(text):
    if i in y:
        bio.append('B')
        bi_end = i + len(y[i])
    elif i < bi_end:
        bio.append('I')
    else:
        bio.append('O')

Which kind of worked and produced:

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'B', 'I', 'I', 'I', 'I', 'B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'I', 'I', 'O', 'B', 'I', 'I', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Or in string:

>>> ''.join(bio)
'OOOOOOOOOOOOOOOOBIOBIIIIBIOOOOOOOOOBIIOOOOOOOOOOOOBIIIIIIIIIOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOBIIIIIOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOBIIOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOBIIIOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOBIIIIIOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOBIIIIIOBIIIOOOOOOOOOOOOOOO'

But is there a simpler to achieve the same desired output? Maybe with some regex and replacement?

Or some special str.translate table operation?

Upvotes: 1

Answers (2)

Kache

Reputation: 16697

Runtime can be improved by iterating over your substring dictionary instead, since the input text doesn't appear to really matter. (And exit early once key index exceeds input text size.)

Just as an example, esp since the substring dict is already sorted by index, consider how your example's first key is 16. You're guaranteed for the first 16 output chars to be O without even checking the text.

The runtime of this is min(O(n), O(m)) for length of text n and size of dict m. (Though technically as stated you lose the savings by outputting n chars for output.)

Upvotes: 1

J_H

Reputation: 20495

This problem is about ranges, rather than characters. You wrote {16: '罗马', 19: '罗马天主教', ... but it could as easily have been

y = {16: 'xx', 19: 'xxxxx', ...

y = {16: 2, 19: 5, ...

The algorithm you supplied looks great. First case exploits a data structure that offers O(1) constant time access, and the other two cases are even simpler. You need to produce N output characters, and you have a O(N) linear time algorithm. re and .translate() won't offer improvements on that.

Upvotes: 1

Given an offset dictionary and a string, create the BIO tags

Answers (2)

Related Questions