How to read the cedict (a space separated file) with regex groups?

Question

CEDICT is a resource for Chinese text analysis.

The plaintext file looks like this:

# CC-CEDICT
# Community maintained free Chinese-English dictionary.
# 
# Published by MDBG
% % [pa1] /percent (Tw)/
21三體綜合症 21三体综合症 [er4 shi2 yi1 san1 ti3 zong1 he2 zheng4] /trisomy/Down's syndrome/
3C 3C [san1 C] /abbr. for computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/
3P 3P [san1 P] /(slang) threesome/
A A [A] /(slang) (Tw) to steal/

The file has 4 columns, separated by spaces. The 3rd column (in square brackets) and the 4th column (between slashes) can themselves contain spaces, and each still counts as one column. Lines that start with # need to be skipped.

E.g. for the line:

3C 3C [san1 C] /abbr. for computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/

The content in the columns would be

3C
3C
[san1 C]
/abbr. for computers, communications, and consumer electronics/China Compulsory Certificate (CCC)/

Currently, to read the file, I've been using a mix of str.split and re.findall, and skipping comment lines with str.startswith(), i.e.:

import re
from collections import namedtuple


DictEntry = namedtuple('Dictionary', 'traditional simplified pinyin glosses')

dictfile = 'cedict_ts.u8'
cedict = {}

with open(dictfile, 'r', encoding='utf8') as fin:
    for line in fin:
        if line.startswith('#'):
            continue
        # Note: columns are separated by spaces, NOT by tabs.
        line = line.strip()
        trad, sim, *stuff = line.split()
        pinyin = re.findall(r'\[([^]]*)\]', line)[0]
        glosses = re.findall(r'/.*/', line)[0].strip('/').split('/')
        entry = DictEntry(traditional=trad, simplified=sim, pinyin=pinyin, glosses=glosses)
        cedict[sim] = entry

It looks like the str and regex operations could be simplified into a single regex, with the columns extracted using groups. How can I read the CEDICT file (a space-separated file) with regex groups?

I've also tried this regex with 4 groups:

(.*)\s(.*)\s(\[([^]]*)\])\s(/.*/)

But somehow the first (.*)\s is greedy and it captures the whole line: https://regex101.com/r/1c0O0E/1

I've tried this:

.+\s(\[([^]]*)\])\s(/.*/)

Here the first .+\s captures everything up to the [. But that means I'd still have to use str.split() to get the first 2 columns.

Dmitry Egorov · Accepted Answer

Use "non-space" (\S) instead of just "anything" (.):

^(\S+)\s+(\S+)\s+(\[[^]]+\])\s+(/.*/)$

I also added begin-of-text and end-of-text anchors (^ & $) to rule out any lines not matching the required pattern (e.g. comment lines).

Demo: https://regex101.com/r/0QNzVi/3
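
For reference, here is a minimal sketch of how that pattern could be plugged into the reading loop from the question. The names LINE_RE and parse_line are made up for this illustration. It also assumes you want the pinyin without its square brackets and the glosses as a list, so the brackets and the outer slashes are moved outside the capturing groups; that is a small variation on the regex above, not the only way to do it.

```python
import re
from collections import namedtuple

DictEntry = namedtuple('DictEntry', 'traditional simplified pinyin glosses')

# Variation on the pattern above (an assumption of this sketch): the
# brackets sit outside the pinyin group and the outer slashes sit
# outside the glosses group, so the groups hold only the content.
LINE_RE = re.compile(r'^(\S+)\s+(\S+)\s+\[([^]]+)\]\s+/(.*)/$')


def parse_line(line):
    """Return a DictEntry, or None for comment and malformed lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    trad, sim, pinyin, glosses = match.groups()
    return DictEntry(trad, sim, pinyin, glosses.split('/'))


# A few lines taken from the sample in the question.
sample = [
    '# CC-CEDICT',
    '% % [pa1] /percent (Tw)/',
    '3C 3C [san1 C] /abbr. for computers, communications, and '
    'consumer electronics/China Compulsory Certificate (CCC)/',
]

# Comment lines do not match the pattern, so no startswith('#') check is needed.
entries = [e for e in map(parse_line, sample) if e is not None]
cedict = {e.simplified: e for e in entries}
```

When reading the real file, the same parse_line call can replace the str.split and re.findall steps inside the `for line in fin:` loop. Note that keying the dict on the simplified form keeps only the last entry when several entries share it, just as in the original code.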
