coredumped0x

Reputation: 848

Regular Expression: return everything in a line prior to the second tab occurrence

I have a corpus file containing data in the following format:

Hi.   bonjour.  CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)
black!  noir!   CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)

Essentially, each line is split into three fields by tab characters (\t), e.g.:

Hi \t bonjour \t CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)

I'm trying to get only the key:value pair:

Hi.   bonjour.
black!  noir!

and discard everything that comes after it. This is how I used to do it before the extra metadata was added after the key:value pair:

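import codecs


def load_doc(filename):
    # Read the whole corpus file as UTF-8 text
    with codecs.open(filename, "r", "utf-8") as file:
        return file.read()


def to_pairs(doc):
    # One entry per line, each split into its tab-separated fields
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs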

pairs = to_pairs(load_doc(filename))

Thank you for your help!

Upvotes: 0

Views: 203

Answers (2)

Toto

Reputation: 91438

Here is a way to do the job:

import re

lines = [
    'Hi.\tbonjour.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)',
    'black!\tnoir!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)',
]

for line in lines:
    pairs = re.search(r'^(.+?)\t(.+?)(?=\t)', line)
    print(pairs.groups())

    # print() is called as a function because the OP uses Python 3

Output:

('Hi.', 'bonjour.')
('black!', 'noir!')

Upvotes: 1

The fourth bird

Reputation: 163372

You could use 2 negated character classes and 2 capturing groups.

^([^\t]+)\t([^\t]+)
  • ^ Start of string (can be omitted when using re.match)
  • ([^\t]+) Capture group 1: match one or more characters other than a tab
  • \t Match a tab
  • ([^\t]+) Capture group 2: match one or more characters other than a tab


If you don't want the match to cross a newline, you could add \r and \n to the character class: [^\t\r\n]

For example:

import re

doc = ("Hi.\tbonjour.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)\n"
       "black!\tnoir!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)")

lines = doc.strip().split('\n')
pairs = [re.match(r"([^\t]+)\t([^\t]+)", line).groups() for line in lines]
print(pairs)

Output:

[('Hi.', 'bonjour.'), ('black!', 'noir!')]
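
For the newline-safe character class [^\t\r\n], a minimal sketch might run the pattern over the whole document at once instead of splitting it into lines first. Using re.MULTILINE together with findall is a choice made for this illustration, not part of the answer above:

```python
import re

doc = (
    "Hi.\tbonjour.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)\n"
    "black!\tnoir!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)"
)

# re.MULTILINE lets ^ anchor at the start of every line; excluding \r and \n
# from the character class keeps each captured field on a single line.
pattern = re.compile(r"^([^\t\r\n]+)\t([^\t\r\n]+)", re.MULTILINE)

# With two capture groups, findall returns a list of 2-tuples.
pairs = pattern.findall(doc)
print(pairs)
```

Without \r and \n in the class, a negated class such as [^\t] would happily match across line breaks, which is why the exclusion matters once the text is no longer split per line.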

Upvotes: 1
