Reputation: 848
I have a corpus file that contains data in the following format:
Hi. bonjour. CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)
black! noir! CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)
Essentially, each line is split into three fields by \t, e.g.:
Hi \t bonjour \t CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)
I'm trying to get only the key:value pair:
Hi. bonjour.
black! noir!
and avoid everything else that comes next. This is how I used to do it before the extra metadata was added following the key:value:
import codecs

def load_doc(filename):
    # Read the whole file as UTF-8 text
    with codecs.open(filename, "r", "utf-8") as file:
        return file.read()

def to_pairs(doc):
    # One entry per line, each split on tabs
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs

pairs = to_pairs(load_doc(filename))
Thank you for your help!
Upvotes: 0
Views: 203
Reputation: 91438
Here is a way to do the job:
import re

lines = [
    'Hi.\tbonjour.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)',
    'black!\tnoir!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)',
]
for line in lines:
    # Capture the first two tab-separated fields; the lookahead requires a tab after the second one
    pairs = re.search(r'^(.+?)\t(.+?)(?=\t)', line)
    # print() is called with parentheses because the OP uses Python 3
    print(pairs.groups())
Output:
('Hi.', 'bonjour.')
('black!', 'noir!')
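To connect this with the file-based code in the question, here is a minimal sketch that applies the same pattern to lines read from a file. The helper name load_pairs, the temporary demo file, and the choice to skip non-matching lines are my assumptions, not part of the original question. Note that because of the lookahead, a line with only two fields (no trailing tab) would be skipped.

```python
import os
import re
import tempfile

# Same pattern as above: two lazily matched fields, the second followed by a tab.
PAIR_RE = re.compile(r'^(.+?)\t(.+?)(?=\t)')

def load_pairs(path):
    """Hypothetical helper: return (source, target) tuples from a tab-separated file."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            match = PAIR_RE.search(line)
            if match:  # lines without a third tab-separated field do not match
                pairs.append(match.groups())
    return pairs

# Demo with a throwaway file standing in for the real corpus.
sample = (
    "Hi.\tbonjour.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)\n"
    "black!\tnoir!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
demo_pairs = load_pairs(tmp.name)
os.remove(tmp.name)
print(demo_pairs)
```

Reading line by line avoids loading the whole corpus into memory, which may matter for large files.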
Upvotes: 1
Reputation: 163372
You could use 2 negated character classes and 2 capturing groups.
^([^\t]+)\t([^\t]+)
^ : Start of string (can be omitted when using re.match)
([^\t]+) : Capture group 1, match 1 or more characters other than a tab
\t : Match a tab
([^\t]+) : Capture group 2, match 1 or more characters other than a tab

If you don't want to cross a newline, you could add that to the character class: [^\t\r\n]
For example:
import re

# Fields are separated by tab characters (written here as \t escapes)
doc = ("Hi.\tbonjour.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)\n"
       "black!\tnoir!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)")
lines = doc.strip().split('\n')
pairs = [re.match(r"([^\t]+)\t([^\t]+)", line).groups() for line in lines]
print(pairs)
Output
[('Hi.', 'bonjour.'), ('black!', 'noir!')]
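As a possible variation, the character class that also excludes newlines can be applied to the whole document at once instead of splitting it into lines first. This is only a sketch: the name PAIR_RE and the use of re.MULTILINE with findall are my choices, and the sample text stands in for the real file contents.

```python
import re

# Sample text standing in for the file contents; fields are tab-separated.
doc = (
    "Hi.\tbonjour.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #629296 (Samer)\n"
    "black!\tnoir!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1245450 (saeb)"
)

# With re.MULTILINE, ^ matches at the start of every line.
# [^\t\r\n] keeps each field from running across a tab or a line break.
PAIR_RE = re.compile(r"^([^\t\r\n]+)\t([^\t\r\n]+)", re.MULTILINE)

# findall returns one tuple per match when the pattern has two groups.
pairs = PAIR_RE.findall(doc)
print(pairs)
```

Lines that do not contain at least two tab-separated fields simply produce no match here, whereas the list comprehension with re.match(...).groups() would raise an AttributeError on them.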
Upvotes: 1