Reputation: 403
I have the following exported text file:
14:00:01 type1 "xyz" has no relationships... ಠ_ಠ
14:00:01 type2 "xyza" has no relationships... ಠ_ಠ
14:00:01 type2 "aaaa" has no relationships... ಠ_ಠ
14:00:01 type3 "asdg" has no relationships... ಠ_ಠ
14:00:01 type4 "dhj" has no relationships... ಠ_ಠ
I'm trying to retrieve two pieces of information from each line of this file: the type and the text inside the double quotes.
Expected output:
type1 xyz
type2 xyza
type2 aaaa
type3 asdg
type4 dhj
With my current code, I can get the content inside the double quote, but I don't know how to get the type and merge it with my regex:
import os, yaml
import argparse
import re
with open('stackoverflow.txt') as f:
    content = f.readlines()
matches = re.findall(r'\"(.+?)\"', str(content))  # get the content within the double quotes
for x in matches:
    print(x)
Current output:
xyz
xyza
aaaa
asdg
dhj
Upvotes: 2
Views: 60
Reputation: 626747
You can use the regex (\w+) "([^"]*)" to get a match on each line, then format the output as you need by grabbing the two capture groups from each match, if one is found:
import re
matches = []
rx = re.compile(r'(\w+) "([^"]*)"')
with open('stackoverflow.txt') as f:
    for line in f:
        m = rx.search(line)
        if m:
            matches.append(f'{m.group(1)} {m.group(2)}')
See the Python demo:
import re
file=r'''14:00:01 type1 "xyz" has no relationships... ಠ_ಠ
14:00:01 type2 "xyza" has no relationships... ಠ_ಠ
14:00:01 type2 "aaaa" has no relationships... ಠ_ಠ
14:00:01 type3 "asdg" has no relationships... ಠ_ಠ
14:00:01 type4 "dhj" has no relationships... ಠ_ಠ'''
matches = []
rx = re.compile(r'(\w+) "([^"]*)"')
for line in file.splitlines():
    m = rx.search(line)
    if m:
        matches.append(f'{m.group(1)} {m.group(2)}')
print(matches)
# => ['type1 xyz', 'type2 xyza', 'type2 aaaa', 'type3 asdg', 'type4 dhj']
Upvotes: 1
Reputation: 721
If your txt files will ALWAYS have that structure, I would simply do:
with open('stackoverflow.txt') as f:
    matches = [' '.join(line.split(' ')[1:3]) for line in f.readlines()]
for x in matches:
    print(x)
Output:
type1 "xyz"
type2 "xyza"
type2 "aaaa"
type3 "asdg"
type4 "dhj"
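Note that this output still carries the double quotes, which the expected output in the question does not. A minimal sketch of one way to drop them, assuming the quoted text never contains a space (the `io.StringIO` object is only a stand-in for the opened file, and `kind` and `name` are illustrative variable names):

```python
import io

# Stand-in for open('stackoverflow.txt'); the sample lines come from the question.
sample = io.StringIO(
    '14:00:01 type1 "xyz" has no relationships... ಠ_ಠ\n'
    '14:00:01 type2 "xyza" has no relationships... ಠ_ಠ\n'
)

matches = []
for line in sample:
    # Fields are: timestamp, type, quoted name, then the rest of the message.
    _, kind, name = line.split(' ')[:3]
    # Strip the surrounding double quotes from the name.
    matches.append(f'{kind} {name.strip(chr(34))}')

for x in matches:
    print(x)
```

If the quoted text can contain spaces, splitting on spaces would cut it short, and one of the regex-based answers is probably the safer choice.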
Upvotes: 2
Reputation: 5387
Just add a second group to the regex:
import re
with open('stackoverflow.txt') as f:
    content = f.read()  # one string; re.findall raises TypeError on the list from readlines()
matches = re.findall(r' (\S+) "(.+?)"', content)
for x in matches:
    print(x[0], x[1])
Note: consider using re.finditer and iterating directly over f instead of re.findall and f.readlines, as they use iterators instead of lists.
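The iterator-based variant suggested in the note could look roughly like this. It is a sketch rather than a drop-in replacement: `io.StringIO` stands in for the opened file so the example runs on its own, and the `results` list is collected only to make the outcome easy to inspect.

```python
import io
import re

rx = re.compile(r' (\S+) "(.+?)"')

# io.StringIO stands in for open('stackoverflow.txt') so the sketch is self-contained.
f = io.StringIO(
    '14:00:01 type1 "xyz" has no relationships... ಠ_ಠ\n'
    '14:00:01 type2 "aaaa" has no relationships... ಠ_ಠ\n'
)

results = []
for line in f:                   # reads one line at a time, not the whole file
    for m in rx.finditer(line):  # yields match objects lazily
        results.append((m.group(1), m.group(2)))

for kind, name in results:
    print(kind, name)
```

For a large file, printing inside the inner loop instead of appending to `results` would keep memory use flat.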
Upvotes: 1
Reputation: 521053
Using re.findall:
import re
with open('stackoverflow.txt') as f:
    content = f.read()  # re.findall needs a string, not the list returned by readlines()
matches = re.findall(r'\b\d{2}:\d{2}:\d{2} (\S+) "(.*?)"', content)
print(matches)
For the data you gave above, matches would contain:
[('type1', 'xyz'), ('type2', 'xyza'), ('type2', 'aaaa'), ('type3', 'asdg'), ('type4', 'dhj')]
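To get from that list of tuples to the line-per-entry format the question asks for, something along these lines should work (a small sketch; `matches` is hard-coded here to the result shown above, and `lines`, `kind`, and `name` are illustrative names):

```python
# The list re.findall returned for the sample data.
matches = [('type1', 'xyz'), ('type2', 'xyza'), ('type2', 'aaaa'),
           ('type3', 'asdg'), ('type4', 'dhj')]

# Join each (type, name) tuple with a single space.
lines = [f'{kind} {name}' for kind, name in matches]
print('\n'.join(lines))
```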
Upvotes: 1