Warok

Reputation: 403

Concatenation of two regexes

I have the following exported text file:

14:00:01 type1 "xyz" has no relationships... ಠ_ಠ
14:00:01 type2 "xyza" has no relationships... ಠ_ಠ
14:00:01 type2 "aaaa" has no relationships... ಠ_ಠ
14:00:01 type3 "asdg" has no relationships... ಠ_ಠ
14:00:01 type4 "dhj" has no relationships... ಠ_ಠ

I'm trying to retrieve two pieces of information from each line of this file:

  1. The type (in this case, the element after the time and before the double-quoted string)
  2. What is inside the double quotes

Expected output:

type1 xyz
type2 xyza
type2 aaaa
type3 asdg
type4 dhj

With my current code, I can get the content inside the double quotes, but I don't know how to capture the type as well and combine it with my regex:

import os, yaml
import argparse
import re
with open('stackoverflow.txt') as f:
    content = f.readlines()
    matches=re.findall(r'\"(.+?)\"',str(content))#get the content within the double quote
for x in matches:
    print(x)

Current output:

xyz
xyza
aaaa
asdg
dhj

Upvotes: 2

Views: 60

Answers (4)

Wiktor Stribiżew

Reputation: 626747

You can use the pattern (\w+) "([^"]*)" to get a match on each line. The first group captures the word before the quotes and the second captures the text inside them, so you can format the output as needed from the two groups of each match:

import re
matches = []
rx = re.compile(r'(\w+) "([^"]*)"')
with open('stackoverflow.txt') as f:
    for line in f:
        m = rx.search(line)
        if m:
            matches.append(f'{m.group(1)} {m.group(2)}')

See the Python demo:

import re
file=r'''14:00:01 type1 "xyz" has no relationships... ಠ_ಠ
14:00:01 type2 "xyza" has no relationships... ಠ_ಠ
14:00:01 type2 "aaaa" has no relationships... ಠ_ಠ
14:00:01 type3 "asdg" has no relationships... ಠ_ಠ
14:00:01 type4 "dhj" has no relationships... ಠ_ಠ'''
matches = []
rx = re.compile(r'(\w+) "([^"]*)"')
for line in file.splitlines():
    m = rx.search(line)
    if m:
        matches.append(f'{m.group(1)} {m.group(2)}')
            
print(matches)
# => ['type1 xyz', 'type2 xyza', 'type2 aaaa', 'type3 asdg', 'type4 dhj']

Upvotes: 1

Martí

Reputation: 721

If your txt files will ALWAYS have that structure, I would simply do:

with open('stackoverflow.txt') as f:
    matches = [' '.join(line.split(' ')[1:3]) for line in f.readlines()]

for x in matches:
    print(x)

Output:

type1 "xyz"
type2 "xyza"
type2 "aaaa"
type3 "asdg"
type4 "dhj"
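
The output above keeps the double quotes, whereas the question asks for them to be removed. A minimal sketch that strips them, still assuming the time, the type, and a quoted value without spaces are always the first three space-separated fields. The sample text is inlined here in place of stackoverflow.txt, and type_and_value is a hypothetical helper name:

```python
# Inlined sample standing in for the contents of stackoverflow.txt.
sample = '''14:00:01 type1 "xyz" has no relationships...
14:00:01 type2 "xyza" has no relationships...'''

def type_and_value(line):
    # Fields are: time, type, quoted value, then the rest of the message.
    _, kind, quoted = line.split(' ')[:3]
    # Strip the surrounding double quotes from the third field.
    return f'{kind} {quoted.strip(chr(34))}'

results = [type_and_value(line) for line in sample.splitlines()]
for r in results:
    print(r)
```

Note that this would break if a quoted value ever contained a space, which is one reason the regex-based answers may be more robust.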

Upvotes: 2

SuperStormer

Reputation: 5387

Just add a second group to the regex:

import re
with open('stackoverflow.txt') as f:
    content = f.read()  # re.findall needs a string, not the list returned by readlines()
    matches = re.findall(r' (\S+) "(.+?)"', content)
for x in matches:
    print(x[0], x[1])  # x is a (type, quoted text) tuple

Note: consider using re.finditer and iterating directly over f instead of re.findall and f.readlines, as they use iterators instead of lists.
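
A rough sketch of what that note suggests, assuming the same pattern as above. io.StringIO stands in for the opened file so the example is self-contained, and iter_pairs is a hypothetical helper name:

```python
import io
import re

# Stand-in for open('stackoverflow.txt'); a real file object iterates line by line the same way.
f = io.StringIO(
    '14:00:01 type1 "xyz" has no relationships...\n'
    '14:00:01 type2 "xyza" has no relationships...\n'
)

rx = re.compile(r' (\S+) "(.+?)"')

def iter_pairs(lines):
    # Yield (type, value) tuples lazily, one line at a time, without building a list.
    for line in lines:
        for m in rx.finditer(line):
            yield m.group(1), m.group(2)

for kind, value in iter_pairs(f):
    print(kind, value)
```

For a file of this size the difference is negligible; the lazy version mainly helps when the export is too large to hold in memory comfortably.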

Upvotes: 1

Tim Biegeleisen

Reputation: 521053

Using re.findall:

import re

with open('stackoverflow.txt') as f:
    content = f.read()  # read the whole file as one string; findall cannot take a list
    matches = re.findall(r'\b\d{2}:\d{2}:\d{2} (\S+) "(.*?)"', content)
    print(matches)

For the data you gave above, matches would contain:

[('type1', 'xyz'), ('type2', 'xyza'), ('type2', 'aaaa'), ('type3', 'asdg'), ('type4', 'dhj')]
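
To turn these tuples into the "type value" lines the question asks for, one option is to join each pair with a space. A small sketch, with the list of matches copied in literally rather than read from the file:

```python
# The tuples re.findall produced for the sample data.
matches = [('type1', 'xyz'), ('type2', 'xyza'), ('type2', 'aaaa'),
           ('type3', 'asdg'), ('type4', 'dhj')]

# Each tuple is (type, quoted text); join the two parts with a single space.
lines = [' '.join(pair) for pair in matches]
print('\n'.join(lines))
```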

Upvotes: 1
