Reputation: 1189
Suppose I have a big text file in the following form
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]

[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
There are 8 keys in total, which will always appear in the order "Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote", and which I know beforehand. As you can see, some profiles do not have the full set of variables. The only thing you can be sure will exist is the name.
I want to create a pandas dataframe with each observation as a row and each column as a key. In the case of James, since he does not have entries for "School" and "Siblings", I would like those cells to be the numpy nan object.
My attempt is using something like (?:\[Surname: \"()\"\])
for every variable. But even for the single case of Surname I run into problems: if the surname does not exist, I get no placeholder back, just an empty list.
Update:
As an example, I would like the return for Monica's profile to be ('','Monica','','33','','','','I am looking forward to christmas')
Upvotes: 3
Views: 160
Reputation: 2407
Create a list of (key,value) tuples for each info block with re.findall(), and put them in separate dictionaries:
import re

text = """[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]"""
keys = ['Surname', 'Name', 'Age', 'Weight', 'Height', 'Quote']
rslt = [{}]
for k, v in re.findall(r"(?m)(?:^\s*\[(\w+):\s*\"\s*([^\]\"]+)\"\s*\])+", text):
    d = rslt[-1]
    # start a new dict when a Surname opens a new block,
    # or when a second Name shows up
    if (k == "Surname" and d) or (k == "Name" and "Name" in d):
        d = {}
        rslt.append(d)
    d[k] = v

for d in rslt:
    print([d.get(k, '') for k in keys])
Out:
['Gordon', 'James', '13', '46', '12', 'I want to be a pilot']
['', 'Monica', '', '33', '', 'I am looking forward to christmas']
Upvotes: 0
Reputation: 51683
You can rewrite your data file. The code parses your original file into instances of a small helper class D, then uses csv.DictWriter to write it into a normal-style CSV that pandas can read:
Create demo file:
fn = "t.txt"
with open(fn, "w") as f:
    f.write("""
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
""")
Intermediate class:
class D:
    fields = ["Surname", "Name", "Age", "Weight", "Height", "Quote"]

    def __init__(self, textlines):
        t = [(k.strip(), v.strip())
             for k, v in (x.strip().split(":", 1) for x in textlines)]
        self.data = {k: "" for k in D.fields}
        self.data.update(t)

    def surname(self): return self.data["Surname"]
    def name(self): return self.data["Name"]
    def age(self): return self.data["Age"]
    def weight(self): return self.data["Weight"]
    def height(self): return self.data["Height"]
    def quote(self): return self.data["Quote"]

    def get_data(self):
        return self.data
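Note that split(":", 1) keeps the surrounding double quotes in the values, which is why the CSV output further down shows tripled quotes. A trimmed, self-contained check of that behaviour (accessor methods omitted):

```python
class D:
    fields = ["Surname", "Name", "Age", "Weight", "Height", "Quote"]

    def __init__(self, textlines):
        # split each 'Key: value' line once and strip whitespace only,
        # so the quotes around the value survive
        t = [(k.strip(), v.strip())
             for k, v in (x.strip().split(":", 1) for x in textlines)]
        self.data = {k: "" for k in D.fields}
        self.data.update(t)

d = D(['Surname: "Gordon"', 'Name: "James"'])
print(d.data["Surname"])  # the quotes are still part of the value
```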
Parsing and rewriting:
fn = "t.txt"

# list of all collected D instances
data = []

with open(fn) as f:
    # each dataset collects all lines belonging to one "person"
    dataset = []
    for line in f:
        clean = line.strip().strip("[]")
        if clean and (clean.startswith("Surname") or clean.startswith("Name")):
            # a Surname or Name line starts the next person once a Name
            # has already been collected for the current one
            if any(e.startswith("Name") for e in dataset):
                data.append(D(dataset))
                dataset = []
            dataset.append(clean)
        elif clean:
            dataset.append(clean)

# flush the last person
if dataset:
    data.append(D(dataset))
import csv

with open("other.txt", "w", newline="") as f:
    dw = csv.DictWriter(f, fieldnames=D.fields)
    dw.writeheader()
    for entry in data:
        dw.writerow(entry.get_data())
Check what was written:
with open("other.txt", "r") as f:
    print(f.read())
Output:
Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
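Reading that CSV back with pandas then gives NaN for the empty cells; the leftover quotes around the values can be stripped afterwards. A sketch, with the file contents inlined so it runs standalone:

```python
import io
import pandas as pd

# contents of other.txt as shown above
csv_text = '''Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
'''

df = pd.read_csv(io.StringIO(csv_text))
# every stored value still carries its original double quotes; remove them
df = df.apply(lambda col: col.str.strip('"'))
print(df)
```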
Upvotes: 0
Reputation: 61920
Building on @WiktorStribiżew's comment, you could use groupby (from itertools) to split the lines into runs of empty lines and runs of data lines, for instance like this:
import re
from itertools import groupby
text = '''[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]

[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]

[Name: "John"]
[Height: "33"]
[Quote: "I am looking forward to christmas"]

[Surname: "Gordon"]
[Name: "James"]
[Height: "44"]
[Quote: "I am looking forward to christmas"]'''
patterns = [re.compile(r'\[Surname: "(?P<surname>\w+?)"\]'),
            re.compile(r'\[Name: "(?P<name>\w+?)"\]'),
            re.compile(r'\[Age: "(?P<age>\d+?)"\]'),
            re.compile(r'\[Weight: "(?P<weight>\d+?)"\]'),
            re.compile(r'\[Height: "(?P<height>\d+?)"\]'),
            re.compile(r'\[Quote: "(?P<quote>.+?)"\]')]
records = []

for non_empty, group in groupby(text.splitlines(), key=lambda l: bool(l.strip())):
    if non_empty:
        lines = list(group)
        record = {}
        for line in lines:
            for pattern in patterns:
                match = pattern.search(line)
                if match:
                    record.update(match.groupdict())
                    break
        records.append(record)

for record in records:
    print(record)
Output
{'weight': '46', 'quote': 'I want to be a pilot', 'age': '13', 'name': 'James', 'height': '12', 'surname': 'Gordon'}
{'weight': '33', 'quote': 'I am looking forward to christmas', 'name': 'Monica'}
{'height': '33', 'quote': 'I am looking forward to christmas', 'name': 'John'}
{'height': '44', 'surname': 'Gordon', 'quote': 'I am looking forward to christmas', 'name': 'James'}
Note: This creates a dictionary where the keys are the field names and the values are the field values. This format does not match your intended output, but I believe it is more complete than what you requested. In any case, you can easily convert from this format into the desired tuple format.
Explanation
The groupby function from itertools groups the input into contiguous runs of empty lines and record lines, so you only need to process the groups that are not empty. The processing is simple: for each line, try each pattern in turn; on the first match, update the record dictionary with the matched named groups and break, assuming each line matches at most one pattern.
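To get the tuple format from the question's update, each record dict can be mapped over the known field order with dict.get (a sketch; the lower-case field names follow the named groups above, with school/siblings slots added to reach the question's 8 keys):

```python
# one record as produced by the loop above
record = {'weight': '33', 'name': 'Monica',
          'quote': 'I am looking forward to christmas'}

# field order from the question, lower-cased to match the named groups
fields = ['surname', 'name', 'age', 'weight', 'height',
          'school', 'siblings', 'quote']

row = tuple(record.get(f, '') for f in fields)
print(row)
# ('', 'Monica', '', '33', '', '', '', 'I am looking forward to christmas')
```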
Upvotes: 0
Reputation: 71461
You can parse the file data, group the results, and pass to a dataframe:
import re
import pandas as pd
def group_results(d):
    _group = [d[0]]
    for a, b in d[1:]:
        if a == 'Name' and not any(c == 'Name' for c, _ in _group):
            _group.append([a, b])
        elif a == 'Surname' and any(c == 'Name' for c, _ in _group):
            yield _group
            _group = [[a, b]]
        else:
            if a == 'Name':
                yield _group
                _group = [[a, b]]
            else:
                _group.append([a, b])
    yield _group

headers = ["Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote"]
data = list(filter(None, [i.strip('\n') for i in open('filename.txt')]))
parsed = [(lambda x: [x[0], x[-1][1:-1]])(re.findall(r'(?<=^\[)\w+|".*?"(?=\]$)', i))
          for i in data]
_grouped = list(map(dict, group_results(parsed)))
result = pd.DataFrame([[c.get(i, "") for i in headers] for c in _grouped],
                      columns=headers)
Output:
  Surname    Name  ...  Siblings                               Quote
0  Gordon   James  ...                         I want to be a pilot
1          Monica  ...            I am looking forward to christmas

[2 rows x 8 columns]
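The question asked for numpy nan rather than empty strings in the missing cells; a follow-up replace handles that (sketched here on a small stand-in for the result frame built above):

```python
import numpy as np
import pandas as pd

headers = ["Surname", "Name", "Age", "Weight", "Height",
           "School", "Siblings", "Quote"]
# stand-in for the `result` DataFrame built above
result = pd.DataFrame(
    [["Gordon", "James", "13", "46", "12", "", "", "I want to be a pilot"],
     ["", "Monica", "", "33", "", "", "", "I am looking forward to christmas"]],
    columns=headers)

# swap the empty-string placeholders for proper NaN cells
result = result.replace("", np.nan)
print(result[["Surname", "School"]])
```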
Upvotes: 1