Reputation: 1189
Suppose I have a big text file in the following form
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]

[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
There are 8 keys in total, which will always appear in the order "Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote", and which I know beforehand. As you can see, some profiles do not have the full set of variables. The only thing you can be sure will exist is the name.
I want to create a pandas dataframe with each observation as a row and each column as a key. In the case of James, since he does not have entries for "School" and "Siblings", I would like those cells to be the numpy nan object.
My attempt is using something like (?:\[Surname: \"()\"\])
for every variable. But even for the single case of Surname I run into problems: if the surname does not exist, I get no placeholder back, just an empty list.
Update:
As an example, I would like the return for Monica's profile to be ('','Monica','','33','','','','I am looking forward to christmas')
Upvotes: 3
Views: 160
Reputation: 2407
Create a list of (key,value) tuples for each info block with re.findall(), and put them in separate dictionaries:
import re

text = """[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]"""
keys = ['Surname', 'Name', 'Age', 'Weight', 'Height', 'Quote']
rslt = [{}]
for k, v in re.findall(r"(?m)(?:^\s*\[(\w+):\s*\"\s*([^\]\"]+)\"\s*\])+", text):
    d = rslt[-1]
    # start a new dict when a Surname opens a new block,
    # or when a second Name shows up
    if (k == "Surname" and d) or (k == "Name" and "Name" in d):
        d = {}
        rslt.append(d)
    d[k] = v

for d in rslt:
    print([d.get(k, '') for k in keys])
Out:
['Gordon', 'James', '13', '46', '12', 'I want to be a pilot']
['', 'Monica', '', '33', '', 'I am looking forward to christmas']
Upvotes: 0
Reputation: 51683
You can rewrite your data file. The code parses your original file into instances of a small helper class D, then uses csv.DictWriter to write it into a normal-style CSV that pandas can read:
Create demo file:
fn = "t.txt"
with open(fn, "w") as f:
    f.write("""
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
""")
Intermediate class:
class D:
    fields = ["Surname", "Name", "Age", "Weight", "Height", "Quote"]

    def __init__(self, textlines):
        t = [(k.strip(), v.strip())
             for k, v in (x.strip().split(":", 1) for x in textlines)]
        self.data = {k: "" for k in D.fields}
        self.data.update(t)

    def surname(self): return self.data["Surname"]
    def name(self): return self.data["Name"]
    def age(self): return self.data["Age"]
    def weight(self): return self.data["Weight"]
    def height(self): return self.data["Height"]
    def quote(self): return self.data["Quote"]

    def get_data(self):
        return self.data
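Note that split(":", 1) keeps the surrounding double quotes in the values, which is why the CSV output further down shows tripled quotes. A trimmed, self-contained check of that behaviour (accessor methods omitted):

```python
class D:
    fields = ["Surname", "Name", "Age", "Weight", "Height", "Quote"]

    def __init__(self, textlines):
        # split each 'Key: value' line once and strip whitespace only,
        # so the quotes around the value survive
        t = [(k.strip(), v.strip())
             for k, v in (x.strip().split(":", 1) for x in textlines)]
        self.data = {k: "" for k in D.fields}
        self.data.update(t)

d = D(['Surname: "Gordon"', 'Name: "James"'])
print(d.data["Surname"])  # the quotes are still part of the value
```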
Parsing and rewriting:
fn = "t.txt"

# list of all collected D instances
data = []

with open(fn) as f:
    # each dataset collects all lines belonging to one "person"
    dataset = []
    for line in f:
        clean = line.strip().strip("[]")
        if clean and (clean.startswith("Surname") or clean.startswith("Name")):
            # a Surname or Name line starts the next person once a Name
            # has already been collected for the current one
            if any(e.startswith("Name") for e in dataset):
                data.append(D(dataset))
                dataset = []
            dataset.append(clean)
        elif clean:
            dataset.append(clean)

# flush the last person
if dataset:
    data.append(D(dataset))
import csv

with open("other.txt", "w", newline="") as f:
    dw = csv.DictWriter(f, fieldnames=D.fields)
    dw.writeheader()
    for entry in data:
        dw.writerow(entry.get_data())
Check what was written:
with open("other.txt", "r") as f:
    print(f.read())
Output:
Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
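Reading that CSV back with pandas then gives NaN for the empty cells; the leftover quotes around the values can be stripped afterwards. A sketch, with the file contents inlined so it runs standalone:

```python
import io
import pandas as pd

# contents of other.txt as shown above
csv_text = '''Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
'''

df = pd.read_csv(io.StringIO(csv_text))
# every stored value still carries its original double quotes; remove them
df = df.apply(lambda col: col.str.strip('"'))
print(df)
```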
Upvotes: 0
Reputation: 61920
Building on @WiktorStribiżew's comment, you could use groupby (from itertools) to split the lines into runs of empty lines and runs of data lines, for instance like this:
import re
from itertools import groupby
text = '''[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]

[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]

[Name: "John"]
[Height: "33"]
[Quote: "I am looking forward to christmas"]

[Surname: "Gordon"]
[Name: "James"]
[Height: "44"]
[Quote: "I am looking forward to christmas"]'''
patterns = [re.compile(r'\[Surname: "(?P<surname>\w+?)"\]'),
            re.compile(r'\[Name: "(?P<name>\w+?)"\]'),
            re.compile(r'\[Age: "(?P<age>\d+?)"\]'),
            re.compile(r'\[Weight: "(?P<weight>\d+?)"\]'),
            re.compile(r'\[Height: "(?P<height>\d+?)"\]'),
            re.compile(r'\[Quote: "(?P<quote>.+?)"\]')]
records = []

for non_empty, group in groupby(text.splitlines(), key=lambda l: bool(l.strip())):
    if non_empty:
        lines = list(group)
        record = {}
        for line in lines:
            for pattern in patterns:
                match = pattern.search(line)
                if match:
                    record.update(match.groupdict())
                    break
        records.append(record)

for record in records:
    print(record)
Output
{'weight': '46', 'quote': 'I want to be a pilot', 'age': '13', 'name': 'James', 'height': '12', 'surname': 'Gordon'}
{'weight': '33', 'quote': 'I am looking forward to christmas', 'name': 'Monica'}
{'height': '33', 'quote': 'I am looking forward to christmas', 'name': 'John'}
{'height': '44', 'surname': 'Gordon', 'quote': 'I am looking forward to christmas', 'name': 'James'}
Note: This creates a dictionary where the keys are the field names and the values are the field values. This format does not match your intended output, but I believe it is more complete than what you requested. In any case, you can easily convert from this format into the desired tuple format.
Explanation
The groupby function from itertools groups the input into contiguous runs of empty lines and record lines, so you only need to process the groups that are not empty. The processing is simple: for each line, try each pattern in turn; on the first match, update the record dictionary with the matched named groups and break, assuming each line matches at most one pattern.
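To get the tuple format from the question's update, each record dict can be mapped over the known field order with dict.get (a sketch; the lower-case field names follow the named groups above, with school/siblings slots added to reach the question's 8 keys):

```python
# one record as produced by the loop above
record = {'weight': '33', 'name': 'Monica',
          'quote': 'I am looking forward to christmas'}

# field order from the question, lower-cased to match the named groups
fields = ['surname', 'name', 'age', 'weight', 'height',
          'school', 'siblings', 'quote']

row = tuple(record.get(f, '') for f in fields)
print(row)
# ('', 'Monica', '', '33', '', '', '', 'I am looking forward to christmas')
```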
Upvotes: 0
Reputation: 71461
You can parse the file data, group the results, and pass to a dataframe:
import re
import pandas as pd
def group_results(d):
    _group = [d[0]]
    for a, b in d[1:]:
        if a == 'Name' and not any(c == 'Name' for c, _ in _group):
            _group.append([a, b])
        elif a == 'Surname' and any(c == 'Name' for c, _ in _group):
            yield _group
            _group = [[a, b]]
        else:
            if a == 'Name':
                yield _group
                _group = [[a, b]]
            else:
                _group.append([a, b])
    yield _group

headers = ["Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote"]
data = list(filter(None, [i.strip('\n') for i in open('filename.txt')]))
parsed = [(lambda x: [x[0], x[-1][1:-1]])(re.findall(r'(?<=^\[)\w+|".*?"(?=\]$)', i))
          for i in data]
_grouped = list(map(dict, group_results(parsed)))
result = pd.DataFrame([[c.get(i, "") for i in headers] for c in _grouped],
                      columns=headers)
Output:
  Surname    Name  ...  Siblings                               Quote
0  Gordon   James  ...                         I want to be a pilot
1          Monica  ...            I am looking forward to christmas

[2 rows x 8 columns]
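The question asked for numpy nan rather than empty strings in the missing cells; a follow-up replace handles that (sketched here on a small stand-in for the result frame built above):

```python
import numpy as np
import pandas as pd

headers = ["Surname", "Name", "Age", "Weight", "Height",
           "School", "Siblings", "Quote"]
# stand-in for the `result` DataFrame built above
result = pd.DataFrame(
    [["Gordon", "James", "13", "46", "12", "", "", "I want to be a pilot"],
     ["", "Monica", "", "33", "", "", "", "I am looking forward to christmas"]],
    columns=headers)

# swap the empty-string placeholders for proper NaN cells
result = result.replace("", np.nan)
print(result[["Surname", "School"]])
```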
Upvotes: 1