Reputation: 401
I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text which starts with two relevant pieces of information: the Number and the Register variables.
Then, the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the name of the person, the birth date, the address and some additional rows that start with a lowercase letter. Each group contains this information, and the pattern is always the same: the first row of a group is defined by a number (hereafter id), followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number | Register | City | Id | Share | Name | Born | c | f | h | i |
---|---|---|---|---|---|---|---|---|---|---|
01600 | 4314 | London | 1 | 73/1284 | John Smith | 1960-01-01 | NaN | 4222/2001 | 1334/2000 | 5774/2000 |
01600 | 4314 | London | 4 | 58/1284 | Boris Morgan | 1965-01-01 | 4222/1988 | 4222/2000 | NaN | NaN |
My initial approach was to first import the text file and apply a regular expression for each case:
import pandas as pd
import re
df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
    print(line)
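I also thought about pulling the header values with more regular expressions, along these lines (just a sketch; it assumes the city is always a single word, which may not hold):
# continuing from the snippet above, where df holds the raw file contents
number, city, register = re.search(r'Number (\d+) (\w+) Register (\d+)', df).groups()  # assumes a one-word city
shares = re.findall(r'(\d+) SHARE: (\S+)', df)  # list of (id, share) tuples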
But there is probably a better way to do it.
Any help is highly appreciated. Thanks in advance.
Upvotes: 0
Views: 453
Reputation: 9619
This can be done without regex, using list comprehensions and string splitting:
import pandas as pd
text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''
text = [i.strip() for i in text.splitlines()] # create a list of lines
data = []
# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]
# create a list of the index numbers of the lines where new items start
indices = [idx for idx, line in enumerate(text) if 'SHARE' in line]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]
for i in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1],
         'Name': i[1], 'Born': i[2].split()[1]}
    # split the remaining lines of the group into tokens
    rows = [s.split() for s in i[3:]]
    merged_rows = []
    for row in rows:
        # a new entry starts with a single letter; anything else is a
        # continuation of the previous value and gets appended to it
        if len(row[0]) == 1 and row[0].isalpha():
            merged_rows.append(row)
        else:
            merged_rows[-1][-1] = merged_rows[-1][-1] + row[0]
    d.update({name: value for name, value in merged_rows})
    data.append(d)
#load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
  | Number | Register | City | Id | Share | Name | Born | f | h | i | c |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01600 | 4314 | London | 1 | 73/1284 | John Smith | 1960-01-01 | 4222/2001 | 1334/2000 | 5774/2000 | nan |
1 | 01600 | 4314 | London | 4 | 58/1284 | Boris Morgan | 1965-01-01 | 4222/2000 | nan | nan | 4222/1988 |
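If the data sits in a file like the Test.txt from your question rather than a hard-coded string, you can read it into text first (this assumes the file is UTF-8 encoded and small enough to read in one go), and you may also want to parse the Born column into real dates:
with open('Test.txt', encoding='utf-8') as f:
    text = f.read()

# after building the dataframe as above
df['Born'] = pd.to_datetime(df['Born'])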
Upvotes: 2