Reputation: 401
I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text which starts with two relevant pieces of information: the Number and the Register variables.
Then, the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the name of the person, the birth date, the address and some additional rows that start with a lowercase letter. Each group contains this information, and the pattern is always the same: the first row of a group is defined by a number (hereafter id), followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number | Register | City | Id | Share | Name | Born | c | f | h | i |
---|---|---|---|---|---|---|---|---|---|---|
01600 | 4314 | London | 1 | 73/1284 | John Smith | 1960-01-01 | NaN | 4222/2001 | 1334/2000 | 5774/2000 |
01600 | 4314 | London | 4 | 58/1284 | Boris Morgan | 1965-01-01 | 4222/1988 | 4222/2000 | NaN | NaN |
My initial approach was to first import the text file and apply a regular expression for each case:
import pandas as pd
import re
df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
    print(line)
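I also thought about pulling the header values with more regular expressions, along these lines (just a sketch; it assumes the city is always a single word, which may not hold):
# continuing from the snippet above, where df holds the raw file contents
number, city, register = re.search(r'Number (\d+) (\w+) Register (\d+)', df).groups()  # assumes a one-word city
shares = re.findall(r'(\d+) SHARE: (\S+)', df)  # list of (id, share) tuples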
But there is probably a better way to do it.
Any help is highly appreciated. Thanks in advance.
Upvotes: 0
Views: 453
Reputation: 9619
This can be done without regex, using list comprehensions and string splitting:
import pandas as pd
text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''
text = [i.strip() for i in text.splitlines()] # create a list of lines
data = []
# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]
# create a list of the index numbers of the lines where new items start
indices = [idx for idx, line in enumerate(text) if 'SHARE' in line]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]
for i in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1],
         'Name': i[1], 'Born': i[2].split()[1]}
    # split the remaining lines of the group into tokens
    rows = [s.split() for s in i[3:]]
    merged_rows = []
    for row in rows:
        # a new entry starts with a single letter; anything else is a
        # continuation of the previous value and gets appended to it
        if len(row[0]) == 1 and row[0].isalpha():
            merged_rows.append(row)
        else:
            merged_rows[-1][-1] = merged_rows[-1][-1] + row[0]
    d.update({name: value for name, value in merged_rows})
    data.append(d)
#load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
  | Number | Register | City | Id | Share | Name | Born | f | h | i | c |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01600 | 4314 | London | 1 | 73/1284 | John Smith | 1960-01-01 | 4222/2001 | 1334/2000 | 5774/2000 | nan |
1 | 01600 | 4314 | London | 4 | 58/1284 | Boris Morgan | 1965-01-01 | 4222/2000 | nan | nan | 4222/1988 |
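If the data sits in a file like the Test.txt from your question rather than a hard-coded string, you can read it into text first (this assumes the file is UTF-8 encoded and small enough to read in one go), and you may also want to parse the Born column into real dates:
with open('Test.txt', encoding='utf-8') as f:
    text = f.read()

# after building the dataframe as above
df['Born'] = pd.to_datetime(df['Born'])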
Upvotes: 2