roboes
roboes

Reputation: 401

Python Text File to Data Frame with Specific Pattern

I am trying to convert a bunch of text files into a data frame using Pandas.

Each text file contains simple text which starts with two relevant information: the Number and the Register variables.

Then, the text files have some random text we should not be taken into consideration.

Last, the text files contains information such as the share number, the name of the person, birth date, address and some additional rows that start with a lowercase letter. Each group contains such information, and the pattern is always the same: the first row for the group is defined by a number (hereby id), followed by the "SHARE" word.

Here is an example:

Number 01600 London                           Register  4314

Some random text...

 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN: 1965-01-01 ADR: Streetname 4   2000
   c 4222/1988
   f 4222/2000

I need to transform the text into a data frame with the following output, where each group is stored in one row:

Number Register City Id Share Name Born c f h i
01600 4314 London 1 73/1284 John Smith 1960-01-01 NaN 4222/2001 1334/2000 5774/2000
01600 4314 London 4 58/1284 Boris Morgan 1965-01-01 4222/1988 4222/2000 NaN NaN

My initial approach was to first import the text file and apply regular expression for each case:

import pandas as pd
import re

df = open(r'Test.txt', 'r').read()

for line in re.findall('SHARE.*', df):
   print(line)

But probably there is a better way to do it.

Any help is highly appreciated. Thanks in advance.

Upvotes: 0

Views: 453

Answers (1)

RJ Adriaansen
RJ Adriaansen

Reputation: 9619

This can be done without regex with list comprehension and splitting strings:

import pandas as pd

text = '''Number 01600 London                           Register  4314

Some random text...

 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN: 1965-01-01 ADR: Streetname 4   2000
   c 4222/1988
   f 4222/2000'''

text = [i.strip() for i in text.splitlines()] # create a list of lines

data = []

# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]

for i in items:
    d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], 'Born': i[2].split()[1], }
    items = list(s.split() for s in i[3:])
    merged_items = []

    for i in items:
        if len(i[0]) == 1 and i[0].isalpha():
            merged_items.append(i)
        else:
            merged_items[-1][-1] = merged_items[-1][-1] + i[0]
    d.update({name: value for name,value in merged_items})
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)

Output:

Number Register City Id Share Name Born f h i c
0 01600 4314 London 1 73/1284 John Smith 1960-01-01 4222/2001 1334/2000 5774/2000 nan
1 01600 4314 London 4 58/1284 Boris Morgan 1965-01-01 4222/2000 nan nan 4222/1988

Upvotes: 2

Related Questions