Import .txt to Pandas Dataframe With Multiple Delimiters

Question

I would like to import a .txt file into a Pandas DataFrame. My .txt file:

Ann   Gosh  1234567892008-12-15Irvine                CA45678A9Z5Steve        Ryan      
Yosh   Dave    9876543212009-04-18St. Elf              NY12345P8G0Brad      Tuck     
Clair   Simon    3245674572008-12-29New Jersey             NJ56789R9B3Dan     John

The dataframe should look like this:

FirstN    LastN       SID        Birth        City     States    Postal    TeacherFirstN  TeacherLastN
   Ann     Gosh   123456789  2008-12-15     Irvine       CA        A9Z5           Steve           Ryan 
  Yosh     Dave   987654321  2009-04-18    St. Elf       NY        P8G0            Brad           Tuck
 Clair    Simon   324567457  2008-12-29   New Jersey     NJ        R9B3             Dan           John

I tried multiple ways including this:

df = pd.read_csv('student.txt', sep=r'\s+', engine='python', header=None, index_col=False)

to import the raw file into the dataframe, planning to then clean the data column by column, but it's too complicated. Could you please help me? (The Postal value here is just the 4 characters before TeacherFirstN.)

Bertrand Martel · Accepted Answer

You can start by setting names on your existing columns, and then apply regular expressions to the data while creating the new columns.

To fix the "single space delimiter" issue in your output, you can define "at least 2 whitespace characters", e.g. [\s]{2,}, as the delimiter. A single space then no longer splits a field, which fixes the issue for St. Elf in the City names.
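To see the effect of the delimiter choice in isolation, here is a minimal sketch that uses only the standard `re` module on one line copied from the question. The pattern `\s{2,}` is equivalent to `[\s]{2,}`; the variable names are only illustrative.

```python
import re

# One sample line copied from the question's file.
line = "Yosh   Dave    9876543212009-04-18St. Elf              NY12345P8G0Brad      Tuck     "

# Splitting on any run of whitespace also splits "St. Elf" on its single space.
fields_any = re.split(r"\s+", line.strip())

# Splitting only on runs of two or more whitespace characters keeps "St. Elf" together.
fields_two = re.split(r"\s{2,}", line.strip())

print(len(fields_any), fields_any)
print(len(fields_two), fields_two)
```

With `\s+` the line yields six fields, while `\s{2,}` yields the five fields that the column names in the answer expect.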

An example:

import pandas as pd 
import re

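df = pd.read_csv(
    'test.txt',
    sep=r'[\s]{2,}',  # split on runs of two or more whitespace characters
    engine='python',  # regex separators require the Python parsing engine
    header=None,
    index_col=False,
    names=[
        "FirstN", "LastN", "FULLSID", "TeacherData", "TeacherLastN"
    ]
)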
# FULLSID looks like "1234567892008-12-15Irvine": 9-digit SID, birth date, then city.
sid_pattern = re.compile(r'(\d{9})(\d+-\d+-\d+)(.*)', re.IGNORECASE)
df['SID'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(1), axis = 1)
df['Birth'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(2), axis = 1)
df['City'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(3), axis = 1)

# TeacherData looks like "CA45678A9Z5Steve": 2-character state, an alphanumeric run
# ending in a digit (its last 4 characters are the Postal value), then the teacher's first name.
teacherdata_pattern = re.compile(r'(.{2})([\dA-Z]+\d)(.*)', re.IGNORECASE)
df['States'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(1), axis = 1)
df['Postal'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(2)[-4:], axis = 1)
df['TeacherFirstN'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(3), axis = 1)

# Drop the intermediate columns now that they have been split.
del df['FULLSID']
del df['TeacherData']

print(df)

Output :

  FirstN  LastN TeacherLastN        SID       Birth        City States Postal TeacherFirstN
0    Ann   Gosh         Ryan  123456789  2008-12-15      Irvine     CA   A9Z5         Steve
1   Yosh   Dave         Tuck  987654321  2009-04-18     St. Elf     NY   P8G0          Brad
2  Clair  Simon         John  324567457  2008-12-29  New Jersey     NJ   R9B3           Dan
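As a possible variation on the same idea, the sketch below splits the two fused columns with `Series.str.extract` and named groups, so each pattern runs once per column and the group names become the column names. It is not the answer's code: `io.StringIO` stands in for the real file, and the teacher pattern assumes, based only on the three sample rows, that exactly five digits sit between the state code and the 4-character Postal value.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the real file.
raw = io.StringIO(
    "Ann   Gosh  1234567892008-12-15Irvine                CA45678A9Z5Steve        Ryan\n"
    "Yosh   Dave    9876543212009-04-18St. Elf              NY12345P8G0Brad      Tuck\n"
    "Clair   Simon    3245674572008-12-29New Jersey             NJ56789R9B3Dan     John\n"
)

df = pd.read_csv(
    raw,
    sep=r"\s{2,}",  # two or more whitespace characters
    engine="python",
    header=None,
    index_col=False,
    names=["FirstN", "LastN", "FULLSID", "TeacherData", "TeacherLastN"],
)

# Named groups become column names in the extracted DataFrame.
sid_parts = df["FULLSID"].str.extract(
    r"(?P<SID>\d{9})(?P<Birth>\d{4}-\d{2}-\d{2})(?P<City>.*)"
)

# Assumption: state (2 letters), five digits, 4-character Postal, then the first name.
teacher_parts = df["TeacherData"].str.extract(
    r"(?P<States>[A-Z]{2})\d{5}(?P<Postal>[A-Z0-9]{4})(?P<TeacherFirstN>.*)"
)

# Reassemble the columns in the order requested in the question.
result = pd.concat(
    [df[["FirstN", "LastN"]], sid_parts, teacher_parts, df[["TeacherLastN"]]],
    axis=1,
)
print(result)
```

Building the result with `pd.concat` also puts the columns in the order shown in the question, with TeacherLastN last. If the real file does not follow the five-digit assumption, the teacher pattern would need to be adjusted.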
