Hendra
Hendra

Reputation: 71

Python Create Pandas Dataframe From txt file

i have file.txt like this. i want create dataframe with pandas.

# NISN:- 1234567
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]

# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimcil
# E-mail:- [email protected]

how to convert dataframe to this?

NISN    Fullname    FirstName   LastName    School  E-mail
1234567 Joe Doe     Joe     Doe     Klima   [email protected]
8901234 Jenny Low   Jenny       Low     Kimcil  [email protected]

edited


i found sample bad line in file. how to handle this?
# NISN:- 123456

7
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]



# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimc

il
# E-mail:- [email protected]

Upvotes: 0

Views: 2512

Answers (2)

jezrael
jezrael

Reputation: 862641

If each group has exactly 6 rows you can use read_csv with name parameter for 2 columns, in separator \s* is regex for zero or more spaces after :-:

df = pd.read_csv(file, sep=":-\s*", names=['a', 'b'], engine='python')
print (df)
              a                b
0        # NISN          1234567
1    # FullName          Joe Doe
2   # FirstName              Joe
3    # LastName              Doe
4      # School            Klima
5      # E-mail    [email protected]
6        # NISN          8901234
7    # FullName        Jenny Low
8   # FirstName            Jenny
9    # LastName              Low
10     # School           Kimcil
11     # E-mail  [email protected]

Alternative for read file - used separator which not exist in data like | or ¥ and then use Series.str.split, but only by first separator by n=1:

df = pd.read_csv(file, sep="|", names=['data'])
print (df)
                          data
0             # NISN:- 1234567
1         # FullName:- Joe Doe
2            # FirstName:- Joe
3             # LastName:- Doe
4             # School:- Klima
5     # E-mail:- [email protected]
6             # NISN:- 8901234
7       # FullName:- Jenny Low
8          # FirstName:- Jenny
9             # LastName:- Low
10           # School:- Kimcil
11  # E-mail:- [email protected]

df = df.pop('data').str.split(':-\s', n=1, expand=True)
df.columns = ['a','b']
print (df)
              a                b
0        # NISN          1234567
1    # FullName          Joe Doe
2   # FirstName              Joe
3    # LastName              Doe
4      # School            Klima
5      # E-mail    [email protected]
6        # NISN          8901234
7    # FullName        Jenny Low
8   # FirstName            Jenny
9    # LastName              Low
10     # School           Kimcil
11     # E-mail  [email protected]

Then use Series.str.strip and reshape valus of column b by numpy.ndarray.reshape:

df['a'] = df['a'].str.strip('# ')
df = pd.DataFrame(df.b.to_numpy().reshape(-1, 6), columns = df.a.iloc[:6].rename(None))
print (df)
      NISN   FullName FirstName LastName  School           E-mail
0  1234567    Joe Doe       Joe      Doe   Klima    [email protected]
1  8901234  Jenny Low     Jenny      Low  Kimcil  [email protected]

If possible some values missing, but always NISN exist for each group use DataFrame.pivot with helper column for distinguish each group by compare a by first value NISN and Series.cumsum:

df['a'] = df['a'].str.strip('# ')
df['idx'] = df['a'].eq('NISN').cumsum()
df = df.pivot(index='idx', columns='a', values='b').reset_index(drop=True)
print (df)
a           E-mail FirstName   FullName LastName     NISN  School
0    [email protected]       Joe    Joe Doe      Doe  1234567   Klima
1  [email protected]     Jenny  Jenny Low      Low  8901234  Kimcil

Upvotes: 2

Cameron Riddell
Cameron Riddell

Reputation: 13407

You can iterate over the lines of your file in python and store the relevant data into a dictionary before converting it to a DataFrame

import pandas as pd
from collections import defaultdict

data = defaultdict(list)
with open("file.txt") as my_file:
    for line in my_file:
        line = line.strip("# \n")        # clean up whitespace and # for lines
        if not line:                     # skip empty lines
            continue

        name, value = line.split(":- ") 
        data[name].append(value)
    
df = pd.DataFrame.from_dict(data)

print(df)
      NISN   FullName FirstName LastName  School           E-mail
0  1234567    Joe Doe       Joe      Doe   Klima    [email protected]
1  8901234  Jenny Low     Jenny      Low  Kimcil  [email protected]

Upvotes: 1

Related Questions