Reputation: 71
I have a file.txt like this, and I want to create a DataFrame with pandas.
# NISN:- 1234567
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]
# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimcil
# E-mail:- [email protected]
How can I convert it into a DataFrame like this?
NISN Fullname FirstName LastName School E-mail
1234567 Joe Doe Joe Doe Klima [email protected]
8901234 Jenny Low Jenny Low Kimcil [email protected]
Edit:
# NISN:- 123456
7
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]
# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimc
il
# E-mail:- [email protected]
Upvotes: 0
Views: 2512
Reputation: 862641
If each group has exactly 6 rows, you can use read_csv with the names parameter to create 2 columns; in the separator, \s* is a regex for zero or more spaces after :-:
import pandas as pd

df = pd.read_csv(file, sep=r":-\s*", names=['a', 'b'], engine='python')
print (df)
a b
0 # NISN 1234567
1 # FullName Joe Doe
2 # FirstName Joe
3 # LastName Doe
4 # School Klima
5 # E-mail [email protected]
6 # NISN 8901234
7 # FullName Jenny Low
8 # FirstName Jenny
9 # LastName Low
10 # School Kimcil
11 # E-mail [email protected]
An alternative for reading the file: use a separator that does not exist in the data, like | or ¥, and then use Series.str.split, splitting only on the first separator with n=1:
df = pd.read_csv(file, sep="|", names=['data'])
print (df)
data
0 # NISN:- 1234567
1 # FullName:- Joe Doe
2 # FirstName:- Joe
3 # LastName:- Doe
4 # School:- Klima
5 # E-mail:- [email protected]
6 # NISN:- 8901234
7 # FullName:- Jenny Low
8 # FirstName:- Jenny
9 # LastName:- Low
10 # School:- Kimcil
11 # E-mail:- [email protected]
df = df.pop('data').str.split(r':-\s', n=1, expand=True)
df.columns = ['a','b']
print (df)
a b
0 # NISN 1234567
1 # FullName Joe Doe
2 # FirstName Joe
3 # LastName Doe
4 # School Klima
5 # E-mail [email protected]
6 # NISN 8901234
7 # FullName Jenny Low
8 # FirstName Jenny
9 # LastName Low
10 # School Kimcil
11 # E-mail [email protected]
Then use Series.str.strip and reshape the values of column b with numpy.ndarray.reshape:
df['a'] = df['a'].str.strip('# ')
df = pd.DataFrame(df.b.to_numpy().reshape(-1, 6), columns = df.a.iloc[:6].rename(None))
print (df)
NISN FullName FirstName LastName School E-mail
0 1234567 Joe Doe Joe Doe Klima [email protected]
1 8901234 Jenny Low Jenny Low Kimcil [email protected]
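As a self-contained check of the reshape approach above, the same steps can be run against an in-memory copy of the sample data, with io.StringIO standing in for file.txt (the redacted emails are kept as they appear in the question):

```python
import io
import pandas as pd

# Sample data copied from the question, standing in for file.txt
text = """# NISN:- 1234567
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# School:- Klima
# E-mail:- [email protected]
# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low
# School:- Kimcil
# E-mail:- [email protected]"""

# Split each line on ':-' plus optional spaces into two columns
df = pd.read_csv(io.StringIO(text), sep=r":-\s*", names=['a', 'b'], engine='python')
# Remove the leading '# ' from the field names
df['a'] = df['a'].str.strip('# ')
# Reshape the 12 values into 2 rows of 6, using the first group's names as columns
df = pd.DataFrame(df['b'].to_numpy().reshape(-1, 6),
                  columns=df['a'].iloc[:6].rename(None))
print(df)
```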
If some values may be missing, but NISN always exists for each group, use DataFrame.pivot with a helper column that distinguishes the groups: compare a to the first value NISN and take Series.cumsum:
df['a'] = df['a'].str.strip('# ')
df['idx'] = df['a'].eq('NISN').cumsum()
df = df.pivot(index='idx', columns='a', values='b').reset_index(drop=True)
print (df)
a E-mail FirstName FullName LastName NISN School
0 [email protected] Joe Joe Doe Doe 1234567 Klima
1 [email protected] Jenny Jenny Low Low 8901234 Kimcil
Upvotes: 2
Reputation: 13407
You can iterate over the lines of your file in Python and store the relevant data in a dictionary before converting it to a DataFrame:
import pandas as pd
from collections import defaultdict
data = defaultdict(list)
with open("file.txt") as my_file:
    for line in my_file:
        line = line.strip("# \n")  # remove the leading "# " and the trailing newline
        if not line:  # skip empty lines
            continue
        name, value = line.split(":- ", 1)  # split only on the first ":- "
        data[name].append(value)
df = pd.DataFrame.from_dict(data)
print(df)
NISN FullName FirstName LastName School E-mail
0 1234567 Joe Doe Joe Doe Klima [email protected]
1 8901234 Jenny Low Jenny Low Kimcil [email protected]
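For a quick test without creating file.txt, the same loop can run over an in-memory buffer; io.StringIO stands in for the open file, and the shortened sample text below is an assumption based on the question:

```python
import io
import pandas as pd
from collections import defaultdict

# Shortened sample standing in for file.txt
text = """# NISN:- 1234567
# FullName:- Joe Doe
# FirstName:- Joe
# LastName:- Doe
# NISN:- 8901234
# FullName:- Jenny Low
# FirstName:- Jenny
# LastName:- Low"""

data = defaultdict(list)
for line in io.StringIO(text):
    line = line.strip("# \n")  # remove the leading "# " and the trailing newline
    if not line:
        continue
    name, value = line.split(":- ", 1)
    data[name].append(value)

# Each dict key becomes a column; each list becomes that column's values
df = pd.DataFrame.from_dict(data)
print(df)
```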
Upvotes: 1