Reputation: 25
I have a text file with a column header and data, and I am trying to load it into a pandas DataFrame.
File:
#Columns: TargetDoc|GRank|LRank|Priority|Loc ID
aaaaa|1|1|Slow|8gkahinka.01
aaaaa|1|0|Slow|7nlafnjbaflnbja.01
I wrote the code below. First I split each line into a list, then I try to convert that list into a DataFrame:
import os
import pandas as pd
with open("DocID101_201604070523.txt") as raw_file:
    full_file_text = raw_file.readlines()
raw_file.close()
data_list = list()
for l in full_file_text:
    if l.startswith('#'):
        labels = l.strip().replace('#Columns: ','').split('|')
    else:
        data_list += l.strip().split('|')
df = pd.DataFrame.from_records(data_list,columns=labels)
But I get an error when creating df:
AssertionError: 5 columns passed, passed data had 10 columns.
What's wrong with my code, and is there a better way to convert this file to a DataFrame?
Upvotes: 2
Views: 3642
Reputation: 393863
You can just read in the file using read_csv with sep='|' and then fix the first column name as a post-processing step using rename:
In [228]:
import io
import pandas as pd
t="""#Columns: TargetDoc|GRank|LRank|Priority|Loc ID
aaaaa|1|1|Slow|8gkahinka.01
aaaaa|1|0|Slow|7nlafnjbaflnbja.01"""
df = pd.read_csv(io.StringIO(t), sep='|')
df
Out[228]:
  #Columns: TargetDoc  GRank  LRank Priority              Loc ID
0               aaaaa      1      1     Slow        8gkahinka.01
1               aaaaa      1      0     Slow  7nlafnjbaflnbja.01
Now rename the first column: pass a dict whose key is the current first column name and whose value is the last piece of that name after splitting it on whitespace:
In [229]:
df.rename(columns={df.columns[0]:df.columns[0].split()[-1]}, inplace=True)
df
Out[229]:
  TargetDoc  GRank  LRank Priority              Loc ID
0     aaaaa      1      1     Slow        8gkahinka.01
1     aaaaa      1      0     Slow  7nlafnjbaflnbja.01
So in your case:
df = pd.read_csv("DocID101_201604070523.txt", sep='|')
and then rename the first column as shown above.
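The two steps can be wrapped into one small helper. This is only a sketch: the function name read_pipe_table is hypothetical, and it assumes the header always has the form "#Columns: <name>|..." and that the first column name itself contains no spaces.

```python
import io
import pandas as pd

def read_pipe_table(source):
    """Read a '|'-separated table whose header line starts with '#Columns: '.

    Hypothetical helper; `source` may be a file path or a file-like object.
    """
    df = pd.read_csv(source, sep="|")
    first = df.columns[0]
    # Keep only the last whitespace-separated token,
    # e.g. "#Columns: TargetDoc" becomes "TargetDoc".
    return df.rename(columns={first: first.split()[-1]})

# Sample data copied from the question.
sample = """#Columns: TargetDoc|GRank|LRank|Priority|Loc ID
aaaaa|1|1|Slow|8gkahinka.01
aaaaa|1|0|Slow|7nlafnjbaflnbja.01"""

df = read_pipe_table(io.StringIO(sample))
print(df.columns.tolist())
```

Returning the result of rename rather than using inplace=True keeps the helper free of side effects. If the first column name could contain spaces, stripping the '#Columns: ' prefix explicitly would be safer than split()[-1].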
Upvotes: 3
Reputation: 1484
That's because you are concatenating all rows into one flat list with:
data_list += l.strip().split('|')
What you want is:
data_list.append(l.strip().split('|'))
This way, you will get a list of lists, each with 5 elements.
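For illustration, here is a rough end-to-end version of the loop with append. It reads the question's sample data from an in-memory string instead of the real file, and the variable names (sample, rows) are chosen here only for the sketch.

```python
import io
import pandas as pd

# Sample data copied from the question; a real run would open the file instead.
sample = """#Columns: TargetDoc|GRank|LRank|Priority|Loc ID
aaaaa|1|1|Slow|8gkahinka.01
aaaaa|1|0|Slow|7nlafnjbaflnbja.01"""

labels = []
rows = []
for line in io.StringIO(sample):
    line = line.strip()
    if not line:
        continue  # skip blank lines
    if line.startswith("#"):
        # Header line: drop the '#Columns: ' prefix and split into column names.
        labels = line.replace("#Columns: ", "").split("|")
    else:
        # append keeps each row as its own 5-element list;
        # += would extend `rows` with the 5 individual strings instead.
        rows.append(line.split("|"))

df = pd.DataFrame.from_records(rows, columns=labels)
print(df)
```

Note that every value built this way stays a string (GRank holds '1', not 1), whereas read_csv infers numeric types for such columns.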
Edit: That said, the read_csv approach with sep='|' from the other answer is the recommended solution.
Upvotes: 1