Reputation: 735
I am working through ThinkStats, but decided to learn Pandas along the way as well. The code below reads data from a file, does some checking, and then appends each value to a list, so I end up with several lists containing the data I need. The code works (except for scrambling up the columns...)
My question is: What is the best way to build a dataframe from these lists? More generally, am I accomplishing my goal in the most efficient manner?
import numpy as np
import pandas as pd

preglength = []
caseid = []
outcome = []
birthorder = []
finalweight = []
with open('2002FemPreg.dat') as f:
    for line in f:
        caseid.append(int(line[0:13].strip()))
        preglength.append(int(line[274:276].strip()))
        outcome.append(int(line[276].strip()))
        try:
            birthorder.append(int(line[277:279]))
        except ValueError:
            birthorder.append(np.nan)
        finalweight.append(float(line[422:440].strip()))
c1 = pd.Series(caseid)
c2 = pd.Series(preglength)
c3 = pd.Series(outcome)
c4 = pd.Series(birthorder)
c5 = pd.Series(finalweight)
data = pd.DataFrame({'caseid': c1,'preglength': c2,'outcome': c3,'birthorder': c4,'weight': c5})
print(data.head())
Upvotes: 0
Views: 108
Reputation: 353419
I would probably use read_fwf:
>>> df = pd.read_fwf("2002FemPreg.dat",
... colspecs=[(0,13), (274, 276), (276, 277), (277, 279), (422, 440)],
... names=["caseid", "preglength", "outcome", "birthorder", "finalweight"])
>>> df.head()
   caseid  preglength  outcome  birthorder   finalweight
0       1          39        1           1   6448.271112
1       1          39        1           2   6448.271112
2       2          39        1           1  12999.542264
3       2          39        1           2  12999.542264
4       2          39        1           3  12999.542264
[5 rows x 5 columns]
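If you want to try this without the real data file, here is a minimal sketch that runs read_fwf on a few made-up fixed-width records held in memory. The records, the column widths, and the variable names (`records`, `raw`, `from_lists`) are assumptions for illustration only and do not match the actual 2002FemPreg.dat layout. It also shows that a blank field comes back as NaN, which is what the try/except in the question handles by hand, and that passing `columns=` to the DataFrame constructor fixes the column order when building from lists.

```python
import io

import pandas as pd

# Made-up records for illustration; NOT taken from 2002FemPreg.dat.
# Fields: caseid, preglength, outcome, birthorder (may be blank), finalweight.
records = [
    (1, 39, 1, "1", 6448.2711),
    (2, 39, 1, "2", 12999.5423),
    (3, 9, 4, "", 7226.3017),  # blank birthorder
]

# Lay the fields out in fixed widths: 5, 2, 1, 2 and 10 characters.
raw = "\n".join(
    f"{cid:>5}{plen:>2}{out:>1}{order:>2}{wt:>10.4f}"
    for cid, plen, out, order, wt in records
)

names = ["caseid", "preglength", "outcome", "birthorder", "finalweight"]

# colspecs are half-open [start, end) character offsets, like Python slices.
df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=[(0, 5), (5, 7), (7, 8), (8, 10), (10, 20)],
    names=names,
    header=None,
)
print(df)

# Building from existing lists: `columns=` sets the column order explicitly.
caseid = [1, 2, 3]
preglength = [39, 39, 9]
from_lists = pd.DataFrame(
    {"preglength": preglength, "caseid": caseid},
    columns=["caseid", "preglength"],
)
print(from_lists)
```

The colspecs follow the same start/end convention as the slices in the question, so `line[274:276]` becomes `(274, 276)` and the single character `line[276]` becomes `(276, 277)`.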
Upvotes: 3