Reputation: 23567
I just got my hands on pandas and am figuring out how I can read a file. The file is from the WRDS database and is the SP500 constituents list all the way back to the 1960s. I checked the file and no matter what I do to import it using read_csv
, I still can't display the data correctly.
df = read_csv('sp500-sb.txt')
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231 entries, 0 to 1230
Data columns: gvkeyx from thru conm
gvkey co_conm
...(the column names)
dtypes: object(1)
What does the above chunk of output mean? Anything would be helpful.
Upvotes: 16
Views: 27700
Reputation: 4375
pandas.read_fwf()
was added in pandas 0.7.3 (April 2012) to handle fixed-width files.
Upvotes: 28
Reputation: 23567
Wes answered me in an email. Cheers.
This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.
I recommend following the issue on github as your e-mail is about to disappear from my inbox :)
https://github.com/pydata/pandas/issues/920
best, Wes
Upvotes: 10
Reputation: 200
user, if you need to deal with the fixed format right now, you can use something like the following:
def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
reader = open(filename, 'r')
# skip first rows
for i in xrange(ignore_first_rows):
reader.next()
if first_column_is_index:
index = slice(0, fields[1])
fields = [slice(*x) for x in zip(fields[1:-1], fields[2:])]
return ((line[index], [line[x].strip() for x in fields]) for line in reader)
else:
fields = [slice(*x) for x in zip(fields[:-1], fields[1:])]
return ((i, [line[x].strip() for x in fields]) for i,line in enumerate(reader))
Here's a test program:
import pandas
import numpy
import tempfile
# create a data frame
df = pandas.DataFrame(numpy.random.randn(100, 5))
file_ = tempfile.NamedTemporaryFile(delete=True)
file_.write(df.to_string())
file_.flush()
# specify fields
fields = [0, 3, 12, 22, 32, 42, 52]
df2 = pandas.DataFrame.from_items( fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1) ).T
# need to specify the datatypes, otherwise everything is a string
df2 = pandas.DataFrame(df2, dtype=float)
df2.index = [int(x) for x in df2.index]
# check
assert (df - df2).abs().max().max() < 1E-6
This should do the trick if you need it right now, but bear in mind that the function above is very simple, in particular it doesn't do anything about data types.
Upvotes: 0
Reputation: 200
What do you mean by display? Doesn't df['gvkey']
give you the data in the gvkey column?
If what you do is print the whole data frame to the console, then take a look at df.to_string()
, but it'll be hard to read if you have too many columns. Pandas won't print the whole thing by default if you have too many columns:
import pandas
import numpy
df1 = pandas.DataFrame(numpy.random.randn(10, 3), columns=['col%d' % d for d in range(3)] )
df2 = pandas.DataFrame(numpy.random.randn(10, 30), columns=['col%d' % d for d in range(30)] )
print df1 # <--- substitute by df2 to see the difference
print
print df1['col1']
print
print df1.to_string()
Upvotes: 0