Cristian
Cristian

Reputation: 303

Reading a CSV file with irregular number of columns using Pandas

I am trying to read a csv file, which doesn't contain a header line, and it contains an indefinite amount of columns, with pandas.

I have search how to work around this, but all the answers that I have found require for me to already know (search by opening the file) the maximum number that a column can have and create a names= attribute on read_csv function, like this:

names = ["a", "b", "c", "d"]
table = pandas.read_csv('freqs.tsv', header=None, sep='\t+', names=names)

My question is, is there any possible ways to do this without knowing the maximum number of columns? For future reusability of the script, I want to generalize if it is possible.

Here is a sample text file I was using to run some tests:

mathematics         1.548438245 1.4661764369999999      1.429891562 
english                     1.237816576 1.043399455
physics         2.415563662 11.165497484000001  5.954598265 7.853732762999999   7.929835858
drama           2.0439384830000003  9.81210385  5.068332477 8.579349377 5.962282599999999
health                      1.557941553 1.222267933
science                         1.550193476
gym             1.240610831 1.149375944 1.899408195 1.3713249980000002

Thank you

Upvotes: 1

Views: 2923

Answers (1)

Kraigolas
Kraigolas

Reputation: 5560

I get the following output

0 1 2 3 4
mathematics 1.54844 1.46618 1.42989 nan nan
english 1.23782 1.0434 nan nan nan
physics 2.41556 11.1655 5.9546 7.85373 7.92984
drama 2.04394 9.8121 5.06833 8.57935 5.96228
health 1.55794 1.22227 nan nan nan
science 1.55019 nan nan nan nan
gym 1.24061 1.14938 1.89941 1.37132 nan

By writing:

import pandas as pd 
# Assume your data is in test.txt in the current working directory 
f = open("test.txt", "r")

# This assumes your spacing is arbitrary 
data = [line.split() for line in f]
data = {line[0] : [float(item) for item in line[1:]] for line in data}
# The orient = "index" allows us to handle differing lengths of entries
df = pd.DataFrame.from_dict(data, orient="index")

# this just provides the above table for printing in StackOverflow
print(df.to_markdown()) 

Note that I've assumed the spacing in your file is arbitrary and hence we don't need to track which columns are empty, we can just split at spaces and keep the values.

Also note that nan means "not a number" and is what you should expect to see in your dataframe if you have rows of differing lengths.

Finally, if you want the subjects as the columns, use df = df.transpose().

Upvotes: 2

Related Questions