Reading a CSV file with irregular number of columns using Pandas

Question

I am trying to read a csv file, which doesn't contain a header line, and it contains an indefinite amount of columns, with pandas.

I have search how to work around this, but all the answers that I have found require for me to already know (search by opening the file) the maximum number that a column can have and create a names= attribute on read_csv function, like this:

names = ["a", "b", "c", "d"]
table = pandas.read_csv('freqs.tsv', header=None, sep='	+', names=names)

My question is, is there any possible ways to do this without knowing the maximum number of columns? For future reusability of the script, I want to generalize if it is possible.

Here is a sample text file I was using to run some tests:

mathematics         1.548438245 1.4661764369999999      1.429891562 
english                     1.237816576 1.043399455
physics         2.415563662 11.165497484000001  5.954598265 7.853732762999999   7.929835858
drama           2.0439384830000003  9.81210385  5.068332477 8.579349377 5.962282599999999
health                      1.557941553 1.222267933
science                         1.550193476
gym             1.240610831 1.149375944 1.899408195 1.3713249980000002

Thank you

Kraigolas · Accepted Answer

I get the following output

	0	1	2	3	4
mathematics	1.54844	1.46618	1.42989	nan	nan
english	1.23782	1.0434	nan	nan	nan
physics	2.41556	11.1655	5.9546	7.85373	7.92984
drama	2.04394	9.8121	5.06833	8.57935	5.96228
health	1.55794	1.22227	nan	nan	nan
science	1.55019	nan	nan	nan	nan
gym	1.24061	1.14938	1.89941	1.37132	nan

By writing:

import pandas as pd 
# Assume your data is in test.txt in the current working directory 
f = open("test.txt", "r")

# This assumes your spacing is arbitrary 
data = [line.split() for line in f]
data = {line[0] : [float(item) for item in line[1:]] for line in data}
# The orient = "index" allows us to handle differing lengths of entries
df = pd.DataFrame.from_dict(data, orient="index")

# this just provides the above table for printing in StackOverflow
print(df.to_markdown())

Note that I've assumed the spacing in your file is arbitrary and hence we don't need to track which columns are empty, we can just split at spaces and keep the values.

Also note that nan means "not a number" and is what you should expect to see in your dataframe if you have rows of differing lengths.

Finally, if you want the subjects as the columns, use df = df.transpose().

Reading a CSV file with irregular number of columns using Pandas

Answers (1)

Related Questions