Reputation: 303
I am trying to read a csv file, which doesn't contain a header line, and it contains an indefinite amount of columns, with pandas.
I have search how to work around this, but all the answers that I have found require for me to already know (search by opening the file) the maximum number that a column can have and create a names=
attribute on read_csv
function, like this:
names = ["a", "b", "c", "d"]
table = pandas.read_csv('freqs.tsv', header=None, sep='\t+', names=names)
My question is, is there any possible ways to do this without knowing the maximum number of columns? For future reusability of the script, I want to generalize if it is possible.
Here is a sample text file I was using to run some tests:
mathematics 1.548438245 1.4661764369999999 1.429891562
english 1.237816576 1.043399455
physics 2.415563662 11.165497484000001 5.954598265 7.853732762999999 7.929835858
drama 2.0439384830000003 9.81210385 5.068332477 8.579349377 5.962282599999999
health 1.557941553 1.222267933
science 1.550193476
gym 1.240610831 1.149375944 1.899408195 1.3713249980000002
Thank you
Upvotes: 1
Views: 2923
Reputation: 5560
I get the following output
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
mathematics | 1.54844 | 1.46618 | 1.42989 | nan | nan |
english | 1.23782 | 1.0434 | nan | nan | nan |
physics | 2.41556 | 11.1655 | 5.9546 | 7.85373 | 7.92984 |
drama | 2.04394 | 9.8121 | 5.06833 | 8.57935 | 5.96228 |
health | 1.55794 | 1.22227 | nan | nan | nan |
science | 1.55019 | nan | nan | nan | nan |
gym | 1.24061 | 1.14938 | 1.89941 | 1.37132 | nan |
By writing:
import pandas as pd
# Assume your data is in test.txt in the current working directory
f = open("test.txt", "r")
# This assumes your spacing is arbitrary
data = [line.split() for line in f]
data = {line[0] : [float(item) for item in line[1:]] for line in data}
# The orient = "index" allows us to handle differing lengths of entries
df = pd.DataFrame.from_dict(data, orient="index")
# this just provides the above table for printing in StackOverflow
print(df.to_markdown())
Note that I've assumed the spacing in your file is arbitrary and hence we don't need to track which columns are empty, we can just split at spaces and keep the values.
Also note that nan
means "not a number" and is what you should expect to see in your dataframe if you have rows of differing lengths.
Finally, if you want the subjects as the columns, use df = df.transpose()
.
Upvotes: 2