thenac

Reputation: 305

Can't read .txt file with pandas because it's in a weird shape

I have a data set that contains information from an experiment about particles. You can find it here (hope links are ok, if not let me know and I'll remove it immediately):

http://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification

I'm trying to read this set in pandas, and the problem is that pandas reads the txt file as a data frame with 130,064 rows, which is correct, but only 1 column. If you check the txt file in the link, you will see that it is in a weird shape: each line starts with spaces, and there are 2 spaces between each column. I tried the command

df = pd.read_csv("path/file.txt", header = None)

and also

df = pd.read_csv("path/file.txt", sep = "  ", header = None)

where I set 2 spaces as the separator. Neither works. The 1st line of the file also has 2 numbers that just represent the number of rows, which I deleted. For anyone who can't or doesn't want to open the link or the data set, here is a picture of some columns.

enter image description here

This is just a portion of it and not the whole data. On the leftmost side, there are 2 spaces between the edge of the window and the first column, as I said. When reading it using pandas, this is what I get

enter image description here

Any advice/help would be appreciated. Thanks


EDIT: I tried the following and I think it worked. First I imported the .txt file using NumPy, after deleting the first row of the file, which contains the two irrelevant numbers.

df1 = np.loadtxt("path/file.txt")

This, for some reason, worked and the resulting array was correct. Then I converted this array to a data frame using the commands

df = pd.DataFrame(df1)
df.columns = ['X' + str(x) for x in range(50) ]

And yeah, I think it works. Check the following picture.

enter image description here

I think it's correct, but if you guys find something wrong let me know.
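
A minimal sketch of the same NumPy route without editing the file by hand, assuming the first line still holds the two count numbers. The three-column sample text below is made up for illustration (the real file has 50 columns); `np.loadtxt` splits on any whitespace by default, and `skiprows=1` skips the count line.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample mimicking the described layout: a count line first,
# then rows with leading spaces and two spaces between values.
sample = io.StringIO(
    "  2  1\n"
    "  2.59413  0.468803  20.6916\n"
    "  3.86388  0.645781  18.1375\n"
)

# skiprows=1 drops the count line; the default delimiter is any whitespace
arr = np.loadtxt(sample, skiprows=1)

# Name the columns X0, X1, ... based on however many columns were read
df = pd.DataFrame(arr, columns=[f"X{i}" for i in range(arr.shape[1])])
print(df)
```

For the real file, the `io.StringIO` object would be replaced by the path to the .txt file.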

Upvotes: 0

Views: 1844

Answers (2)

Shenanigator

Reputation: 1066

Edited

import pandas as pd

columns = [f"Obs{i}" for i in range(1, 51)]  # 'Obs1' ... 'Obs50'
# names= (not columns=) sets the column labels; r"\s+" treats any run of spaces as one separator
# skiprows=1 skips the first line with the row counts (drop it if you already deleted that line)
df = pd.read_csv("path/file.txt", sep=r"\s+", names=columns, skiprows=1)
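
To see why the separator matters here, a small sketch on made-up sample text (three columns instead of 50, values are hypothetical): with the default comma separator every line ends up in a single column, while the regex `\s+` treats any run of whitespace, including the leading spaces, as one delimiter.

```python
import io

import pandas as pd

# Hypothetical sample in the described layout: count line, leading spaces,
# two spaces between values.
text = (
    "  2  1\n"
    "  2.59413  0.468803  20.6916\n"
    "  3.86388  0.645781  18.1375\n"
)

# Default separator is a comma, so each whole line becomes one column
one_col = pd.read_csv(io.StringIO(text), header=None)

# Whitespace regex separator; skiprows=1 drops the count line
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, skiprows=1)

print(one_col.shape[1], df.shape[1])
```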

Upvotes: 3

Manuel

Reputation: 802

You could try creating the dataframe from lists instead of the txt file, something like the following:

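import pandas as pd

# We put all the non-empty lines in a list
with open("dataset.txt") as fp:
    data = [line for line in fp.read().split('\n') if line.strip()]

# split() with no argument splits on any run of whitespace and ignores leading
# spaces, so it does not matter whether 1 or 2 spaces separate the values
df_data = [item.split() for item in data]

# df_data should be something like [[row1col1,row1col2,row1col3],[row2col1,row2col2,row2col3]]

# List to dataframe (the values are still strings; df.astype(float) converts them)
df = pd.DataFrame(df_data)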

Doing this from memory, so watch out for syntax. Hope this helps!
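
A short sketch of how the splitting behaves on one line in this layout (the sample line is made up): splitting on a literal two-space string leaves an empty first item because of the leading spaces, whereas `split()` with no argument does not. The conversion to floats at the end is an added step, not part of the answer above.

```python
import pandas as pd

# Hypothetical line in the described layout
line = "  2.59413  0.468803  20.6916"

by_two_spaces = line.split("  ")  # leading spaces produce an empty first item
by_whitespace = line.split()      # any run of whitespace, no empty items

# Build the frame from the clean split and convert the strings to floats
df = pd.DataFrame([by_whitespace]).astype(float)

print(by_two_spaces)
print(by_whitespace)
```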

Upvotes: 1
