Reputation: 305
I have a data set that contains information from an experiment about particles. You can find it here (hope links are ok, if not let me know and i'll remove immediately) :
http://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
Trying to read this set in pandas and im encountering the problem of pandas reading this txt as a data frame with 130.064 lines, which is correct, but 1 column. If you check the txt file in the link, you will see that it is in a weird shape, with spaces in the beginning and then 2 spaces between each column. I tried the command
df = pd.read_csv("path/file.txt", header = None)
and also
df = pd.read_csv("path/file.txt", sep = " ", header = None)
where I set 2 spaces as the separator. Nothing works. The file also, in the 1st line, has 2 numbers that just represent the number of rows, which I deleted. For someone who can't/doesn't want to open the link or the data set, here is a picture of some columns.
This is just a portion of it and not the whole data. In the leftmost side, there are 2 spaces between the edge of the window and the first column, as I said. When reading it using pandas this is what I get
Any advice/help would be appreciated. Thanks
EDIT I tried doing the following and I think it worked. First I imported the .txt file using NumPy, after deleting the first row from the data frame which contains the two irrelevant numbers.
df1 = np.loadtxt("path/file.txt")
This, for some reason, worked and the resulting array was correct. Then I converted this array to data frame using the command
df = pd.DataFrame(df1)
df.columns = ['X' + str(x) for x in range(50) ]
And yeah, I think it works. Check the following picture.
I think its correct but if you guys find something wrong let me know.
Upvotes: 0
Views: 1844
Reputation: 1066
Edited
columns = ['Obs1','Obs2','Obs3','Obs4','Obs5','Obs6','Obs7','Obs8','Obs9','Obs10','Obs11','Obs12','Obs13','Obs14','Obs15','Obs16','Obs17','Obs18','Obs19','Obs20','Obs21','Obs22','Obs23','Obs24','Obs25','Obs26','Obs27','Obs28','Obs29','Obs30','Obs31','Obs32','Obs33','Obs34','Obs35','Obs36','Obs37','Obs38','Obs39','Obs40','Obs41','Obs42','Obs43','Obs44','Obs45','Obs46','Obs47','Obs48','Obs49','Obs50']
df = pd.read_csv("path/file.txt", sep = " ", columns=columns , skiprows=1)
Upvotes: 3
Reputation: 802
You could try creating the dataframe from lists instead of the txt file, something like the following:
#We put all the lines in a list
data = []
with open("dataset.txt") as fp:
lines = fp.read()
data = lines.split('\n')
df_data= []
for item in data:
df_data.append(item.split(' ')) #I cant see if 1 space or 2 separate the values
#df_data should be something like [[row1col1,row1col2,row1col3],[row2col1,row2col2,row3col3]]
#List to dataframe
df = pd.DataFrame(df_data)
Doing this by memory so watch out for syntax, hope this helps!
Upvotes: 1