Reputation: 2632
I have a file.csv
with ~15k rows that looks like this
SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
I wanted it to be imported to pandas.DataFrame
with any random value given to the column that don't have a header, something like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM 1 2 3 4 5 6
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2, 0, 5, 59, 4, 0, 0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 6, 0, nan
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0, 0, 5, 0, 7, nan nan
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2, 0, 5, 0, 0, 2, nan
This has been impossible to import, as i tried different solution, such as giving a specific a header, But still no joy, the only way i was able to make it work is to add a header manually in the .csv
file. which kinda defeat the purpose of automation!
Then i tried this solution: Doing this
lines=list(csv.reader(open('file.csv')))
header, values = lines[0], lines[1:]
it correctly reads the files giving me a list of ~15k element values
, each element is a list of string, where each string is correctly parsed data field from the file, but when i try to do this:
data = {h:v for h,v in zip (header, zip(*values))}
df = pd.DataFrame.from_dict(data)
or this:
data2 = {h:v for h,v in zip (str(xrange(16)), zip(*values))}
df2 = pd.DataFrame.from_dict(data)
Then the non headered columns disappear and the order of columns is completely mixed. any idea of a possible solution ?
Upvotes: 7
Views: 6028
Reputation: 180411
You can create columns based on the length of the first actual row:
from tempfile import TemporaryFile
with open("out.txt") as f, TemporaryFile("w+") as t:
h, ln = next(f), len(next(f).split(","))
header = h.strip().split(",")
f.seek(0), next(f)
header += range(ln)
print(pd.read_csv(f, names=header))
Which will give you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 13 14 15 16 17 18 19 20 21 22
0 0 0 ... 0 0 0 0 0 NaN NaN NaN NaN NaN
1 0 0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 0 ... 4 0 0 0 NaN NaN NaN NaN NaN NaN
3 0 0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
[4 rows x 27 columns]
Or you could clean the file before passing to pandas:
import pandas as pd
from tempfile import TemporaryFile
with open("in.csv") as f, TemporaryFile("w+") as t:
for line in f:
t.write(line.replace(" ", ""))
t.seek(0)
ln = len(line.strip().split(","))
header = t.readline().strip().split(",")
header += range(ln)
print(pd.read_csv(t,names=header))
Which gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 4 5 ... 11 \
0 2015-07-1516:41:56 0-0-0-0-3 1 2 0 5 59 0 0 0 ... 0
1 2015-07-1516:42:55 0-0-0-0-3 1 0 0 5 9 0 0 0 ... 0
2 2015-07-1516:43:55 0-0-0-0-3 1 0 0 5 5 0 0 0 ... 0
3 2015-07-1516:44:56 0-0-0-0-3 1 2 0 5 0 0 0 0 ... 0
12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 NaN NaN NaN
1 50 0 NaN NaN NaN NaN NaN NaN NaN
2 0 4 0 0 0 NaN NaN NaN NaN
3 6 0 0 0 0 NaN NaN NaN NaN
[4 rows x 25 columns]
or to drop the columns will all nana:
print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))
Gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 8 9 10 11 12 13 14 15 16 17
0 0 0 ... 2 0 0 0 0 0 0 0 0 0
1 0 0 ... 2 0 0 0 50 0 NaN NaN NaN NaN
2 0 0 ... 2 0 0 0 0 4 0 0 0 NaN
3 0 0 ... 2 0 0 0 6 0 0 0 0 NaN
[4 rows x 22 columns]
Upvotes: 8
Reputation: 862651
You can split column HISTOGRAM
to new DataFrame
and concat
it to original.
print df
SAMPLE_TIME, POS, OFF, \
0 2015-07-15 16:41:56 0-0-0-0-3, 1,
1 2015-07-15 16:42:55 0-0-0-0-3, 1,
2 2015-07-15 16:43:55 0-0-0-0-3, 1,
3 2015-07-15 16:44:56 0-0-0-0-3, 1,
HISTOGRAM
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
#create new dataframe from column HISTOGRAM
h = pd.DataFrame([ x.split(',') for x in df['HISTOGRAM'].tolist()])
print h
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 2 0 5 59 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
1 0 0 5 9 0 0 0 0 0 2 0 0 0 50 0 None None None None
2 0 0 5 5 0 0 0 0 0 2 0 0 0 0 4 0 0 0 None
3 2 0 5 0 0 0 0 0 0 2 0 0 0 6 0 0 0 0 None None
#append to original, rename 0 column
df = pd.concat([df, h], axis=1).rename(columns={0:'HISTOGRAM'})
print df
HISTOGRAM HISTOGRAM 1 2 3 4 5 ... 10 \
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0, 2 0 5 59 0 0 ... 0
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0, 0 0 5 9 0 0 ... 0
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0, 0 0 5 5 0 0 ... 0
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0 2 0 5 0 0 0 ... 0
11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0
1 0 0 50 0 None None None None
2 0 0 0 4 0 0 0 None
3 0 0 6 0 0 0 0 None None
[4 rows x 24 columns]
Upvotes: 3
Reputation: 2293
So how about this. I made a csv from your sample data.
When I import lines:
with open('test.csv','rb') as f:
lines = list(csv.reader(f))
headers, values =lines[0],lines[1:]
to generate nice header names, use this line:
headers = [i or ind for ind, i in enumerate(headers)]
so because of how (I assume) csv works, headers should have a bunch of empty string values. empty strings evaluate to False, so this comprehension returns numbered columns for each column without a header.
Then just make a df:
df = pd.DataFrame(values,columns=headers)
which looks like:
11: SAMPLE_TIME POS OFF HISTOGRAM 4 5 6 7 8 9 \
0 15/07/2015 16:41 0-0-0-0-3 1 2 0 5 59 0 0 0
1 15/07/2015 16:42 0-0-0-0-3 1 0 0 5 9 0 0 0
2 15/07/2015 16:43 0-0-0-0-3 1 0 0 5 5 0 0 0
3 15/07/2015 16:44 0-0-0-0-3 1 2 0 5 0 0 0 0
... 12 13 14 15 16 17 18 19 20 21
0 ... 2 0 0 0 0 0 0 0 0 0
1 ... 2 0 0 0 50 0
2 ... 2 0 0 0 0 4 0 0 0
3 ... 2 0 0 0 6 0 0 0 0
[4 rows x 22 columns]
Upvotes: -1
Reputation: 269
Assuming your data is in a file called foo.csv, you could do the following. This was tested against Pandas 0.17
df = pd.read_csv('foo.csv', names=['sample_time', 'pos', 'off', 'histogram', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], skiprows=1)
Upvotes: -1