Reputation: 771
I'm a 'newbie' in python (started learning 2 weeks ago) and I'm trying to plot a file that looks like this:
"1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112
Using any of the following: pandas, matplotlib and numpy. But I am not having much of a success. I tried searching for examples but none applied to my data format.
Can somebody help me find out how to load this file in a pandas dataframe or (what would be even better) show me how to plot this?
Details:
After the help from @Goyo I changed my method convert()
to be something like this:
#!/usr/bin/env python3
def convert(in_file, out_file):
name = ""
for line in in_file:
line = line.strip()
print(line)
if line == "":
continue
if line.startswith('"'):
name = line.strip('"')
print("NAME:: " + name)
else:
out_file.write("{0}\n".format(','.join([name] + line.split("\t")) ) )
To plot I'm using the following code:
with open('nro_caribou.dat') as in_file:
with open('output.txt', 'w+') as out_file:
convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None,names=['Methods', 'Param', 'Time'], sep=",", )
print(df)
df.pivot(values='Time', index='Param', columns='Methods').plot()
My original data: https://gist.github.com/pedro-stanaka/c3eda0aa2191950a8d83
And my plot:
Upvotes: 2
Views: 2623
Reputation: 2690
You can step through the file using itertools.groupby
. The LastHeader class below checks each line for a sentinal character. If the character is there, the headerline is updated, and itertools.groupby
starts a new segment. The only place this runs into trouble with your dataset is where you have two series labeled "CRE". My workaround was to just delete the second one from the textfile, but you'll probably want to do something else.
The upshot here is that you can just injest the data in a single pass. No writing out and reading back in required.
from itertools import groupby
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series
class LastHeader():
"""Checks for new header strings. For use with groupby"""
def __init__(self, sentinel='#'):
self.sentinel = sentinel
self.lastheader = ''
self.index=0
def check(self, line):
self.index += 1
if line.startswith(self.sentinel):
self.lastheader = line
return self.lastheader
fname = 'dist_caribou.dat'
with open(fname, 'r') as fobj:
lastheader = LastHeader('"')
data = []
for headerline, readlines in groupby(fobj, lastheader.check):
name = headerline.strip().strip('"')
thisdat = np.loadtxt(readlines, comments='"')
data.append(Series(thisdat[:, 1], index=thisdat[:, 0], name=name))
data = pd.concat(data, axis=1)
data.plot().set_yscale('log')
plt.show()
Upvotes: 1
Reputation: 862641
I think you can read_csv
only once and then post processing create dataframe
:
import pandas as pd
import io
temp=u""""1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112
"""
s = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
sep="\s+",
engine='python', #because ParserWarning
squeeze=True,
header=None) #try convert output to series
print s
"1stSerie" NaN
2 23
4 12
6 12
"2ndSerie" NaN
2 51
4 90
6 112
Name: 0, dtype: float64
df = s.reset_index()
#set column names
df.columns = ['idx','val']
#try convert column idx to numeric, if string get NaN
print pd.to_numeric(df['idx'], errors='coerce')
0 NaN
1 2
2 4
3 6
4 NaN
5 2
6 4
7 6
Name: idx, dtype: float64
#find NaN - which values are string
print pd.isnull(pd.to_numeric(df['idx'], errors='coerce'))
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
Name: idx, dtype: bool
#this values get to new column names
df.loc[pd.isnull(pd.to_numeric(df['idx'], errors='coerce')), 'names'] = df['idx']
#forward fill NaN values
df['names'] = df['names'].ffill()
#remove values, where column val in NaN
df = df[pd.notnull(df['val'])]
print df
idx val names
1 2 23 "1stSerie"
2 4 12 "1stSerie"
3 6 12 "1stSerie"
5 2 51 "2ndSerie"
6 4 90 "2ndSerie"
7 6 112 "2ndSerie"
df.pivot(index='idx', columns='names', values='val').plot()
Or you can use read_csv
and plot
. If you need set names of Series
to legend
, use figure
and legend
:
import pandas as pd
import matplotlib.pyplot as plt
import io
temp=u""""1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112"""
s1 = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
sep="\s+",
engine='python', #because ParserWarning
nrows=3, #read only 3 rows of data
squeeze=True) #try convert output to series
print s1
2 23
4 12
6 12
Name: "1stSerie", dtype: int64
#after testing replace io.StringIO(temp) to filename
s2 = pd.read_csv(io.StringIO(temp),
sep="\s+",
header=4, #read row 4 to header - series name
engine='python',
nrows=3,
squeeze=True)
print s2
2 51
4 90
6 112
Name: "2ndSerie", dtype: int64
plt.figure()
s1.plot()
ax = s2.plot()
ax.legend(['1stSerie','2ndSerie'])
Or you can read file only once and then cut Serie
s
to Series
s1
, s2
and s3
and then create DataFrame
:
import pandas as pd
import matplotlib.pyplot as plt
import io
temp=u""""1stSerie"
2 23
4 12
6 12
"2ndSerie"
2 51
4 90
6 112
"3rdSerie"
2 51
4 90
6 112
"""
s = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
sep="\s+",
engine='python', #because ParserWarning
squeeze=True) #try convert output to series
print s
2 23
4 12
6 12
"2ndSerie" NaN
2 51
4 90
6 112
"3rdSerie" NaN
2 51
4 90
6 112
Name: "1stSerie", dtype: float64
s1 = s[:3]
print s1
2 23
4 12
6 12
Name: "1stSerie", dtype: float64
s2 = s[4:7]
s2.name='2ndSerie'
print s2
2 51
4 90
6 112
Name: 2ndSerie, dtype: float64
s3 = s[8:]
s3.name='3rdSerie'
print s3
2 51
4 90
6 112
Name: 3rdSerie, dtype: float64
print pd.DataFrame({'a': s1, 'b': s2, 'c': s3})
a b c
2 23 51 51
4 12 90 90
6 12 112 112
Upvotes: 1
Reputation: 109546
Given the appropriate parameters for read_csv
in pandas, this is relatively trivial to plot.
s1 = pd.read_csv('series1.txt',
index_col=0,
sep=" ",
squeeze=True,
header=0,
skipinitialspace=True)
>>> s1
tSerie
2 23
4 12
6 12
Name: Unnamed: 1, dtype: int64
s2 = pd.read_csv('series2.txt',
index_col=0,
sep=" ",
squeeze=True,
header=0,
skipinitialspace=True)
%matplotlib inline # If not already enabled.
s1.plot();s2.plot()
Upvotes: 0
Reputation: 12610
AFAIK there's no builtin features in pandas, matplotlib or numpy to read files like that one. If you have some control on the data format I encourage you to change it.
If you have no options but using that format, you can parse the data yourself using just the python I/O and string manipulation features (I do not think pandas can make this easier, it is not designed to deal with these kind of files).
This function can convert data from your format to another more suitable for pandas:
def convert(in_file, out_file):
for line in in_file:
line = line.rstrip(' \n\r')
if not line:
continue
if line.startswith('"'):
name = line.strip('"')
else:
out_file.write('{}\n'.format(','.join([name] + line.split())))
If your original file is 'input.txt' you would use it this way:
with open('input.txt') as in_file:
with open('output.txt', 'w') as out_file:
convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None,
names=['Series', 'X', 'Y'])
print(df)
Series X Y
0 1st Serie 2 23
1 1st Serie 4 12
2 1st Serie 6 12
3 2nd Serie 2 51
4 2nd Serie 4 90
5 2nd Serie 6 112
df.pivot(index='X', columns='Series', values='Y').plot()
Upvotes: 2