pedrostanaka

Reputation: 771

Python Matplotlib - Multiple series in same data file

I'm a newbie in Python (I started learning two weeks ago) and I'm trying to plot a file that looks like this:

"1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

I can use any of pandas, matplotlib and numpy, but I'm not having much success. I searched for examples, but none of them applied to my data format.

Can somebody help me load this file into a pandas DataFrame or, even better, show me how to plot it?

Details:

UPDATE 1:

After the help from @Goyo I changed my method convert() to be something like this:

#!/usr/bin/env python3
def convert(in_file, out_file):
    name = ""
    for line in in_file:
        line = line.strip()
        print(line)
        if line == "":
            continue
        if line.startswith('"'):
            name = line.strip('"')
            print("NAME:: " + name)
        else:
            out_file.write("{0}\n".format(','.join([name] + line.split("\t"))))

To plot I'm using the following code:

import pandas as pd

with open('nro_caribou.dat') as in_file:
    with open('output.txt', 'w+') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None, names=['Methods', 'Param', 'Time'], sep=",")
print(df)
df.pivot(values='Time', index='Param', columns='Methods').plot()

My original data: https://gist.github.com/pedro-stanaka/c3eda0aa2191950a8d83

And my plot:

[Image: the final chart]

Upvotes: 2

Views: 2623

Answers (4)

Elliot

Reputation: 2690

You can step through the file using itertools.groupby. The LastHeader class below checks each line for a sentinel character. If the character is there, the header line is updated and itertools.groupby starts a new segment. The only place this runs into trouble with your dataset is where you have two series labeled "CRE". My workaround was to just delete the second one from the text file, but you'll probably want to do something else.

The upshot here is that you can ingest the data in a single pass. No writing out and reading back in required.

from itertools import groupby
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pandas import DataFrame, Series

class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''
        self.index=0

    def check(self, line):
        self.index += 1
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader

fname = 'dist_caribou.dat'

with open(fname, 'r') as fobj:
    lastheader = LastHeader('"')
    data = []
    for headerline, readlines in groupby(fobj, lastheader.check):
        name = headerline.strip().strip('"')
        thisdat = np.loadtxt(readlines, comments='"')
        data.append(Series(thisdat[:, 1], index=thisdat[:, 0], name=name))
data = pd.concat(data, axis=1)
data.plot().set_yscale('log')
plt.show()
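
For the duplicated "CRE" label mentioned above, one alternative to editing the text file is to make the series names unique before concatenating. This is only a minimal sketch, assuming a numeric suffix is an acceptable way to tell the repeats apart. `unique_names` is a hypothetical helper (not part of pandas or itertools), and "SEQ" is a placeholder label used only for the demonstration:

```python
from collections import Counter

def unique_names(names):
    """Return names with repeats renamed to 'name_2', 'name_3', ...

    Hypothetical helper: the suffix scheme is an assumption, chosen only
    so that every series ends up with a distinct column label.
    """
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        # The first occurrence keeps its label; later ones get a counter suffix.
        result.append(name if seen[name] == 1 else "{}_{}".format(name, seen[name]))
    return result

print(unique_names(["CRE", "SEQ", "CRE"]))
```

In the loop above, the names could be collected in a list and passed through this helper before each Series is given its name.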

Upvotes: 1

jezrael

Reputation: 862641

I think you can call read_csv only once and then build the DataFrame in post-processing:

import pandas as pd
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112
"""

s = pd.read_csv(io.StringIO(temp), #after testing, replace io.StringIO(temp) with the filename
                 sep=r"\s+", 
                 engine='python', #to avoid a ParserWarning
                 squeeze=True, #try to convert the output to a Series (removed in pandas 2.0)
                 header=None)

print(s)

"1stSerie"    NaN
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
Name: 0, dtype: float64

df = s.reset_index()
#set column names
df.columns = ['idx','val']
#try to convert column idx to numeric; strings become NaN
print(pd.to_numeric(df['idx'], errors='coerce'))
0   NaN
1     2
2     4
3     6
4   NaN
5     2
6     4
7     6
Name: idx, dtype: float64

#find the NaN rows, i.e. the values that are strings
print(pd.isnull(pd.to_numeric(df['idx'], errors='coerce')))
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
Name: idx, dtype: bool

#copy these values to a new column, names
df.loc[pd.isnull(pd.to_numeric(df['idx'], errors='coerce')), 'names'] = df['idx']

#forward fill NaN values
df['names'] = df['names'].ffill()

#remove rows where column val is NaN
df = df[pd.notnull(df['val'])]
print(df)
  idx  val       names
1   2   23  "1stSerie"
2   4   12  "1stSerie"
3   6   12  "1stSerie"
5   2   51  "2ndSerie"
6   4   90  "2ndSerie"
7   6  112  "2ndSerie"

df.pivot(index='idx', columns='names', values='val').plot()

[Image: graphs]
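
The same steps (mark the header rows, forward fill the label, drop the header rows, pivot) can also be sketched without `squeeze`, which `read_csv` no longer accepts from pandas 2.0 on. This is a swapped-in variant rather than the code above: it applies pandas string methods to a Series of raw lines instead of calling `read_csv`, and it assumes every data line holds exactly two whitespace-separated numbers. The names `lines`, `is_header`, `body` and `wide` are illustrative only:

```python
import pandas as pd

text = '''"1stSerie"
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112
'''

# One string per line, with surrounding whitespace and blank lines removed.
lines = pd.Series(text.splitlines()).str.strip()
lines = lines[lines != ""]

# Header rows are the ones that start with a double quote.
is_header = lines.str.startswith('"')

# Keep the label on header rows only, then forward fill it onto the data rows.
names = lines.where(is_header).str.strip('"').ffill()

# Split the data rows into two numeric columns and attach the labels.
body = lines[~is_header].str.split(expand=True).astype(float)
body.columns = ["idx", "val"]
body["names"] = names[~is_header]

# One column per series, indexed by the first number on each data line.
wide = body.pivot(index="idx", columns="names", values="val")
print(wide)
```

Calling `wide.plot()` afterwards should draw one line per series, as with the pivot above.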

Or you can use read_csv and plot. If you need to show the Series names in the legend, use figure and legend:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112"""
s1 = pd.read_csv(io.StringIO(temp), #after testing, replace io.StringIO(temp) with the filename
                 sep=r"\s+", 
                 engine='python', #to avoid a ParserWarning
                 nrows=3, #read only 3 rows of data
                 squeeze=True) #try to convert the output to a Series

print(s1)
2    23
4    12
6    12
Name: "1stSerie", dtype: int64

#after testing, replace io.StringIO(temp) with the filename
s2 = pd.read_csv(io.StringIO(temp), 
                 sep=r"\s+", 
                 header=4, #use row 4 as the header, i.e. the series name
                 engine='python',      
                 nrows=3,
                 squeeze=True)

print(s2)
2     51
4     90
6    112
Name: "2ndSerie", dtype: int64

plt.figure()
s1.plot()
ax = s2.plot()
ax.legend(['1stSerie','2ndSerie'])

[Image: graph]

Or you can read the file only once, slice the Series s into Series s1, s2 and s3, and then create a DataFrame:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

"3rdSerie"
2 51
4 90
6 112
"""
s = pd.read_csv(io.StringIO(temp), #after testing, replace io.StringIO(temp) with the filename
                 sep=r"\s+", 
                 engine='python', #to avoid a ParserWarning
                 squeeze=True) #try to convert the output to a Series

print(s)
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
"3rdSerie"    NaN
2              51
4              90
6             112
Name: "1stSerie", dtype: float64
s1 = s[:3]
print(s1)
2    23
4    12
6    12
Name: "1stSerie", dtype: float64

s2 = s[4:7]
s2.name='2ndSerie'
print(s2)
2     51
4     90
6    112
Name: 2ndSerie, dtype: float64

s3 = s[8:]
s3.name='3rdSerie'
print(s3)
2     51
4     90
6    112
Name: 3rdSerie, dtype: float64

print(pd.DataFrame({'a': s1, 'b': s2, 'c': s3}))
    a    b    c
2  23   51   51
4  12   90   90
6  12  112  112

Upvotes: 1

Alexander

Reputation: 109546

Given the appropriate parameters for read_csv in pandas, this is relatively trivial to plot. The code below assumes each series has been saved to its own file (series1.txt and series2.txt).

s1 = pd.read_csv('series1.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)
>>> s1
tSerie
2    23
4    12
6    12
Name: Unnamed: 1, dtype: int64

s2 = pd.read_csv('series2.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)

%matplotlib inline  # If not already enabled.
s1.plot();s2.plot()


Upvotes: 0

Stop harming Monica

Reputation: 12610

AFAIK there are no built-in features in pandas, matplotlib or numpy to read files like that one. If you have some control over the data format, I encourage you to change it.

If you have no option but to use that format, you can parse the data yourself using just Python's I/O and string manipulation features (I do not think pandas can make this easier; it is not designed to deal with this kind of file).

This function converts data from your format to another that is more suitable for pandas:

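def convert(in_file, out_file):
    name = ''  # current series name; stays empty until the first header line
    for line in in_file:
        line = line.rstrip(' \n\r')
        if not line:
            continue  # skip blank lines between series
        if line.startswith('"'):
            name = line.strip('"')  # header line: remember the series name
        else:
            # data line: write "name,x,y"
            out_file.write('{}\n'.format(','.join([name] + line.split())))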

If your original file is 'input.txt' you would use it this way:

import pandas as pd

with open('input.txt') as in_file:
    with open('output.txt', 'w') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None,
                 names=['Series', 'X', 'Y'])
print(df)

     Series  X    Y
0  1stSerie  2   23
1  1stSerie  4   12
2  1stSerie  6   12
3  2ndSerie  2   51
4  2ndSerie  4   90
5  2ndSerie  6  112

df.pivot(index='X', columns='Series', values='Y').plot()
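
If the intermediate output.txt is not wanted, the same parsing can be done in memory. This is only a sketch: `parse_series` is a hypothetical helper, and it assumes two whitespace-separated numbers per data line and a quoted header before the first data line.

```python
def parse_series(lines):
    """Return (series_name, x, y) tuples from the quoted-header format.

    Hypothetical helper; assumes two whitespace-separated numbers per data
    line and that a quoted header precedes the first data line.
    """
    rows = []
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith('"'):
            name = line.strip('"')  # header line: remember the series name
        else:
            x, y = line.split()
            rows.append((name, float(x), float(y)))
    return rows

# A shortened sample in the question's format; a file object works the same way.
sample = ['"1stSerie"    \n', '2 23\n', '4 12\n', '\n', '"2ndSerie"\n', '2 51\n']
rows = parse_series(sample)
print(rows)
```

With pandas available, pd.DataFrame(rows, columns=['Series', 'X', 'Y']) should give the same long-format table as above, ready for the same pivot call.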


Upvotes: 2
