Reputation: 2046
I'm parsing a file of chronologically timestamped data for multiple time series. I would like to parse it in Python and then use matplotlib to create a single plot with an independent line for each time series. The data I'm parsing looks something like this:
time label value
1.05 seriesA 3.925
1.09 seriesC 0.245
2.13 seriesB 12.32
2.73 seriesC 4.921
I've parsed the file into a dictionary of lists that contain (time,value) tuples keyed on the series label. I'm struggling with how to get from this to a single line plot with all this data. I want independent lines for seriesA, seriesB, seriesC, etc. on a single plot. Any pointers?
Edit: As requested the dictionary is below. I had a hard time figuring out the best way to store this data so maybe the data structure I'm using is also a problem. The keys below are the different time series labels and the values are a list of (time,value) tuples. In any case, here it is:
{'client1': [(861.991698574, 298189000.0), (862.000768158, 0.0)],
'client2': [(861.781502324, 0.0), (861.78903722, 153600000.0),
(862.281483262, 0.0), (862.289038158, 153600000.0)], 'client3':
[(862.004470762, 3295674368.0), (862.004563939, 3295674368.0),
(862.03981821, 799014912.0), (862.040403314, 1599078400.0),
(862.540269616, 3295674368.0), (862.55133097, 1599078400.0)]}
Upvotes: 3
Views: 12334
Reputation: 61104
Short answer:
Highlight and ctrl+c the data below:
label time value
client1 861.991699 2.981890e+08
client1 862.000768 0.000000e+00
client2 861.781502 0.000000e+00
client2 861.789037 1.536000e+08
client2 862.281483 0.000000e+00
client2 862.289038 1.536000e+08
client3 862.004471 3.295674e+09
client3 862.004564 3.295674e+09
client3 862.039818 7.990149e+08
client3 862.040403 1.599078e+09
client3 862.540270 3.295674e+09
client3 862.551331 1.599078e+09
Then run this snippet:
# imports
import pandas as pd
# read data from the clipboard
df = pd.read_clipboard(sep='\\s+')
# reshape the data to get values by time for each label
df = df.pivot(index='time', columns='label', values='value')
# Replace NaNs by forward filling existing values
# (df.ffill() replaces df.fillna(method='ffill'), which newer pandas versions deprecate)
df = df.ffill()
# You'll still have to handle the missing values at the beginning of the columns
df = df.bfill()
# A simple plot:
df.plot()
Then you'll get a single line chart with one line per client.
The Details
There are a few confusing elements in this question. You say your source data has the form:
time label value
1.05 seriesA 3.925
1.09 seriesC 0.245
2.13 seriesB 12.32
2.73 seriesC 4.921
But the true content of your data is:
{'client1': [(861.991698574, 298189000.0), (862.000768158, 0.0)],
'client2': [(861.781502324, 0.0), (861.78903722, 153600000.0),
(862.281483262, 0.0), (862.289038158, 153600000.0)], 'client3':
[(862.004470762, 3295674368.0), (862.004563939, 3295674368.0),
(862.03981821, 799014912.0), (862.040403314, 1599078400.0),
(862.540269616, 3295674368.0), (862.55133097, 1599078400.0)]}
Then the true content AND form of your data should be:
label time value
client1 861.991699 2.981890e+08
client1 862.000768 0.000000e+00
client2 861.781502 0.000000e+00
client2 861.789037 1.536000e+08
client2 862.281483 0.000000e+00
client2 862.289038 1.536000e+08
client3 862.004471 3.295674e+09
client3 862.004564 3.295674e+09
client3 862.039818 7.990149e+08
client3 862.040403 1.599078e+09
client3 862.540270 3.295674e+09
client3 862.551331 1.599078e+09
In any case, there is no need to use a dictionary to obtain your
"[...] single line plot with all this data. I want independent lines for seriesA, seriesB, seriesC, etc. on a single plot."
I believe the most efficient approach is the one described under Reshaping and Pivot Tables in the pandas docs. From there you can plot the data directly using df.plot().
Highlight and ctrl+c the data above, and you're good to go:
# imports
import pandas as pd
# read data from the clipboard
df = pd.read_clipboard(sep='\\s+')
# reshape the data to get values by time for each label
df = df.pivot(index='time', columns='label', values='value')
print(df)
This should represent the desired form of your data:
label client1 client2 client3
time
861.781502 NaN 0.0 NaN
861.789037 NaN 153600000.0 NaN
861.991699 298189000.0 NaN NaN
862.000768 0.0 NaN NaN
862.004471 NaN NaN 3.295674e+09
862.004564 NaN NaN 3.295674e+09
862.039818 NaN NaN 7.990149e+08
862.040403 NaN NaN 1.599078e+09
862.281483 NaN 0.0 NaN
862.289038 NaN 153600000.0 NaN
862.540270 NaN NaN 3.295674e+09
862.551331 NaN NaN 1.599078e+09
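If the clipboard isn't available (for example in a script or a headless session), the same reshape can be sketched from an in-memory string. This is only an illustration: the names raw, long_df and wide_df are made up here, and pd.read_csv with io.StringIO stands in for pd.read_clipboard.

```python
import io

import pandas as pd

# Same rows as the long-form table above, held in a string instead of the clipboard.
raw = """label time value
client1 861.991699 2.981890e+08
client1 862.000768 0.000000e+00
client2 861.781502 0.000000e+00
client2 861.789037 1.536000e+08
client2 862.281483 0.000000e+00
client2 862.289038 1.536000e+08
client3 862.004471 3.295674e+09
client3 862.004564 3.295674e+09
client3 862.039818 7.990149e+08
client3 862.040403 1.599078e+09
client3 862.540270 3.295674e+09
client3 862.551331 1.599078e+09
"""

# Parse whitespace-separated columns, then pivot: one row per timestamp, one column per label.
long_df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
wide_df = long_df.pivot(index="time", columns="label", values="value")
print(wide_df)
```

Note that pivot requires each (time, label) pair to be unique, which holds for this data.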
There are still a few issues to be handled given the somewhat peculiar time index. To make this data plot-friendly, we should handle the missing values. This is easily done in the next snippet by forward and backward filling (see df.fillna in the pandas docs):
# Replace NaNs by forward filling existing values
# (df.ffill() replaces df.fillna(method='ffill'), which newer pandas versions deprecate)
df = df.ffill()
# You'll still have to handle the missing values
# at the beginning of the columns
df = df.bfill()
Now df.plot() gives you a line chart with one line per client.
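One caveat: filling changes what gets drawn, because each series then appears to hold a value at timestamps where it was never measured. If that matters, a possible alternative (my own sketch, not part of the approach above) is to drop the NaNs column by column and plot only the observed points. The small wide_df below is a hypothetical stand-in for the pivoted frame.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# A small wide frame shaped like the pivot result (a subset of the real values).
wide_df = pd.DataFrame(
    {
        "client1": [np.nan, 298189000.0, 0.0, np.nan],
        "client2": [0.0, np.nan, np.nan, 153600000.0],
    },
    index=pd.Index([861.781502, 861.991699, 862.000768, 862.289038], name="time"),
)

fig, ax = plt.subplots()
for name in wide_df.columns:
    # Keep only the timestamps at which this series actually has a value.
    observed = wide_df[name].dropna()
    ax.plot(observed.index, observed.values, marker="o", label=name)
ax.set_xlabel("time")
ax.set_ylabel("value")
ax.legend()
```

Call plt.show() or fig.savefig(...) afterwards to view the result.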
Edit:
Let me know what your data source is, and I can give you a few tips on how to read and store your data. Again, pandas is most likely the way to go.
Upvotes: 1
Reputation: 51335
I like pandas for this type of problem.
First, put the data in a pandas dataframe:
import pandas as pd
data = {'client1': [(861.991698574, 298189000.0), (862.000768158, 0.0)],
'client2': [(861.781502324, 0.0), (861.78903722, 153600000.0),
(862.281483262, 0.0), (862.289038158, 153600000.0)], 'client3':
[(862.004470762, 3295674368.0), (862.004563939, 3295674368.0),
(862.03981821, 799014912.0), (862.040403314, 1599078400.0),
(862.540269616, 3295674368.0), (862.55133097, 1599078400.0)]}
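time = []
label = []
value = []
# flatten the dictionary into three parallel lists, one entry per (time, value) tuple
for k, v in data.items():
    for tup in v:
        label.append(k)
        time.append(tup[0])
        value.append(tup[1])
# column order here matches the output shown below
df = pd.DataFrame({'label':label, 'time':time, 'value':value})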
Resulting in this dataframe:
>>> df
label time value
0 client1 861.991699 2.981890e+08
1 client1 862.000768 0.000000e+00
2 client2 861.781502 0.000000e+00
3 client2 861.789037 1.536000e+08
4 client2 862.281483 0.000000e+00
5 client2 862.289038 1.536000e+08
6 client3 862.004471 3.295674e+09
7 client3 862.004564 3.295674e+09
8 client3 862.039818 7.990149e+08
9 client3 862.040403 1.599078e+09
10 client3 862.540270 3.295674e+09
11 client3 862.551331 1.599078e+09
Then, you can do this:
import matplotlib.pyplot as plt
# draw one line per label on the same axes
by_label = df.groupby('label')
for name, group in by_label:
    plt.plot(group['time'], group['value'], label=name)
plt.legend()
plt.show()
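If you'd rather skip pandas entirely, the dictionary from the question can also be plotted as it is. Here is a minimal sketch, assuming every value is a list of (time, value) tuples as shown in the question; the variable names are only illustrative.

```python
import matplotlib.pyplot as plt

data = {'client1': [(861.991698574, 298189000.0), (862.000768158, 0.0)],
        'client2': [(861.781502324, 0.0), (861.78903722, 153600000.0),
                    (862.281483262, 0.0), (862.289038158, 153600000.0)],
        'client3': [(862.004470762, 3295674368.0), (862.004563939, 3295674368.0),
                    (862.03981821, 799014912.0), (862.040403314, 1599078400.0),
                    (862.540269616, 3295674368.0), (862.55133097, 1599078400.0)]}

fig, ax = plt.subplots()
for name, points in sorted(data.items()):
    # Sort the tuples by time, then split them into one sequence of times and one of values.
    times, values = zip(*sorted(points))
    ax.plot(times, values, label=name)
ax.legend()
```

Each call to ax.plot adds one line to the same axes, so no reshaping is needed. Call plt.show() afterwards to display the figure.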
Regarding how you should store your data in a dictionary: there are different ways to go about this, but to be able to use your data easily with pandas, I would use a dictionary of the form:
data = {'label':['client1', 'client1', 'client2', ...],
'time':[time1, time2, time3, ...],
'value':[value1, value2, value3, ...]}
Make sure all your lists are ordered consistently (index 0 of all three lists is row 0 of your dataframe, index 1 is row 1, and so on). Then, to import into pandas, all you need to do is df = pd.DataFrame(data).
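As a rough sketch of how that dictionary could be filled while parsing, assuming whitespace-separated "time label value" lines like the sample in the question (the lines list below is a stand-in for reading the real file, with the header row already skipped):

```python
import pandas as pd

# Stand-in for the lines of the real file (header row already skipped).
lines = [
    "1.05 seriesA 3.925",
    "1.09 seriesC 0.245",
    "2.13 seriesB 12.32",
    "2.73 seriesC 4.921",
]

data = {"label": [], "time": [], "value": []}
for line in lines:
    time_str, label, value_str = line.split()
    # Append to all three lists together so index i always describes the same row.
    data["label"].append(label)
    data["time"].append(float(time_str))
    data["value"].append(float(value_str))

df = pd.DataFrame(data)
print(df)
```

Because all three lists grow in the same loop iteration, they cannot get out of step with each other.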
Upvotes: 6