Reputation: 551
I want to plot a fairly small IoT CSV dataset, about 2 GB, with dimensions of roughly (20,000, 18,000). Each column should become a subplot with its own y-axis. I use the following code to generate the picture:
import random

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

name = 'subplots.png'  # output file; placeholder, not defined in the original snippet

# build a random (2000, 2000) test frame indexed by timestamps
times = pd.date_range('2012-10-01', periods=2000, freq='2min')
timeseries_array = np.array(times)
cols = random.sample(range(1, 2001), 2000)
values = []
for col in cols:
    values.append(random.sample(range(1, 2001), 2000))
time = pd.DataFrame(data=timeseries_array, columns=['date'])
graph = pd.DataFrame(data=values, columns=cols, index=timeseries_array)

# one subplot per column
fig, axarr = plt.subplots(len(graph.columns), sharex=True, sharey=True,
                          constrained_layout=True, figsize=(50, 50))
fig.autofmt_xdate()
for i, ax in enumerate(axarr):
    ax.plot(time['date'], graph[graph.columns[i]].values)
    ax.set(ylabel=graph.columns[i])
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    myFmt = mdates.DateFormatter('%d.%m.%Y %H:%M')
    ax.xaxis.set_major_formatter(myFmt)
    ax.label_outer()
print('--save-fig--')
plt.savefig(name, dpi=500)
plt.close()
But this is incredibly slow: 100 subplots took about 1 minute, and 2000 took around 20 minutes. My machine has 10 cores and 35 GB of RAM. Do you have any hints for speeding up the process? Is it possible to use multithreading? As far as I can see, this only uses one core. Are there tricks to draw only the relevant things? Or is there an alternative way to draw this plot faster, all in one figure without subplots?
Upvotes: 3
Views: 3534
Reputation: 551
Thanks to @Asmus, I came up with this solution, which brought me down from 20 minutes to 40 seconds for (2000, 2000). I did not find a good, well-documented solution for beginners like me, so I am posting mine here. It is meant for time series with a huge number of columns:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

def print_image_fast(name="default.png", graph=[]):
    int_columns = len(graph.columns)
    # enlarge the figure by 30 inches for every 1000 columns; works well with 500 dpi, label size 2 and line width 0.1
    y_size = (int_columns / 1000) * 30
    fig = plt.figure(figsize=(10, y_size))
    ax = fig.add_subplot(1, 1, 1)
    # set the time formatter for the time series
    myFmt = mdates.DateFormatter('%d.%m.%Y %H:%M')
    ax.xaxis.set_major_formatter(myFmt)
    # store the label offsets
    y_label_offsets = []
    current = 0
    for i, col in enumerate(graph.columns):
        # max height of the previous column
        last = current
        # max value of the current column, and therefore its max height on y
        current = np.amax(graph[col].values)
        if i == 0:
            # y_offset moves the graph along the y-axis; for column 0 the offset is 0
            y_offset = 0
        else:
            # the previous y_offset (accumulated over all columns before) + the previous max + 1 is the new zero point on y for this graph
            y_offset = y_offset + last + 1
        # the label offset is always the current y_offset + half of the height (half of the current max value)
        y_offset_label = y_offset + (current / 2)
        # append the label position to the list
        y_label_offsets.append(y_offset_label)
        # plot the graph shifted by its offset
        ax.plot(graph.index.values, graph[col].values + y_offset,
                'r-o', ms=0.1, mew=0, mfc='r', linewidth=0.1)
    # set the y limits of the chart: last y_offset + full current max is the upper limit
    ax.set_ylim([0, y_offset+current])
    # set the x limits of the time series: first and last value
    ax.set_xlim([graph.index.values[0], graph.index.values[-1]])
    # put the column names at the computed positions on the y-axis
    plt.yticks(y_label_offsets, graph.columns, fontsize=2)
    # print the time labels on the x-axis
    plt.xticks(fontsize=15, rotation=90)
    plt.savefig(name, dpi=500)
    plt.close()
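The running offset in the loop above (each column starts at the previous offset plus the previous column's maximum plus 1) can also be written as a cumulative sum. The following is only a minimal sketch of that bookkeeping, not the code I actually ran; `stacked_offsets` and `demo` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def stacked_offsets(graph):
    """Return (y_offsets, label_positions) for stacking columns in one axes.

    Hypothetical helper: reproduces the loop's running offset with a
    cumulative sum instead of updating it column by column.
    """
    # Per-column maximum, i.e. the height each column needs on the y-axis.
    col_max = graph.max(axis=0).to_numpy(dtype=float)
    # Column 0 starts at 0; column i starts at offset[i-1] + max[i-1] + 1.
    y_offsets = np.concatenate(([0.0], np.cumsum(col_max[:-1] + 1)))
    # Each label sits at half the height of its own column.
    label_positions = y_offsets + col_max / 2
    return y_offsets, label_positions


# Tiny demo frame (made-up values) to check the arithmetic by hand.
demo = pd.DataFrame({"a": [1, 4, 2], "b": [3, 1, 2], "c": [5, 5, 1]})
offsets, labels = stacked_offsets(demo)
```

With offsets in this form, `graph.to_numpy() + offsets` shifts every column at once by broadcasting, and the upper y limit used above corresponds to `offsets[-1]` plus the last column's maximum. Whether this is noticeably faster than the loop was not measured.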
//Edit: For anybody interested, a dataframe with dimensions (20k, 20k) takes up around 20 GB of my RAM. I also had to change the savefig output to SVG, because Agg can't handle dimensions greater than 2^16 pixels.
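The pixel limit follows from the sizing rule in the function: 30 inches per 1000 columns at 500 dpi. A small sketch of that arithmetic, assuming the limit is "less than 2^16 pixels per dimension"; `figure_height_px` and `fits_agg` are hypothetical helper names, not matplotlib functions.

```python
AGG_MAX_PIXELS = 2 ** 16  # per-dimension limit assumed from the note above


def figure_height_px(n_columns, dpi=500, inches_per_1000_columns=30):
    """Pixel height produced by the sizing rule used in print_image_fast."""
    height_inches = n_columns / 1000 * inches_per_1000_columns
    return height_inches * dpi


def fits_agg(n_columns, dpi=500):
    """True if the raster image stays below the assumed Agg limit."""
    return figure_height_px(n_columns, dpi) < AGG_MAX_PIXELS


small = figure_height_px(2000)    # 60 inches at 500 dpi
large = figure_height_px(20000)   # 600 inches at 500 dpi
```

For 2000 columns the figure is 30,000 px tall and still fits; for 20,000 columns it is 300,000 px, far above 65,536. Lowering the dpi or the inches per column would be another way to stay under the limit, at the cost of legibility.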
Upvotes: 1