I have been struggling with an optimization problem in Pandas.
I developed a script that applies a computation to every row of a relatively small DataFrame (a few thousand rows, a few dozen columns). I relied heavily on the apply() function, which was obviously a poor choice in most cases.
After a round of optimization, only one method is still slow, and I haven't found an easy solution for it.
Basically, my dataframe contains video viewing statistics, with the number of people who watched the video up to each quartile (how many watched 0%, 25%, 50%, etc.), such as:
video_name | video_length | video_0 | video_25 | video_50 | video_75 | video_100 |
---|---|---|---|---|---|---|
video_1 | 6 | 1000 | 500 | 300 | 250 | 5 |
video_2 | 30 | 1000 | 500 | 300 | 250 | 5 |
I am trying to interpolate the statistics to answer: "how many people would have watched each quartile of the video if it lasted X seconds?"
Right now my function takes the dataframe and a "new_length" parameter, and calls apply() on each row.
The function which handles each row computes the time marks for each quartile (0, 7.5, 15, 22.5 and 30 for the 30s video), and the time marks for each quartile at the new length (to cut the 30s video down to 6s, the new time marks would be 0, 1.5, 3, 4.5 and 6). It then builds a dataframe with the time marks as index and the stats as values in the first column:
index (time marks) | view_stats |
---|---|
0 | 1000 |
7.5 | 500 |
15 | 300 |
22.5 | 250 |
30 | 5 |
1.5 | NaN |
3 | NaN |
4.5 | NaN |
I then call DataFrame.interpolate(method="index") to fill the NaN values.
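The per-row procedure above can be sketched in isolation for the 30s example row (a minimal, self-contained illustration; the variable names are not from my actual code):

```python
import numpy as np
import pandas as pd

# Old quartile time marks for the 30s video, and the new marks for 6s.
old_marks = [0, 7.5, 15, 22.5, 30]
new_marks = [0, 1.5, 3, 4.5, 6]

# Stats at the old marks, then NaN rows for the new marks.
stats = pd.DataFrame({"view_stats": [1000, 500, 300, 250, 5]}, index=old_marks)
stats = stats.combine_first(
    pd.DataFrame({"view_stats": np.nan},
                 index=[m for m in new_marks if m not in old_marks])
)

# Fill the NaNs by linear interpolation against the index values.
stats = stats.interpolate(method="index")
print(stats.loc[new_marks, "view_stats"])
```

This is exactly one small dataframe built and interpolated per row, which is why it is so costly inside apply().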
It works and gives me the result I expect, but it takes a whopping 11s for a 3,000-row dataframe. I believe this is due to the combination of apply() and the creation of a new dataframe to interpolate the data for each row.
Is there an obvious way to achieve the same result "in place", e.g. by avoiding the apply / new-dataframe approach and working directly on the original dataframe?
EDIT: The expected output when calling the function with 6 as the new length parameter would be:
video_name | video_length | video_0 | video_25 | video_50 | video_75 | video_100 | new_video_0 | new_video_25 | new_video_50 | new_video_75 | new_video_100 |
---|---|---|---|---|---|---|---|---|---|---|---|
video_1 | 6 | 1000 | 500 | 300 | 250 | 5 | 1000 | 500 | 300 | 250 | 5 |
video_2 | 6 | 1000 | 500 | 300 | 250 | 5 | 1000 | 900 | 800 | 700 | 600 |
The first row would be untouched because the video is already 6s long. In the second row, the video would be cut from 30s to 6s, so the new quartile marks would be at 0, 1.5, 3, 4.5 and 6s, and the stats would be interpolated between 1000 and 500, the values at the old 0% and 25% time marks.
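Those expected numbers can be checked by hand: all of the new marks fall inside the first old segment [0, 7.5], where the count drops linearly from 1000 to 500 (the `lerp` helper below is purely illustrative, not part of my code):

```python
# Linear interpolation between two known points (x0, y0) and (x1, y1).
def lerp(x, x0, x1, y0, y1):
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# New marks 0, 1.5, 3, 4.5, 6 all sit in the old [0, 7.5] segment,
# whose endpoint values are 1000 (0% mark) and 500 (25% mark).
print([lerp(x, 0, 7.5, 1000, 500) for x in [0, 1.5, 3, 4.5, 6]])
# → [1000.0, 900.0, 800.0, 700.0, 600.0]
```

The values drop by 100 for every 1.5s step, matching the expected output above.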
EDIT2: I do not care if I need to add temporary columns; time is an issue, memory is not.
For reference, this is my code:

```python
import math

import pandas
from numpy import nan as NaN  # np.NaN was removed in NumPy 2.0


def get_value(marks, asset, mark_index) -> int:
    value = marks["count"][asset["new_length_marks"][mark_index]]
    if isinstance(value, pandas.Series):
        res = value.iloc[0]  # iloc is indexed with brackets, not called
    else:
        res = value
    return math.ceil(res)


def length_update_row(row, assets, **kwargs):
    asset_name = row["asset_name"]
    # assets is a dict containing, for each file, the pre-calculated
    # old and "new" video time marks
    asset = assets[asset_name]
    marks = pandas.DataFrame(
        data=[int(row["video_start"]), int(row["video_25"]), int(row["video_50"]),
              int(row["video_75"]), int(row["video_completed"])],
        columns=["count"],
        index=asset["old_length_marks"],
    )
    marks = marks.combine_first(
        pandas.DataFrame(data=NaN, columns=["count"],
                         index=asset["new_length_marks"][1:])
    )
    marks = marks.interpolate(method="index")
    row["video_25"] = get_value(marks, asset, 1)
    row["video_50"] = get_value(marks, asset, 2)
    row["video_75"] = get_value(marks, asset, 3)
    row["video_completed"] = get_value(marks, asset, 4)
    return row


def length_update_stats(report: pandas.DataFrame,
                        assets: dict) -> pandas.DataFrame:
    return report.apply(lambda row: length_update_row(row, assets), axis=1)
```

---

IIUC, you could use np.interp:
```python
import numpy as np
import pandas as pd

# get the old x values (the time marks for each row)
xs = df['video_length'].values[:, None] * [0, 0.25, 0.50, 0.75, 1]
# the corresponding y values
ys = df.iloc[:, 2:].values
# note that 6 is the new length, repeated once per row (here: 2 rows)
nxs = np.repeat(np.array(6), 2)[:, None] * [0, 0.25, 0.50, 0.75, 1]
res = pd.DataFrame(
    data=np.array([np.interp(nxi, xi, yi) for nxi, xi, yi in zip(nxs, xs, ys)]),
    columns="new_" + df.columns[2:],
)
print(res)
```
Output

```
   new_video_0  new_video_25  new_video_50  new_video_75  new_video_100
0       1000.0         500.0         300.0         250.0            5.0
1       1000.0         900.0         800.0         700.0          600.0
```
And then concat across the second axis:

```python
output = pd.concat((df, res), axis=1)
print(output)
```
Output (concat)

```
  video_name  video_length  video_0  ...  new_video_50  new_video_75  new_video_100
0    video_1             6     1000  ...         300.0         250.0            5.0
1    video_2            30     1000  ...         800.0         700.0          600.0

[2 rows x 12 columns]
```
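If even the Python-level loop over rows becomes a bottleneck, it can be dropped entirely: every row's marks are just its length times the fixed fractions [0, 0.25, 0.5, 0.75, 1], so each new mark's position in the old grid depends only on the ratio new_length / video_length. A fully vectorized sketch (`interp_views` is an illustrative name; it assumes, like np.interp, that values are clamped when the new length exceeds a video's length):

```python
import numpy as np
import pandas as pd

def interp_views(df: pd.DataFrame, new_length: float) -> np.ndarray:
    q = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    ys = df.iloc[:, 2:].to_numpy(dtype=float)  # (rows, 5) view counts
    # Ratio of new to old length, clipped at 1 to clamp like np.interp.
    ratio = np.minimum(new_length / df["video_length"].to_numpy(dtype=float), 1.0)
    pos = ratio[:, None] * (q / 0.25)           # position in old quartile segments, 0..4
    idx = np.clip(np.floor(pos).astype(int), 0, 3)  # left endpoint of each segment
    frac = pos - idx                            # fractional position inside the segment
    left = np.take_along_axis(ys, idx, axis=1)
    right = np.take_along_axis(ys, idx + 1, axis=1)
    return left + frac * (right - left)

df = pd.DataFrame({
    "video_name": ["video_1", "video_2"],
    "video_length": [6, 30],
    "video_0": [1000, 1000], "video_25": [500, 500],
    "video_50": [300, 300], "video_75": [250, 250], "video_100": [5, 5],
})
print(interp_views(df, 6))
```

This reproduces the np.interp result above (up to floating-point rounding) with no per-row Python work, which should scale comfortably to thousands of rows.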