I have been struggling with an optimization problem in Pandas.
I developed a script that applies a computation to every row of a relatively small DataFrame (a few thousand rows, a few dozen columns). I relied heavily on the apply() function, which was obviously a poor choice in most cases.
After a round of optimization, only one method is still slow, and I haven't found an easy solution for it.
Basically, my dataframe contains video viewing statistics, with the number of people who watched the video up to each quartile (how many watched 0%, 25%, 50%, etc.), such as:
video_name | video_length | video_0 | video_25 | video_50 | video_75 | video_100 |
---|---|---|---|---|---|---|
video_1 | 6 | 1000 | 500 | 300 | 250 | 5 |
video_2 | 30 | 1000 | 500 | 300 | 250 | 5 |
I am trying to interpolate the statistics to answer: "how many people would have watched each quartile of the video if it lasted X seconds?"
Right now my function takes the dataframe and a "new_length" parameter, and calls apply() on each row.
The function which handles each row computes the time marks for each quartile (0, 7.5, 15, 22.5 and 30 for the 30s video), and the time marks for each quartile at the new length (to cut the 30s video down to 6s, the new time marks would be 0, 1.5, 3, 4.5 and 6). It then builds a dataframe with the time marks as index and the stats as values in the first column:
index (time marks) | view_stats |
---|---|
0 | 1000 |
7.5 | 500 |
15 | 300 |
22.5 | 250 |
30 | 5 |
1.5 | NaN |
3 | NaN |
4.5 | NaN |
I then call DataFrame.interpolate(method="index") to fill the NaN values.
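The per-row procedure above can be sketched in isolation for the 30s example row (a minimal, self-contained illustration; the variable names are not from my actual code):

```python
import numpy as np
import pandas as pd

# Old quartile time marks for the 30s video, and the new marks for 6s.
old_marks = [0, 7.5, 15, 22.5, 30]
new_marks = [0, 1.5, 3, 4.5, 6]

# Stats at the old marks, then NaN rows for the new marks.
stats = pd.DataFrame({"view_stats": [1000, 500, 300, 250, 5]}, index=old_marks)
stats = stats.combine_first(
    pd.DataFrame({"view_stats": np.nan},
                 index=[m for m in new_marks if m not in old_marks])
)

# Fill the NaNs by linear interpolation against the index values.
stats = stats.interpolate(method="index")
print(stats.loc[new_marks, "view_stats"])
```

This is exactly one small dataframe built and interpolated per row, which is why it is so costly inside apply().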
It works and gives me the result I expect, but it takes a whopping 11s for a 3,000-row dataframe. I believe this is due to the combination of apply() and the creation of a new dataframe to interpolate the data for each row.
Is there an obvious way to achieve the same result "in place", e.g. by avoiding the apply / new-dataframe approach and working directly on the original dataframe?
EDIT: The expected output when calling the function with 6 as the new length parameter would be:
video_name | video_length | video_0 | video_25 | video_50 | video_75 | video_100 | new_video_0 | new_video_25 | new_video_50 | new_video_75 | new_video_100 |
---|---|---|---|---|---|---|---|---|---|---|---|
video_1 | 6 | 1000 | 500 | 300 | 250 | 5 | 1000 | 500 | 300 | 250 | 5 |
video_2 | 6 | 1000 | 500 | 300 | 250 | 5 | 1000 | 900 | 800 | 700 | 600 |
The first row would be untouched because the video is already 6s long. In the second row, the video would be cut from 30s to 6s, so the new quartile marks would be at 0, 1.5, 3, 4.5 and 6s, and the stats would be interpolated between 1000 and 500, the values at the old 0% and 25% time marks.
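Those expected numbers can be checked by hand: all of the new marks fall inside the first old segment [0, 7.5], where the count drops linearly from 1000 to 500 (the `lerp` helper below is purely illustrative, not part of my code):

```python
# Linear interpolation between two known points (x0, y0) and (x1, y1).
def lerp(x, x0, x1, y0, y1):
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# New marks 0, 1.5, 3, 4.5, 6 all sit in the old [0, 7.5] segment,
# whose endpoint values are 1000 (0% mark) and 500 (25% mark).
print([lerp(x, 0, 7.5, 1000, 500) for x in [0, 1.5, 3, 4.5, 6]])
# → [1000.0, 900.0, 800.0, 700.0, 600.0]
```

The values drop by 100 for every 1.5s step, matching the expected output above.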
EDIT2: I do not care if I need to add temporary columns; time is an issue, memory is not.
For reference, this is my code:

```python
import math

import pandas
from numpy import nan as NaN  # np.NaN was removed in NumPy 2.0


def get_value(marks, asset, mark_index) -> int:
    value = marks["count"][asset["new_length_marks"][mark_index]]
    if isinstance(value, pandas.Series):
        res = value.iloc[0]  # iloc is indexed with brackets, not called
    else:
        res = value
    return math.ceil(res)


def length_update_row(row, assets, **kwargs):
    asset_name = row["asset_name"]
    # assets is a dict containing, for each file, the pre-calculated
    # old and "new" video time marks
    asset = assets[asset_name]
    marks = pandas.DataFrame(
        data=[int(row["video_start"]), int(row["video_25"]), int(row["video_50"]),
              int(row["video_75"]), int(row["video_completed"])],
        columns=["count"],
        index=asset["old_length_marks"],
    )
    marks = marks.combine_first(
        pandas.DataFrame(data=NaN, columns=["count"],
                         index=asset["new_length_marks"][1:])
    )
    marks = marks.interpolate(method="index")
    row["video_25"] = get_value(marks, asset, 1)
    row["video_50"] = get_value(marks, asset, 2)
    row["video_75"] = get_value(marks, asset, 3)
    row["video_completed"] = get_value(marks, asset, 4)
    return row


def length_update_stats(report: pandas.DataFrame,
                        assets: dict) -> pandas.DataFrame:
    return report.apply(lambda row: length_update_row(row, assets), axis=1)
```

---

IIUC, you could use np.interp:
```python
import numpy as np
import pandas as pd

# get the old x values (the time marks for each row)
xs = df['video_length'].values[:, None] * [0, 0.25, 0.50, 0.75, 1]
# the corresponding y values
ys = df.iloc[:, 2:].values
# note that 6 is the new length, repeated once per row (here: 2 rows)
nxs = np.repeat(np.array(6), 2)[:, None] * [0, 0.25, 0.50, 0.75, 1]
res = pd.DataFrame(
    data=np.array([np.interp(nxi, xi, yi) for nxi, xi, yi in zip(nxs, xs, ys)]),
    columns="new_" + df.columns[2:],
)
print(res)
```
Output

```
   new_video_0  new_video_25  new_video_50  new_video_75  new_video_100
0       1000.0         500.0         300.0         250.0            5.0
1       1000.0         900.0         800.0         700.0          600.0
```
And then concat across the second axis:

```python
output = pd.concat((df, res), axis=1)
print(output)
```
Output (concat)

```
  video_name  video_length  video_0  ...  new_video_50  new_video_75  new_video_100
0    video_1             6     1000  ...         300.0         250.0            5.0
1    video_2            30     1000  ...         800.0         700.0          600.0

[2 rows x 12 columns]
```
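If even the Python-level loop over rows becomes a bottleneck, it can be dropped entirely: every row's marks are just its length times the fixed fractions [0, 0.25, 0.5, 0.75, 1], so each new mark's position in the old grid depends only on the ratio new_length / video_length. A fully vectorized sketch (`interp_views` is an illustrative name; it assumes, like np.interp, that values are clamped when the new length exceeds a video's length):

```python
import numpy as np
import pandas as pd

def interp_views(df: pd.DataFrame, new_length: float) -> np.ndarray:
    q = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    ys = df.iloc[:, 2:].to_numpy(dtype=float)  # (rows, 5) view counts
    # Ratio of new to old length, clipped at 1 to clamp like np.interp.
    ratio = np.minimum(new_length / df["video_length"].to_numpy(dtype=float), 1.0)
    pos = ratio[:, None] * (q / 0.25)           # position in old quartile segments, 0..4
    idx = np.clip(np.floor(pos).astype(int), 0, 3)  # left endpoint of each segment
    frac = pos - idx                            # fractional position inside the segment
    left = np.take_along_axis(ys, idx, axis=1)
    right = np.take_along_axis(ys, idx + 1, axis=1)
    return left + frac * (right - left)

df = pd.DataFrame({
    "video_name": ["video_1", "video_2"],
    "video_length": [6, 30],
    "video_0": [1000, 1000], "video_25": [500, 500],
    "video_50": [300, 300], "video_75": [250, 250], "video_100": [5, 5],
})
print(interp_views(df, 6))
```

This reproduces the np.interp result above (up to floating-point rounding) with no per-row Python work, which should scale comfortably to thousands of rows.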