LLVM not optimizing composition of functions in IR

I have an LLVM-based JIT compiler and am having trouble optimizing functions that follow the pattern func1(func2(x)). The issue can be demonstrated with the following example:

#include <cmath>


extern "C" double transform_4326_900913_x(const double x) {
  return x * 111319.490778;
}

extern "C" double transform_4326_900913_y(const double y) {
  return 6378136.99911 * tan(log(.00872664626 * y + .785398163397));
}

extern "C" void transform(double* out, const double x_in, const double y_in) {
    out[0] = transform_4326_900913_x(x_in);
    out[1] = transform_4326_900913_y(y_in);
}

double program(const double x_in, const double y_in) {
    double arr[2];
    transform(arr, x_in, y_in);
    return arr[0];
}

If I compile this code to LLVM IR with clang, without fast math, I get:

  %3 = fmul double %0, 0x40FB2D77DA3A083A, !dbg !416
  call void @llvm.dbg.value(metadata double %3, metadata !407, metadata !DIExpression(DW_OP_LLVM_fragment, 0, 64)), !dbg !411
  call void @llvm.dbg.value(metadata double %1, metadata !367, metadata !DIExpression()) #4, !dbg !417
  %4 = fmul double %1, 0x3F81DF46A252DD11, !dbg !419
  %5 = fadd double %4, 0x3FE921FB54441D52, !dbg !420
  %6 = tail call double @log(double %5) #4, !dbg !421
  %7 = tail call double @tan(double %6) #4, !dbg !422
}

This is expected -- LLVM does not know whether the calls to the standard library functions log and tan can be removed without affecting the observable behavior of the program (they may set errno, for example). If I switch to -ffast-math, everything is nicely optimized:

define dso_local double @_Z7programdd(double %0, double %1) local_unnamed_addr #1 !dbg !398 {
  %3 = fmul fast double %0, 0x40FB2D77DA3A083A, !dbg !413
  ret double %3, !dbg !414
}
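
For reference, output along these lines can be reproduced with a plain clang invocation; the optimization level used here isn't stated, so -O2 below is an assumption, and program.cpp is just a placeholder file name:

clang++ -O2 -S -emit-llvm program.cpp -o program.ll
clang++ -O2 -ffast-math -S -emit-llvm program.cpp -o program_fastmath.ll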

I have a similar problem in my system, following the same pattern as above. Consider the following IR:

ST_X_nullcheck_false:                             ; preds = %groupby_nullcheck_true
  %10 = call i8* @array_buff(i8* %col_buf0, i64 %pos) #14
  %ST_Transform_Array2 = alloca [2 x double], align 8
  %ST_Transform_Array2.sub = getelementptr inbounds [2 x double], [2 x double]* %ST_Transform_Array2, i64 0, i64 0
  %11 = bitcast i8* %10 to i32*
  %compressed_x_coord = load i32, i32* %11, align 4
  %12 = call double @decompress_x_coord_geoint(i32 %compressed_x_coord) #14
  store double %12, double* %ST_Transform_Array2.sub, align 8
  %13 = getelementptr i8, i8* %10, i64 4
  %14 = bitcast i8* %13 to i32*
  %compressed_y_coord = load i32, i32* %14, align 4
  %15 = call double @decompress_y_coord_geoint(i32 %compressed_y_coord) #14
  %16 = getelementptr inbounds [2 x double], [2 x double]* %ST_Transform_Array2, i64 0, i64 1
  store double %15, double* %16, align 8
  %17 = fmul double %12, 0x40FB2D77DA3A083A
  store double %17, double* %ST_Transform_Array2.sub, align 8
  %18 = fmul double %15, 0x3F81DF46A252DD11
  %19 = fadd double %18, 0x3FE921FB54441D52
  %20 = call double @Tan(double %19) #15
  %21 = call double @ln(double %20) #15
  %22 = fmul double %21, 0x415854A63FF16B12
  store double %22, double* %16, align 8
  %x_coord2 = load double, double* %ST_Transform_Array2.sub, align 8
  br label %filter_false

filter_false:                                     ; preds = %groupby_nullcheck_true, %ST_X_nullcheck_false
  %ST_X_nullcheck_value = phi double [ %x_coord2, %ST_X_nullcheck_false ], [ 0x10000000000000, %groupby_nullcheck_true ]
  %3 = getelementptr inbounds i64, i64* %6, i64 1
  call void @agg_id_double_shared(i64* nonnull %3, double %ST_X_nullcheck_value)
  ret i32 0

Here, I have manually added the attributes ({ nobuiltin nofree norecurse nosync nounwind readnone willreturn }) to the @Tan and @ln functions, and neither of the outputs from those functions is used later in the function. If I remove either one of the two calls, I get the desired optimization -- the entire "y" branch is removed, the alloca is converted to a scalar value, and the scalar is read by the phi in filter_false. But something about the composition of the functions seems to be preventing instruction combining from culling the IR. I've played with a number of combinations of attributes and/or passes, but am beginning to suspect I've overlooked something more basic.
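
A minimal sketch of how attributes like these can be added through the C++ API (this is not the actual annotation pass used here; the helper and its name are made up, the attribute set just mirrors the list above, and the exact spellings vary across LLVM versions -- newer releases model readnone as memory(none)):

#include <llvm/ADT/StringRef.h>
#include <llvm/IR/Attributes.h>
#include <llvm/IR/Function.h>
#include <llvm/IR/Module.h>

// Hypothetical helper: mark a runtime math function as side-effect free so
// that calls whose results are unused can be deleted.
static void markAsPure(llvm::Module& module, llvm::StringRef name) {
  llvm::Function* fn = module.getFunction(name);
  if (!fn) {
    return;
  }
  fn->addFnAttr(llvm::Attribute::NoBuiltin);
  fn->addFnAttr(llvm::Attribute::NoFree);
  fn->addFnAttr(llvm::Attribute::NoRecurse);
  fn->addFnAttr(llvm::Attribute::NoSync);
  fn->addFnAttr(llvm::Attribute::NoUnwind);
  fn->addFnAttr(llvm::Attribute::ReadNone);
  fn->addFnAttr(llvm::Attribute::WillReturn);
}

// e.g. markAsPure(*module, "Tan"); markAsPure(*module, "ln");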

Answers (1)

I was able to resolve this using only existing LLVM passes.

The key was to add the NewGVN LLVM pass prior to instruction combining. NewGVN removed the load here:

  %x_coord2 = load double, double* %ST_Transform_Array2.sub, align 8

and instead used the output of the multiplication instruction directly in the corresponding PHI node. This allowed the instruction combining and dead code elimination passes to mark first the store instructions, and then the function calls, as dead. The minimal set of passes needed for this optimization (after the custom function annotation pass) ended up being:

pass_manager.add(llvm::createSROAPass());
// mem ssa drops unused load and store instructions, e.g. passing variables directly
// where possible
pass_manager.add(
    llvm::createEarlyCSEPass(/*enable_mem_ssa=*/true));  // Catch trivial redundancies

pass_manager.add(llvm::createJumpThreadingPass());  // Thread jumps.
pass_manager.add(llvm::createCFGSimplificationPass());

pass_manager.add(llvm::createNewGVNPass());

pass_manager.add(llvm::createInstructionCombiningPass());
The passes must be run in the order shown above.
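
For completeness, a minimal sketch of how that sequence can be wired up with the legacy pass manager (this assumes a fully populated llvm::Module named module, omits the rest of the JIT setup, and the function name is made up):

#include <llvm/IR/LegacyPassManager.h>
#include <llvm/IR/Module.h>
#include <llvm/Transforms/InstCombine/InstCombine.h>
#include <llvm/Transforms/Scalar.h>

// Run the pass sequence above over a module with the legacy pass manager.
void runOptimizationPasses(llvm::Module& module) {
  llvm::legacy::PassManager pass_manager;
  pass_manager.add(llvm::createSROAPass());
  pass_manager.add(llvm::createEarlyCSEPass(/*enable_mem_ssa=*/true));
  pass_manager.add(llvm::createJumpThreadingPass());
  pass_manager.add(llvm::createCFGSimplificationPass());
  pass_manager.add(llvm::createNewGVNPass());
  pass_manager.add(llvm::createInstructionCombiningPass());
  pass_manager.run(module);
}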
