Kao

Reputation: 7364

How to exclude headers from AST in clang?

I'm generating an AST using clang. I have the following file (lambda.cpp) to parse:

#include <iostream>

void my_lambda()
{
    auto lambda = [](auto x, auto y) {return x + y;};
    std::cout << "fabricati diem"; 
}

I'm parsing it with the following command:

clang -Xclang -ast-dump -fsyntax-only lambda.cpp

The problem is that clang also parses the contents of the headers. As a result, I get a rather large (~3000 lines) dump, most of which is useless to me.

How can I exclude headers when generating the AST?

Upvotes: 26

Views: 12864

Answers (5)

Razvi

Reputation: 398

I'm facing the same problem. My context is that I need to parse the AST in JSON format, and I'd like to get rid of all the headers and unnecessary files. I tried to replicate @textshell's answer (https://stackoverflow.com/a/69150479/3267980), but I noticed that clang behaves differently in my case. The clang version I'm using is:

$ clang --version                                             
Debian clang version 13.0.1-+rc1-1~exp4
Target: x86_64-pc-linux-gnu
Thread model: posix

To explain my case, let's consider the following example:

[Image: AST dump showing the FunctionDecl nodes of my_function and main]

Both my_function and main are functions from the same source file (function_definition_invocation.c). However, the file name is only specified in the FunctionDecl node of my_function. I presume this is because both functions belong to the same file, and clang prints the file location only on the first node that belongs to it.

Once the first occurrence of the main file is found, every subsequent node should be added to the resulting, filtered JSON file. The code I'm using is:

def filter_ast_only_source_file(source_file, json_ast):
    """Keep only the top-level nodes from the first one located in source_file onwards (modifies json_ast in place)."""
    new_inner = []
    first_occurrence_of_main_file = False
    for entry in json_ast['inner']:
        if not first_occurrence_of_main_file:
            # Compiler-generated declarations are skipped
            if entry.get('isImplicit', False):
                continue

            file_name = None
            loc = entry.get('loc', {})
            if 'file' in loc:
                file_name = loc['file']

            # For macro expansions, the file is nested under expansionLoc
            if 'expansionLoc' in loc:
                if 'file' in loc['expansionLoc']:
                    file_name = loc['expansionLoc']['file']

            if file_name != source_file:
                continue

            new_inner.append(entry)
            first_occurrence_of_main_file = True
        else:
            # Everything after the first main-file node is kept
            new_inner.append(entry)

    json_ast['inner'] = new_inner

And I call it like this:

import json
import subprocess

# stdout is captured as bytes; json.loads accepts bytes directly
generated_ast = subprocess.run(["clang", "-Xclang", "-ast-dump=json", "-fsyntax-only", source_file], capture_output=True)
# Parse the output into a JSON object
json_ast = json.loads(generated_ast.stdout)
filter_ast_only_source_file(source_file, json_ast)

So far it seems to be working.
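The observation above suggests that the dump omits file whenever it equals the file of the previously printed location. Under that assumption (taken from the behaviour described here, not checked against clang's documentation), a variant can track the most recently seen file while walking the dump, so that declarations from a header included after the first main-file declaration are dropped as well. This is only a sketch: update_last_file and filter_by_tracked_file are hypothetical names, and skipping an includedFrom entry is an assumption about the dump layout.

```python
def update_last_file(value, state):
    """Walk a JSON value in document order and remember the last 'file' seen."""
    if isinstance(value, dict):
        for key, child in value.items():
            if key == "includedFrom":
                # Assumption: this names the including file, not the node's own file
                continue
            if key == "file" and isinstance(child, str):
                state["file"] = child
            else:
                update_last_file(child, state)
    elif isinstance(value, list):
        for child in value:
            update_last_file(child, state)


def filter_by_tracked_file(source_file, json_ast):
    """Keep top-level nodes whose (possibly inherited) file is source_file."""
    state = {"file": None}
    kept = []
    for entry in json_ast.get("inner", []):
        # The node's own location comes first and decides which file it is in
        update_last_file(entry.get("loc", {}), state)
        entry_file = state["file"]
        # Nested locations may change the file again, so keep tracking them
        for key, child in entry.items():
            if key != "loc":
                update_last_file(child, state)
        if entry.get("isImplicit", False):
            continue
        if entry_file == source_file:
            kept.append(entry)
    json_ast["inner"] = kept
    return json_ast
```

It is called the same way as filter_ast_only_source_file, with the source file name exactly as it appears in the dump.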

Upvotes: 3

textshell

Reputation: 2076

The dumped AST has some indication of source file for every node. So the dumped AST can be filtered based on the loc data of the second level AST nodes.

You need to match file in loc and file in expansionLoc in loc against the name of the top-level file. This seems to work decently for me. Some of the nodes don't contain these elements for some reason. Nodes with isImplicit should be safe to skip, but I'm not sure what is going on with other nodes that lack file name information.

The following Python script filters 'astdump.json' to 'astdump.filtered.json' using these rules (doing the conversion in a streaming manner is left as an exercise for the reader):

#! /usr/bin/python3

import json
import sys

if len(sys.argv) != 2:
    print('Usage: ' + sys.argv[0] + ' filename')
    sys.exit(1)

# Name of the top-level source file, exactly as it appears in the dump
filename = sys.argv[1]

with open('astdump.json', 'rb') as infile, open('astdump.filtered.json', 'w') as output:
    toplevel = json.load(infile)
    new_inner = []
    for o in toplevel['inner']:
        if o.get('isImplicit', False):
            continue

        file_name = None
        loc = o.get('loc', {})
        if 'file' in loc:
            file_name = loc['file']

        if 'expansionLoc' in loc:
            if 'file' in loc['expansionLoc']:
                file_name = loc['expansionLoc']['file']

        if file_name != filename:
            continue

        new_inner.append(o)

    toplevel['inner'] = new_inner
    json.dump(toplevel, output, indent=4)

Upvotes: 0

towi

Reputation: 22267

Filtering on a specific identifier using -ast-dump-filter is fine. But what if you want the AST for all identifiers in one file?

I came up with the following solution:

Add one recognizable line after the includes:

#include <iostream>
int XX_MARKER_XX = 123234; // marker line for ast-dump
void my_lambda()
...

Then dump the AST with

clang-check -extra-arg=-std=c++1y -ast-dump lambda.cpp > ast.txt

You can easily cut away everything before XX_MARKER_XX with sed:

sed -n '/XX_MARKER_XX/,$p' ast.txt | less

Still a lot, but much more useful with bigger files.
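If sed is not available, the same cut can be sketched in Python. The function name cut_before_marker is hypothetical, and the sample lines are hand-written stand-ins for a real dump, not actual clang output.

```python
def cut_before_marker(dump_lines, marker="XX_MARKER_XX"):
    """Return the lines from the first one containing marker to the end."""
    for index, line in enumerate(dump_lines):
        if marker in line:
            return dump_lines[index:]
    # Marker never seen: like sed -n '/marker/,$p', produce nothing
    return []


# Hand-written sample standing in for the contents of ast.txt
sample = [
    "TranslationUnitDecl 0x1 <<invalid sloc>> <invalid sloc>",
    "|-NamespaceDecl 0x2 <header.h:1:1, line:9:1> line:1:11 std",
    "|-VarDecl 0x3 <lambda.cpp:2:1, col:20> col:5 XX_MARKER_XX 'int' cinit",
    "`-FunctionDecl 0x4 <line:3:1, line:7:1> line:3:6 my_lambda 'void ()'",
]
for line in cut_before_marker(sample):
    print(line)
```

For a real dump, the list could come from reading ast.txt with readlines(). Note that this does a plain substring match, whereas sed treats the marker as a regular expression; for XX_MARKER_XX the two are equivalent.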

Upvotes: 4

BeyelerStudios

Reputation: 4283

This is a problem with C++ not with clang: there are no files in C++, there's just the compilation unit. When you #include a file you include all definitions in said file (recursively) into your compilation unit and there's no way to differentiate them (it's what the standard expects your compiler to do).

Imagine a different scenario:

/////////////////////////////
// headertmp.h
#if defined(A)
    struct Foo {
        int bar;
    };
#elif defined(B)
    struct Foo {
        short bar;
    };
#endif

/////////////////////////////
// foobar.cpp
#ifndef A
# define B
#endif

#include "headertmp.h"

void foobar(Foo foo) {
    // do stuff to foo.bar
}

Your foobar.cpp declares a struct called Foo and a function called foobar but headertmp.h itself doesn't define any Foo unless A or B are defined. Only in the compilation unit of foobar where the two come together can you make sense of headertmp.h.

If you are interested in a subset of the declarations inside a compilation unit, you will have to extract the necessary information from the generated AST directly (similar to what a linker has to do when linking together different compilation units). Of course you can then filter the AST of this compilation unit on any metadata your parser extracts.

Upvotes: 1

Alper

Reputation: 13220

clang-check might be useful here. It has an option, -ast-dump-filter=<string>, documented as follows:

-ast-dump-filter=<string> - Use with -ast-dump or -ast-print to dump/print only AST declaration nodes having a certain substring in a qualified name. Use -ast-list to list all filterable declaration node names.

When clang-check is run with -ast-dump-filter=my_lambda on the sample code (lambda.cpp)

#include <iostream>

void my_lambda()
{
    auto lambda = [](auto x, auto y) {return x + y;};
    std::cout << "fabricati diem"; 
}

it dumps only the matched declaration node, FunctionDecl my_lambda 'void (void)'.

Here are the command-line arguments and a few lines of the output.

$ clang-check -extra-arg=-std=c++1y -ast-dump -ast-dump-filter=my_lambda lambda.cpp --

FunctionDecl 0x2ddf630 <lambda.cpp:3:1, line:7:1> line:3:6 my_lambda 'void (void)'
`-CompoundStmt 0x2de1558 <line:4:1, line:7:1>
  |-DeclStmt 0x2de0960 <line:5:9, col:57>
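When the same filter is needed from a script, the command line above can be assembled in Python. This is a minimal sketch that only reuses the flags shown in this answer. The function names are hypothetical, and dump_filtered_ast assumes clang-check is on PATH.

```python
import subprocess


def clang_check_filter_command(source_file, name_substring, std="c++1y"):
    """Build the clang-check argument list used above for one filter substring."""
    return [
        "clang-check",
        f"-extra-arg=-std={std}",
        "-ast-dump",
        f"-ast-dump-filter={name_substring}",
        source_file,
        "--",
    ]


def dump_filtered_ast(source_file, name_substring):
    """Run clang-check and return its textual dump (raises if the tool fails)."""
    command = clang_check_filter_command(source_file, name_substring)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Show the command that would be run for the sample file
print(" ".join(clang_check_filter_command("lambda.cpp", "my_lambda")))
```

Passing the arguments as a list avoids shell quoting issues if the file name or filter string contains spaces.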

Upvotes: 19
