Reputation: 7364
I'm generating an AST using clang. I have the following file (lambda.cpp) to parse:
#include <iostream>
void my_lambda()
{
    auto lambda = [](auto x, auto y) {return x + y;};
    std::cout << "fabricati diem";
}
I'm parsing it with the following command:
clang -Xclang -ast-dump -fsyntax-only lambda.cpp
The problem is that clang also parses the contents of the headers. As a result, I get a fairly large (~3000 lines) dump full of content that is useless to me.
How can I exclude headers when generating the AST?
Upvotes: 26
Views: 12864
Reputation: 398
I'm facing the same problem. My context is that I need to parse the AST in JSON format, and I'd like to get rid of all the headers and unnecessary files. I tried to replicate @textshell's answer (https://stackoverflow.com/a/69150479/3267980), but I noticed that Clang behaves differently in my case. The Clang version I'm using is:
$ clang --version
Debian clang version 13.0.1-+rc1-1~exp4
Target: x86_64-pc-linux-gnu
Thread model: posix
To explain my case, let's consider the following example:
Both my_function and main are functions from the same source file (function_definition_invocation.c). However, the file is only specified in the FunctionDecl node of my_function. I presume this is because both functions belong to the same file, and Clang prints the file location only in the first node that belongs to it.
Once the first occurrence of the main file is found, every subsequent node is added to the filtered JSON. The code I'm using is:
def filter_ast_only_source_file(source_file, json_ast):
    # Keep the first top-level node located in source_file and everything after it.
    new_inner = []
    first_occurrence_of_main_file = False
    for entry in json_ast['inner']:
        if not first_occurrence_of_main_file:
            # Implicit (compiler-generated) nodes are skipped
            if entry.get('isImplicit', False):
                continue
            file_name = None
            loc = entry.get('loc', {})
            if 'file' in loc:
                file_name = loc['file']
            # For macro expansions, the file is stored under expansionLoc
            if 'expansionLoc' in loc:
                if 'file' in loc['expansionLoc']:
                    file_name = loc['expansionLoc']['file']
            if file_name != source_file:
                continue
            new_inner.append(entry)
            first_occurrence_of_main_file = True
        else:
            new_inner.append(entry)
    # The AST is modified in place
    json_ast['inner'] = new_inner
And I call it like this:
import json
import subprocess
# The captured output is in bytes; decode it if a string is needed
generated_ast = subprocess.run(["clang", "-Xclang", "-ast-dump=json", source_file], capture_output=True)
# Parse the output into a JSON object
json_ast = json.loads(generated_ast.stdout)
filter_ast_only_source_file(source_file, json_ast)
So far it seems to be working.
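The approach above stops checking once the main file has been seen, so declarations pulled in by an #include placed further down the source file would also be kept. Below is a minimal sketch of a variant. It assumes that a top-level node without a file entry belongs to the same file as the last top-level node that named one; that assumption comes from the behavior described above, not from Clang documentation. The name filter_by_current_file and the sample AST are made up for illustration:

```python
def filter_by_current_file(source_file, json_ast):
    """Return the top-level nodes attributed to source_file.

    Heuristic: remember the file named by the most recent top-level node and
    attribute nodes without a 'file' entry to that remembered file.
    """
    kept = []
    current_file = None
    for entry in json_ast.get('inner', []):
        loc = entry.get('loc', {})
        # Prefer the expansion location when the node comes from a macro.
        file_name = loc.get('expansionLoc', {}).get('file', loc.get('file'))
        if file_name is not None:
            current_file = file_name
        # Implicit (compiler-generated) nodes are never kept.
        if entry.get('isImplicit', False):
            continue
        if current_file == source_file:
            kept.append(entry)
    return kept


# Hand-made stand-in for Clang's JSON output (not a real dump).
example_ast = {
    'inner': [
        {'kind': 'TypedefDecl', 'name': '__int128_t', 'isImplicit': True, 'loc': {}},
        {'kind': 'FunctionDecl', 'name': 'printf', 'loc': {'file': '/usr/include/stdio.h', 'line': 1}},
        {'kind': 'FunctionDecl', 'name': 'my_function', 'loc': {'file': 'example.c', 'line': 3}},
        {'kind': 'FunctionDecl', 'name': 'main', 'loc': {'line': 8}},
        {'kind': 'FunctionDecl', 'name': 'late_helper', 'loc': {'file': 'late.h', 'line': 1}},
        {'kind': 'FunctionDecl', 'name': 'after_include', 'loc': {'file': 'example.c', 'line': 20}},
    ]
}

names = [node['name'] for node in filter_by_current_file('example.c', example_ast)]
print(names)  # ['my_function', 'main', 'after_include']
```

Unlike the function above, this sketch returns a new list instead of modifying the AST in place, and it is just as much a heuristic: it only looks at top-level nodes.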
Upvotes: 3
Reputation: 2076
The dumped AST has some indication of the source file for every node, so it can be filtered based on the loc data of the second-level AST nodes.
You need to match file in loc and file in expansionLoc in loc against the name of the top-level file. This seems to work decently for me. Some of the nodes don't contain these elements for some reason. Nodes with isImplicit should be safe to skip, but I'm not sure what is going on with other nodes that lack file name information.
The following Python script filters 'astdump.json' to 'astdump.filtered.json' using these rules (doing the conversion in a streaming manner is left as an exercise for the reader):
#! /usr/bin/python3
import json
import sys

# The single argument is the name of the top-level source file to keep
if len(sys.argv) != 2:
    print('Usage: ' + sys.argv[0] + ' filename')
    sys.exit(1)
filename = sys.argv[1]

with open('astdump.json', 'rb') as input, open('astdump.filtered.json', 'w') as output:
    toplevel = json.load(input)
    new_inner = []
    for o in toplevel['inner']:
        # Skip implicit (compiler-generated) nodes
        if o.get('isImplicit', False):
            continue
        file_name = None
        loc = o.get('loc', {})
        if 'file' in loc:
            file_name = loc['file']
        # For macro expansions, the file is stored under expansionLoc
        if 'expansionLoc' in loc:
            if 'file' in loc['expansionLoc']:
                file_name = loc['expansionLoc']['file']
        # Nodes without file information are dropped as well
        if file_name != filename:
            continue
        new_inner.append(o)
    toplevel['inner'] = new_inner
    json.dump(toplevel, output, indent=4)
Upvotes: 0
Reputation: 22267
Filtering on a specific identifier is fine, using -ast-dump-filter. But what if you want the AST for all identifiers in one file?
I came up with the following solution:
Add one recognizable line after the includes:
#include <iostream>
int XX_MARKER_XX = 123234; // marker line for ast-dump
void my_lambda()
...
Then dump the AST with
clang-check -extra-arg=-std=c++1y -ast-dump lambda.cpp > ast.txt
You can easily cut away everything before XX_MARKER_XX with sed:
cat ast.txt | sed -n '/XX_MARKER_XX/,$p' | less
Still a lot, but much more useful with bigger files.
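If sed is not available, the same cut can be sketched in Python. This is only an illustration of the marker idea: the helper name lines_from_marker and the sample lines are made up, and the real input would be the ast.txt produced above.

```python
def lines_from_marker(lines, marker='XX_MARKER_XX'):
    """Yield every line starting at the first one that contains marker."""
    found = False
    for line in lines:
        if not found and marker in line:
            found = True
        if found:
            yield line


# Made-up stand-in for the contents of ast.txt.
sample = [
    "TranslationUnitDecl <<invalid sloc>> <invalid sloc>",
    "|-NamespaceDecl std",
    "|-VarDecl XX_MARKER_XX 'int' cinit",
    "`-FunctionDecl my_lambda 'void ()'",
]

trimmed = list(lines_from_marker(sample))
for line in trimmed:
    print(line)
```

Any iterable of lines works, so an open file can be passed directly, for example `with open('ast.txt') as f: sys.stdout.writelines(lines_from_marker(f))` after `import sys`.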
Upvotes: 4
Reputation: 4283
This is a problem with C++, not with clang: there are no files in C++, there's just the compilation unit. When you #include a file, you include all definitions in that file (recursively) into your compilation unit, and there's no way to differentiate them (it's what the standard expects your compiler to do).
Imagine a different scenario:
/////////////////////////////
// headertmp.h
#if defined(A)
struct Foo {
    int bar;
};
#elif defined(B)
struct Foo {
    short bar;
};
#endif
/////////////////////////////
// foobar.cpp
#ifndef A
# define B
#endif
#include "headertmp.h"
void foobar(Foo foo) {
    // do stuff to foo.bar
}
Your foobar.cpp declares a struct called Foo and a function called foobar, but headertmp.h itself doesn't define any Foo unless A or B is defined. Only in the compilation unit of foobar, where the two come together, can you make sense of headertmp.h.
If you are interested in a subset of the declarations inside a compilation unit, you will have to extract the necessary information from the generated AST directly (similar to what a linker has to do when linking together different compilation units). Of course you can then filter the AST of this compilation unit on any metadata your parser extracts.
Upvotes: 1
Reputation: 13220
clang-check might be useful here. It has the option -ast-dump-filter=<string>, documented as follows:
-ast-dump-filter=<string> - Use with -ast-dump or -ast-print to dump/print only AST declaration nodes having a certain substring in a qualified name. Use -ast-list to list all filterable declaration node names.
Consider running clang-check with -ast-dump-filter=my_lambda on the sample code (lambda.cpp):
#include <iostream>
void my_lambda()
{
    auto lambda = [](auto x, auto y) {return x + y;};
    std::cout << "fabricati diem";
}
It dumps only the matching declaration node, FunctionDecl my_lambda 'void (void)'.
Here are the command-line arguments and a few lines of the output:
$ clang-check -extra-arg=-std=c++1y -ast-dump -ast-dump-filter=my_lambda lambda.cpp --
FunctionDecl 0x2ddf630 <lambda.cpp:3:1, line:7:1> line:3:6 my_lambda 'void (void)'
`-CompoundStmt 0x2de1558 <line:4:1, line:7:1>
  |-DeclStmt 0x2de0960 <line:5:9, col:57>
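When the filter is applied from a script, the command line above can be assembled programmatically. The following is a rough sketch: the helper names build_clang_check_command and dump_filtered_ast are hypothetical, the flags are copied from the command shown above, and it assumes clang-check is on the PATH.

```python
import subprocess


def build_clang_check_command(source_file, name_filter, std='c++1y'):
    """Build the clang-check argument list used in the command above."""
    return [
        'clang-check',
        f'-extra-arg=-std={std}',
        '-ast-dump',
        f'-ast-dump-filter={name_filter}',
        source_file,
        '--',
    ]


def dump_filtered_ast(source_file, name_filter):
    """Run clang-check and return the filtered AST dump as text."""
    result = subprocess.run(
        build_clang_check_command(source_file, name_filter),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


command = build_clang_check_command('lambda.cpp', 'my_lambda')
print(' '.join(command))
```

Only the command is built and printed here; calling dump_filtered_ast('lambda.cpp', 'my_lambda') would actually invoke clang-check.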
Upvotes: 19