Abstract Syntax Tree

This blog post follows my introductory post to Automatic programming. The syntax of a programming language is defined as the set of rules that defines correctly structured code. There are many possible syntax errors. Some examles are:

  • Opened parantheses are not closed
  • Indendations are not correct
  • Variable names are written wrong

It is hard to generate syntactial correct source code. We can make our life easier by not generating source code directly, but generating a different representation of code, that doesn’t have these difficulties. I’m talking about Abstract syntax trees (AST). An AST is a tree representation of the abstract syntactic structure of source code. Source code exists only because it is readable and writeable by humans. AST is a representation that fits better two how a computer works. It exists as an intermediate representation of the program that the compiler requires. In comparison to source code ASTs doesn’t contain inessential punctuation and delimiters, like braces, semicolons and parentheses.  Abstract syntax trees are often used in program analysis and program transformation.

There are python libraries to parse and unparse source code to and from ASTs. This allows us to take existing source code, parse them to their AST representation and use them as examples for our machine learning model. It also makes it easy to create human readable source code from automatically generated ASTs.

The following jupyter notebook shows us how the AST representation of given python source code looks like, and how to write our own AST.

The Python Standard Library gives us the ast module to create and use ASTs. With the parse function Python source code is parsed into it’s tree representation. The dump function prints the AST. The output shows how to create the AST directly with the ast module.
The first example is a simple addition. There is an expression with the two variable x and y and the binary operation add.

In [1]:
import ast

add_tree = ast.parse('x+y')
ast.dump(add_tree)
Out[1]:
"Module(body=[Expr(value=BinOp(left=Name(id='x', ctx=Load()), op=Add(), right=Name(id='y', ctx=Load())))])"

The PyPI package showast has a nice visualization function for ASTs. The visualization shows the tree structure.

In [2]:
from showast import show_ast

show_ast(add_tree)


%3
0
Expr


1
BinOp


0–1

2
Name


1–2

5
Add


1–5

6
Name


1–6

3
“x”


2–3

4
Load


2–4

7
“y”


6–7

8
Load


6–8

In this example the addition is packed into a function. Here we see how actual python code is represented as an AST. I don’t show the dump of the AST, since it is already very convoluted, but with such a dump you could see how to create the AST directly from the ast module.
The visualization shows us, that there are several new nodes which define the function, arguments, return statement and variable definitions. We get the source code from a function with the inspect module. It would be also possible to visualize a function directly with show_source from the showast module.

In [3]:
import inspect

def add_func(x, y):
    return x + y

add_func_tree = ast.parse(inspect.getsource(add_func))
show_ast(add_func_tree)


%3
0
FunctionDef


1
“add_func”


0–1

2
arguments


0–2

9
Return


0–9

3
Name


2–3

6
Name


2–6

4
“x”


3–4

5
Param


3–5

7
“y”


6–7

8
Param


6–8

10
BinOp


9–10

11
Name


10–11

14
Add


10–14

15
Name


10–15

12
“x”


11–12

13
Load


11–13

16
“y”


15–16

17
Load


15–17

Python itself doesn’t provide a way to turn a compiled code object into an AST. There are third party tools that can do this. Astor is such a package.

In [4]:
import astor

print(astor.to_source(add_func_tree))
def add_func(x, y):
    return (x + y)

The last examples shows how to create an AST directly with the ast module. Here you see the implementation of the leibniz formula. This is a formula to calculate the value of pi. The number of iterations corresponds to the accuracy of the result. Here we see why we are writing programs in a programming language and not as an AST. The tree representation is nearly unreadable. You can imagine that modules with hundreds or thousands lines of code result in humongous trees.

In [5]:
arguments = ast.arguments(args=[ast.Name(id='iterations', ctx=ast.Param())], vararg=None, kwarg=None, defaults=[])

pi_quarter = ast.Assign(targets=[ast.Name(id='pi_quarter', ctx=ast.Store())], value=ast.Num(n=0.0))

step_value = ast.BinOp(left=ast.BinOp(left=ast.Num(n=-1.0), op=ast.Pow(), right=ast.Name(id='step', ctx=ast.Load())),
                       op=ast.Div(), right=ast.BinOp(left=ast.BinOp(left=ast.Num(n=2.0),
                                                                    op=ast.Mult(),
                                                                    right=ast.Name(id='step', ctx=ast.Load())),
                                                     op=ast.Add(), right=ast.Num(n=1.0)))
for_loop = ast.For(target=ast.Name(id='step', ctx=ast.Store()),
                   iter=ast.Call(func=ast.Name(id='range', ctx=ast.Load()),
                                 args=[ast.Name(id='iterations', ctx=ast.Load())],
                                 keywords=[], starargs=None, kwargs=None),
                   body=[ast.AugAssign(target=ast.Name(id='pi_quarter', ctx=ast.Store()),
                                       op=ast.Add(), value=step_value)], orelse=[])
pi = ast.Assign(targets=[ast.Name(id='pi', ctx=ast.Store())],
                value=ast.BinOp(left=ast.Num(n=4.0), op=ast.Mult(), right=ast.Name(id='pi_quarter', ctx=ast.Load())))
func = ast.FunctionDef(name='leibniz_formula',
                       args=arguments,
                       body=[pi_quarter, for_loop, pi, ast.Return(value=ast.Name(id='pi', ctx=ast.Load()))],
                       decorator_list=[])
ast_leibniz_formula = ast.fix_missing_locations(ast.Module(body=[func]))

print(astor.to_source(ast_leibniz_formula))
def leibniz_formula(iterations):
    pi_quarter = 0.0
    for step in range(iterations):
        pi_quarter += (((-1.0) ** step) / ((2.0 * step) + 1.0))
    pi = (4.0 * pi_quarter)
    return pi
In [6]:
show_ast(ast_leibniz_formula)


%3
0
FunctionDef


1
“leibniz_formula”


0–1

2
arguments


0–2

6
Assign


0–6

12
For


0–12

48
Assign


0–48

59
Return


0–59

3
Name


2–3

4
“iterations”


3–4

5
Param


3–5

7
Name


6–7

10
Num


6–10

8
“pi_quarter”


7–8

9
Store


7–9

11
0.0


10–11

13
Name


12–13

16
Call


12–16

23
AugAssign


12–23

14
“step”


13–14

15
Store


13–15

17
Name


16–17

20
Name


16–20

18
“range”


17–18

19
Load


17–19

21
“iterations”


20–21

22
Load


20–22

24
Name


23–24

27
Add


23–27

28
BinOp


23–28

25
“pi_quarter”


24–25

26
Store


24–26

29
BinOp


28–29

36
Div


28–36

37
BinOp


28–37

30
Num


29–30

32
Pow


29–32

33
Name


29–33

31
-1.0


30–31

34
“step”


33–34

35
Load


33–35

38
BinOp


37–38

45
Add


37–45

46
Num


37–46

39
Num


38–39

41
Mult


38–41

42
Name


38–42

40
2.0


39–40

43
“step”


42–43

44
Load


42–44

47
1.0


46–47

49
Name


48–49

52
BinOp


48–52

50
“pi”


49–50

51
Store


49–51

53
Num


52–53

55
Mult


52–55

56
Name


52–56

54
4.0


53–54

57
“pi_quarter”


56–57

58
Load


56–58

60
Name


59–60

61
“pi”


60–61

62
Load


60–62

We can compile and execute our newly created function with the built-in compile function and the exec statement. The leibniz_formula is now usable as function we are used to in Python.

In [7]:
exec(compile(ast_leibniz_formula, filename='<ast>', mode='exec'))
print(leibniz_formula(1000))
3.14059265384