Abstract Syntax Tree

This blog post follows my introductory post to Automatic programming. The syntax of a programming language is defined as the set of rules that defines correctly structured code. There are many possible syntax errors. Some examles are:

  • Opened parantheses are not closed
  • Indendations are not correct
  • Variable names are written wrong

It is hard to generate syntactial correct source code. We can make our life easier by not generating source code directly, but generating a different representation of code, that doesn’t have these difficulties. I’m talking about Abstract syntax trees (AST). An AST is a tree representation of the abstract syntactic structure of source code. Source code exists only because it is readable and writeable by humans. AST is a representation that fits better two how a computer works. It exists as an intermediate representation of the program that the compiler requires. In comparison to source code ASTs doesn’t contain inessential punctuation and delimiters, like braces, semicolons and parentheses.  Abstract syntax trees are often used in program analysis and program transformation.

There are python libraries to parse and unparse source code to and from ASTs. This allows us to take existing source code, parse them to their AST representation and use them as examples for our machine learning model. It also makes it easy to create human readable source code from automatically generated ASTs.

The following jupyter notebook shows us how the AST representation of given python source code looks like, and how to write our own AST.

The Python Standard Library gives us the ast module to create and use ASTs. With the parse function Python source code is parsed into it’s tree representation. The dump function prints the AST. The output shows how to create the AST directly with the ast module.
The first example is a simple addition. There is an expression with the two variable x and y and the binary operation add.

In [1]:
import ast

add_tree = ast.parse('x+y')
ast.dump(add_tree)
Out[1]:
"Module(body=[Expr(value=BinOp(left=Name(id='x', ctx=Load()), op=Add(), right=Name(id='y', ctx=Load())))])"

The PyPI package showast has a nice visualization function for ASTs. The visualization shows the tree structure.

In [2]:
from showast import show_ast

show_ast(add_tree)
%3 0 Expr 1 BinOp 0–1 2 Name 1–2 5 Add 1–5 6 Name 1–6 3 "x" 2–3 4 Load 2–4 7 "y" 6–7 8 Load 6–8

In this example the addition is packed into a function. Here we see how actual python code is represented as an AST. I don’t show the dump of the AST, since it is already very convoluted, but with such a dump you could see how to create the AST directly from the ast module.
The visualization shows us, that there are several new nodes which define the function, arguments, return statement and variable definitions. We get the source code from a function with the inspect module. It would be also possible to visualize a function directly with show_source from the showast module.

In [3]:
import inspect

def add_func(x, y):
    return x + y

add_func_tree = ast.parse(inspect.getsource(add_func))
show_ast(add_func_tree)
%3 0 FunctionDef 1 "add_func" 0–1 2 arguments 0–2 9 Return 0–9 3 Name 2–3 6 Name 2–6 4 "x" 3–4 5 Param 3–5 7 "y" 6–7 8 Param 6–8 10 BinOp 9–10 11 Name 10–11 14 Add 10–14 15 Name 10–15 12 "x" 11–12 13 Load 11–13 16 "y" 15–16 17 Load 15–17

Python itself doesn’t provide a way to turn a compiled code object into an AST. There are third party tools that can do this. Astor is such a package.

In [4]:
import astor

print(astor.to_source(add_func_tree))
def add_func(x, y):
    return (x + y)

The last examples shows how to create an AST directly with the ast module. Here you see the implementation of the leibniz formula. This is a formula to calculate the value of pi. The number of iterations corresponds to the accuracy of the result. Here we see why we are writing programs in a programming language and not as an AST. The tree representation is nearly unreadable. You can imagine that modules with hundreds or thousands lines of code result in humongous trees.

In [5]:
arguments = ast.arguments(args=[ast.Name(id='iterations', ctx=ast.Param())], vararg=None, kwarg=None, defaults=[])

pi_quarter = ast.Assign(targets=[ast.Name(id='pi_quarter', ctx=ast.Store())], value=ast.Num(n=0.0))

step_value = ast.BinOp(left=ast.BinOp(left=ast.Num(n=-1.0), op=ast.Pow(), right=ast.Name(id='step', ctx=ast.Load())),
                       op=ast.Div(), right=ast.BinOp(left=ast.BinOp(left=ast.Num(n=2.0),
                                                                    op=ast.Mult(),
                                                                    right=ast.Name(id='step', ctx=ast.Load())),
                                                     op=ast.Add(), right=ast.Num(n=1.0)))
for_loop = ast.For(target=ast.Name(id='step', ctx=ast.Store()),
                   iter=ast.Call(func=ast.Name(id='range', ctx=ast.Load()),
                                 args=[ast.Name(id='iterations', ctx=ast.Load())],
                                 keywords=[], starargs=None, kwargs=None),
                   body=[ast.AugAssign(target=ast.Name(id='pi_quarter', ctx=ast.Store()),
                                       op=ast.Add(), value=step_value)], orelse=[])
pi = ast.Assign(targets=[ast.Name(id='pi', ctx=ast.Store())],
                value=ast.BinOp(left=ast.Num(n=4.0), op=ast.Mult(), right=ast.Name(id='pi_quarter', ctx=ast.Load())))
func = ast.FunctionDef(name='leibniz_formula',
                       args=arguments,
                       body=[pi_quarter, for_loop, pi, ast.Return(value=ast.Name(id='pi', ctx=ast.Load()))],
                       decorator_list=[])
ast_leibniz_formula = ast.fix_missing_locations(ast.Module(body=[func]))

print(astor.to_source(ast_leibniz_formula))
def leibniz_formula(iterations):
    pi_quarter = 0.0
    for step in range(iterations):
        pi_quarter += (((-1.0) ** step) / ((2.0 * step) + 1.0))
    pi = (4.0 * pi_quarter)
    return pi
In [6]:
show_ast(ast_leibniz_formula)
%3 0 FunctionDef 1 "leibniz_formula" 0–1 2 arguments 0–2 6 Assign 0–6 12 For 0–12 48 Assign 0–48 59 Return 0–59 3 Name 2–3 4 "iterations" 3–4 5 Param 3–5 7 Name 6–7 10 Num 6–10 8 "pi_quarter" 7–8 9 Store 7–9 11 0.0 10–11 13 Name 12–13 16 Call 12–16 23 AugAssign 12–23 14 "step" 13–14 15 Store 13–15 17 Name 16–17 20 Name 16–20 18 "range" 17–18 19 Load 17–19 21 "iterations" 20–21 22 Load 20–22 24 Name 23–24 27 Add 23–27 28 BinOp 23–28 25 "pi_quarter" 24–25 26 Store 24–26 29 BinOp 28–29 36 Div 28–36 37 BinOp 28–37 30 Num 29–30 32 Pow 29–32 33 Name 29–33 31 -1.0 30–31 34 "step" 33–34 35 Load 33–35 38 BinOp 37–38 45 Add 37–45 46 Num 37–46 39 Num 38–39 41 Mult 38–41 42 Name 38–42 40 2.0 39–40 43 "step" 42–43 44 Load 42–44 47 1.0 46–47 49 Name 48–49 52 BinOp 48–52 50 "pi" 49–50 51 Store 49–51 53 Num 52–53 55 Mult 52–55 56 Name 52–56 54 4.0 53–54 57 "pi_quarter" 56–57 58 Load 56–58 60 Name 59–60 61 "pi" 60–61 62 Load 60–62

We can compile and execute our newly created function with the built-in compile function and the exec statement. The leibniz_formula is now usable as function we are used to in Python.

In [7]:
exec(compile(ast_leibniz_formula, filename='<ast>', mode='exec'))
print(leibniz_formula(1000))
3.14059265384

Add a Comment

Your email address will not be published. Required fields are marked *