Best way to implement an abstract syntax tree?
Hi all.
I am making a small scripting language for fun. I have managed to get the tokenizer working, but now I have another problem. Basically, I am dividing the source code into a series of tokens with the tokenizer, and then I am going to use those tokens to build a tree structure representing the code (called an abstract syntax tree). The thing is, I am not sure what the best way to implement one of those is. Here is the Wikipedia article about it:
Abstract syntax tree - Wikipedia, the free encyclopedia
I am aware that it is a rather fuzzy question, but does anyone know the best, or at least a decent, way to implement an abstract syntax tree?
Any help would be greatly appreciated :).
Take care,
Kerr
Re: Best way to implement an abstract syntax tree?
Do you know how to implement a tree data structure? If not, this should be the first place to start. There are lots of tutorials online. Here is a decent one that concentrates on binary trees but gives a good overview of some of the fundamental concepts:
Binary Trees
Re: Best way to implement an abstract syntax tree?
Quote:
Originally Posted by
copeg
Do you know how to implement a tree data structure? If not, this should be the first place to start. There are lots of tutorials online. Here is a decent one that concentrates on binary trees but gives a good overview of some of the fundamental concepts:
Binary Trees
I know how a tree structure works. It is the case of an abstract syntax tree I am unsure of. For example, if I want to implement a function that generates Fibonacci numbers, I would maybe do something like this (in my scripting language):
Code :
func fib[num]
    if num < 2
        return num
    end
    return fib[num-1] + fib[num-2]
end
What would the abstract syntax tree look like? How would I represent the different parts (like the function definition and the if statement)?
Re: Best way to implement an abstract syntax tree?
You need to first define the relationship between the nodes and what the nodes and their organization represent (code and code structure). You could start by first creating a Node class (containing references to the parent and any children). From there you can define different nodes for different purposes (comparisons, control statements, variables, method calls, etc...) which could in part define how the tree will be traversed. How you design this is up to you (make Node abstract, create an interface that you pass to Node upon instantiation, etc...).
For the example you posted, the func might define a node with one parent (the parent being a node representing the parameter value) and one child, the child being a comparison Node (evaluates to a boolean). The comparison Node in turn could have 2 child nodes: one returns the num value to its parent and the other returns the evaluation of the method call. Which of these is chosen depends on the evaluation of the boolean.
My advice would be to start with something more trivial than a recursive method. Start with control structures and variable definitions, then move to method calls and more complex things. This is an excellent problem however, and thanks for asking the question because it has piqued my interest.
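As a rough illustration of the "base Node plus specialised nodes" idea, here is one way the fib example could be laid out as a tree. This is only a sketch under my own assumptions: every class name (AstNode, FuncDef, IfStmt, ReturnStmt, BinaryOp, Call, VarRef, NumberLit, FibAst) is made up for the example, the parameter list is stored on the function node rather than as a separate node, and parent references are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical base class: every node has a label and an ordered list of children.
abstract class AstNode {
    final List<AstNode> children = new ArrayList<>();

    AstNode(AstNode... kids) {
        children.addAll(Arrays.asList(kids));
    }

    abstract String label();

    // Renders the subtree as indented text, one node per line.
    String dump(String indent) {
        StringBuilder sb = new StringBuilder(indent).append(label()).append('\n');
        for (AstNode child : children) {
            sb.append(child.dump(indent + "  "));
        }
        return sb.toString();
    }
}

class FuncDef extends AstNode {
    final String name;
    final List<String> params;
    FuncDef(String name, List<String> params, AstNode... body) {
        super(body);
        this.name = name;
        this.params = params;
    }
    String label() { return "func " + name + params; }
}

class IfStmt extends AstNode {
    // children: [condition, statement to run when the condition is true]
    IfStmt(AstNode condition, AstNode thenBranch) { super(condition, thenBranch); }
    String label() { return "if"; }
}

class ReturnStmt extends AstNode {
    ReturnStmt(AstNode value) { super(value); }
    String label() { return "return"; }
}

class BinaryOp extends AstNode {
    final String op;
    BinaryOp(String op, AstNode left, AstNode right) { super(left, right); this.op = op; }
    String label() { return op; }
}

class Call extends AstNode {
    final String callee;
    Call(String callee, AstNode... args) { super(args); this.callee = callee; }
    String label() { return "call " + callee; }
}

class VarRef extends AstNode {
    final String name;
    VarRef(String name) { super(); this.name = name; }
    String label() { return "var " + name; }
}

class NumberLit extends AstNode {
    final int value;
    NumberLit(int value) { super(); this.value = value; }
    String label() { return "num " + value; }
}

class FibAst {
    // Builds the tree for the fib function from the post by hand.
    static AstNode build() {
        AstNode guard = new IfStmt(
                new BinaryOp("<", new VarRef("num"), new NumberLit(2)),
                new ReturnStmt(new VarRef("num")));
        AstNode recurse = new ReturnStmt(new BinaryOp("+",
                new Call("fib", new BinaryOp("-", new VarRef("num"), new NumberLit(1))),
                new Call("fib", new BinaryOp("-", new VarRef("num"), new NumberLit(2)))));
        return new FuncDef("fib", Arrays.asList("num"), guard, recurse);
    }

    public static void main(String[] args) {
        System.out.print(build().dump(""));
    }
}
```

Running FibAst prints the tree one node per line, indented by depth, which makes it easy to check that a parser later produces the same shape.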
Re: Best way to implement an abstract syntax tree?
Quote:
Originally Posted by
copeg
You need to first define the relationship between the nodes and what the nodes and their organization represent (code and code structure). You could start by first creating a Node class (containing references to the parent and any children). From there you can define different nodes for different purposes (comparisons, control statements, variables, method calls, etc...) which could in part define how the tree will be traversed. How you design this is up to you (make Node abstract, create an interface that you pass to Node upon instantiation, etc...).
For the example you posted, the func might define a node with one parent (the parent being a node representing the parameter value) and one child, the child being a comparison Node (evaluates to a boolean). The comparison Node in turn could have 2 child nodes: one returns the num value to its parent and the other returns the evaluation of the method call. Which of these is chosen depends on the evaluation of the boolean.
I guess I can just try to keep it simple: make a base node and then have other nodes for more specific behaviour. The thing is, I want to avoid too many casts when I use the tree later on. I can think of other ways to do it. For example, when I searched on Google, someone mentioned that the visitor pattern could come in handy, but I am not sure if that is true or not. Another way could be to create one general node class that just defines a parent node, a list of child nodes, and maybe a string holding any other data it needs (like a variable name). Then an enum would be used to identify what kind of node it is. But I don't know if that is a good idea. Rather tired at the moment, so I cannot think correctly, lol.
As a note... I have a habit of getting stuck on the problems I see rather than the solutions to those problems, which is probably a major reason why I have issues figuring this out.
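For reference, this is roughly what the visitor pattern looks like when applied to a tiny expression tree. It is a minimal sketch, not a recommendation: all names (Visitor, Expr, NumberNode, BinaryNode, Evaluator, Printer) are invented for the example, and only numbers and three binary operators are covered. The point is that each operation over the tree lives in its own class and no casts or instanceof checks are needed.

```java
// All names here are hypothetical and chosen only for illustration.
interface Visitor<R> {
    R visitNumber(NumberNode n);
    R visitBinary(BinaryNode n);
}

abstract class Expr {
    // Each node type calls back the matching visit method (double dispatch).
    abstract <R> R accept(Visitor<R> v);
}

class NumberNode extends Expr {
    final int value;
    NumberNode(int value) { this.value = value; }
    <R> R accept(Visitor<R> v) { return v.visitNumber(this); }
}

class BinaryNode extends Expr {
    final char op;
    final Expr left;
    final Expr right;
    BinaryNode(char op, Expr left, Expr right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }
    <R> R accept(Visitor<R> v) { return v.visitBinary(this); }
}

// One operation over the tree: evaluate it to an int.
class Evaluator implements Visitor<Integer> {
    public Integer visitNumber(NumberNode n) { return n.value; }
    public Integer visitBinary(BinaryNode n) {
        int l = n.left.accept(this);
        int r = n.right.accept(this);
        switch (n.op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            default: throw new IllegalArgumentException("unknown operator: " + n.op);
        }
    }
}

// A second operation, added without touching the node classes.
class Printer implements Visitor<String> {
    public String visitNumber(NumberNode n) { return Integer.toString(n.value); }
    public String visitBinary(BinaryNode n) {
        return "(" + n.left.accept(this) + " " + n.op + " " + n.right.accept(this) + ")";
    }
}
```

The trade-off is the usual one: adding a new operation means adding one visitor class, while adding a new node type means touching the Visitor interface and every existing visitor.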
Quote:
My advice would be to start with something more trivial than a recursive method. Start with control structures and variable definitions, then move to method calls and things more complex.
You are probably right. Just so easy to think too large xD.
Quote:
This is an excellent problem however, and thanks for asking the question because it has piqued my interest
I have taken an interest in compilers and such things for some odd reason. It just seems like a very good challenge, and thus, a great way to improve as a programmer. That's my theory, anyway :P.
Re: Best way to implement an abstract syntax tree?
I would suggest using ANTLR to create a simple scripting language. ANTLR is a parser generator: from a grammar it generates a tokenizer (lexer) and a recursive descent parser, which is perfect for what it sounds like you want to accomplish.
If you want to write your own parser, you can look at the examples and see how ANTLR produces its ASTs (they're not actually binary trees; each node can have any number of children).
For more information on the method ANTLR uses: Wikipedia - Recursive Descent Parser
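To give an idea of what a hand-written recursive descent parser looks like, here is a small sketch that turns a token list into a tree. It is not how ANTLR itself is implemented; the grammar, the Tree class and the ExprParser class are all assumptions made up for this example, and it only handles integer arithmetic with parentheses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal tree node: a label plus any number of children.
class Tree {
    final String label;
    final List<Tree> children = new ArrayList<>();

    Tree(String label, Tree... kids) {
        this.label = label;
        children.addAll(Arrays.asList(kids));
    }

    // Prints leaves as their label and inner nodes as (label child child ...).
    @Override
    public String toString() {
        if (children.isEmpty()) return label;
        StringBuilder sb = new StringBuilder("(").append(label);
        for (Tree c : children) sb.append(' ').append(c);
        return sb.append(')').toString();
    }
}

// Grammar assumed for this sketch (one method per rule):
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
class ExprParser {
    private final List<String> tokens;
    private int pos = 0;

    ExprParser(List<String> tokens) { this.tokens = tokens; }

    // Looks at the current token without consuming it ("" at end of input).
    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private String next() {
        if (pos >= tokens.size()) throw new IllegalStateException("unexpected end of input");
        return tokens.get(pos++);
    }

    Tree parse() {
        Tree t = expr();
        if (pos != tokens.size()) throw new IllegalStateException("unexpected token: " + peek());
        return t;
    }

    private Tree expr() {
        Tree left = term();
        while (peek().equals("+") || peek().equals("-")) {
            String op = next();
            left = new Tree(op, left, term()); // left-associative
        }
        return left;
    }

    private Tree term() {
        Tree left = factor();
        while (peek().equals("*") || peek().equals("/")) {
            String op = next();
            left = new Tree(op, left, factor());
        }
        return left;
    }

    private Tree factor() {
        String tok = next();
        if (tok.equals("(")) {
            Tree inner = expr();
            String close = next();
            if (!close.equals(")")) throw new IllegalStateException("expected ')' but found " + close);
            return inner;
        }
        if (!tok.matches("\\d+")) throw new IllegalStateException("expected a number but found " + tok);
        return new Tree(tok);
    }
}
```

Operator precedence falls out of the rule nesting: expr calls term, term calls factor, so multiplication binds tighter than addition without any explicit precedence table.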
Re: Best way to implement an abstract syntax tree?
Quote:
Originally Posted by
helloworld922
I would suggest using ANTLR to create a simple scripting language. ANTLR is a parser generator: from a grammar it generates a tokenizer (lexer) and a recursive descent parser, which is perfect for what it sounds like you want to accomplish.
If you want to write your own parser, you can look at the examples and see how ANTLR produces its ASTs (they're not actually binary trees; each node can have any number of children).
For more information on the method ANTLR uses: Wikipedia - Recursive Descent Parser
Thanks. I want to write my own parser, but as you suggested I can take a look at how they did it. Quite frankly I think that is a great idea!
Re: Best way to implement an abstract syntax tree?
An update. Have taken a look at ANTLR, and how it does things. So I have decided that I will implement the tree with this class:
Code java:
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

public class Node implements Iterable<Node> {
    private Node parent;
    private NodeType type;
    private String data;
    private int lineNum;
    protected List<Node> children;

    /**
     * Creates a new node with the given parent node and node type. It also
     * takes the line number the node was on in the source file as an argument
     * to enable easier debugging of scripts.
     *
     * @param type the type of node
     * @param parent the parent node
     * @param lineNum the line number
     */
    public Node(NodeType type, Node parent, int lineNum) {
        this(type, parent, lineNum, null);
    }

    /**
     * Creates a new node instance with the given parent node, node type and
     * data string. It also takes the line number the node was on in the source
     * file as an argument to enable easier debugging of scripts.
     *
     * @param type the type of node
     * @param parent the parent node
     * @param lineNum the line number
     * @param data the data of the node
     */
    public Node(NodeType type, Node parent, int lineNum, String data) {
        this.parent = parent;
        this.type = type;
        this.lineNum = lineNum;
        this.data = data;
        children = new ArrayList<>();
    }

    /**
     * Get the parent node of this node
     *
     * @return the parent node, or null for the root node
     */
    public Node getParent() {
        // Made public: a private getter could not be used by code walking the tree.
        return parent;
    }

    /**
     * Get the node at the given index
     *
     * @param index the index
     * @return the node at the given index
     */
    public Node getChild(int index) {
        return children.get(index);
    }

    /**
     * Sets the node at the given index
     *
     * @param index the index
     * @param n the node
     */
    public void setChild(int index, Node n) {
        children.set(index, n);
    }

    /**
     * Adds a new node at the end of the node list
     *
     * @param n the node to be added
     */
    public void addChild(Node n) {
        children.add(n);
    }

    /**
     * Add a collection of child nodes to this node
     *
     * @param nodes the child nodes to be added
     */
    public void addChildren(Collection<Node> nodes) {
        children.addAll(nodes);
    }

    /**
     * Removes the node at the given index
     *
     * @param index the index of the child node to remove
     */
    public void removeChild(int index) {
        children.remove(index);
    }

    /**
     * How many child nodes this node has
     *
     * @return the number of child nodes of this node
     */
    public int childCount() {
        return children.size();
    }

    /**
     * Return the type of node
     *
     * @return what type this node is
     */
    public NodeType getType() {
        return type;
    }

    /**
     * Return the data contained in the node, or null if none exists
     *
     * @return the node data
     */
    public String getData() {
        return data;
    }

    /**
     * Return the line number of the node
     *
     * @return the line number
     */
    public int getLineNumber() {
        return lineNum;
    }

    /**
     * Returns an iterator over the child nodes
     *
     * @return an iterator over the child nodes
     */
    @Override
    public Iterator<Node> iterator() {
        return children.iterator();
    }

    /**
     * Creates a new root node
     *
     * @return the new root node
     */
    public static Node newRootNode() {
        return new Node(NodeType.ROOT, null, 0, null);
    }
}
Basically, each node will contain a node type, which says what it represents (a loop, a function, an assignment, etc.). Then it will have a string that contains any optional data, like a variable name or a number. Any opinions? Is this a good or bad way?
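To make the question more concrete, here is a sketch of how a tree built from nodes like this might be consumed later on. It is self-contained, so it uses a trimmed-down stand-in for the Node class above (no parent or line number), and the NodeType values (NUMBER, ADD, MULTIPLY, NEGATE) as well as the TreeWalker class are assumptions made up for the example, since the enum is not shown in this post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node kinds; the real NodeType values are not shown in the post.
enum NodeType { NUMBER, ADD, MULTIPLY, NEGATE }

// Trimmed-down stand-in for the Node class above (parent and line number omitted).
class Node {
    private final NodeType type;
    private final String data;
    private final List<Node> children = new ArrayList<>();

    Node(NodeType type, String data) {
        this.type = type;
        this.data = data;
    }

    // Returns this node so that calls can be chained when building a tree by hand.
    Node addChild(Node n) {
        children.add(n);
        return this;
    }

    Node getChild(int index) { return children.get(index); }
    int childCount() { return children.size(); }
    NodeType getType() { return type; }
    String getData() { return data; }
}

class TreeWalker {
    // Evaluates a subtree by switching on the node type. No casts are needed,
    // but the string payload has to be parsed wherever it is used.
    static int eval(Node n) {
        switch (n.getType()) {
            case NUMBER:
                return Integer.parseInt(n.getData());
            case ADD:
                return eval(n.getChild(0)) + eval(n.getChild(1));
            case MULTIPLY:
                return eval(n.getChild(0)) * eval(n.getChild(1));
            case NEGATE:
                return -eval(n.getChild(0));
            default:
                throw new IllegalStateException("unhandled node type: " + n.getType());
        }
    }
}
```

With this layout the walker stays free of casts, at the cost of two things the compiler cannot check: that each node type has the expected number of children, and that the data string holds what the walker expects.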