Syntax Tokenizer Extension for Saxon

This extension is no longer maintained. The current version works only with Saxon 6.5, not with Saxon 8.x.

This extension for Saxon 6.5 provides an XSLT extension function for tokenizing programming code fragments. The extension function can be used to perform on-the-fly syntax highlighting during the transformation process.

How does it work? The function returns an interim node-set containing a tokenized representation of the code. The stylesheet can then process this node-set, e.g. to generate formatted, syntax-highlighted output.

Download and Installation

Download the saxon6-tokenizer.zip distribution. After downloading, unzip the file and make sure that saxon6tokenizer.jar is on your classpath the next time you run Saxon.

Note: The extension works only with Saxon 6.5 (and maybe some other versions of the 6.x series). It does not work with Saxon 8.

Usage

To use the extension in Saxon, you have to associate the extension’s namespace java:com.oliverh.xsltext.tokenizer.Saxon6Tokenizer (that’s the Java class defining the extension) with a namespace prefix of your choice. I will use the prefix syn in the following examples.

If the namespace is properly bound, the following two extension functions are available in XPath expressions:

syn:tokenize(program,language): Returns a node-set containing the tokenized representation of the given program (for more details, see below).
syn:supportsSyntax(language): Returns a boolean value indicating whether the given programming language is supported by the tokenizer or not.

All arguments are interpreted as string values.

Tokenized Node-set

The syn:tokenize function splits the given program fragment into lines and tokens and returns them as a node-set. The node-set consists of <line> elements, which in turn contain nested <token> elements and text nodes.

The <token> elements have an attribute class describing the type of the token. Possible values are keyword, literal, operator and comment.
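
The following is a minimal sketch of how such a node-set could be turned into highlighted HTML; it is not the code shipped with the extension. To stay runnable without Saxon 6.5 or the tokenizer jar, it uses the XSLT 1.0 processor built into the JDK and feeds it a hand-written <line>/<token> fragment instead of calling syn:tokenize. The wrapper element code, the pre/div/span output markup and the class name TokenHighlighter are assumptions made for this illustration only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: renders <line>/<token> markup as HTML.
final class TokenHighlighter {

    // Illustrative stylesheet (an assumption, not taken from test/test.xsl):
    // each <line> becomes a <div>, each <token> a <span> carrying the token class.
    // Plain text nodes are copied through by the built-in XSLT template rules.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='code'><pre><xsl:apply-templates/></pre></xsl:template>"
        + "<xsl:template match='line'><div><xsl:apply-templates/></div></xsl:template>"
        + "<xsl:template match='token'>"
        + "<span class='{@class}'><xsl:value-of select='.'/></span>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Transforms a tokenized fragment (wrapped in a single root element) to HTML.
    static String highlight(String tokenizedXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(tokenizedXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the node-set that syn:tokenize would return.
        String input =
              "<code>"
            + "<line><token class='comment'>// Print a hello message</token></line>"
            + "</code>";
        System.out.println(highlight(input));
    }
}
```

In a real Saxon 6.5 stylesheet, templates like these would presumably be applied to the result of syn:tokenize rather than to a parsed document, and the class values (keyword, literal, operator, comment) could then be styled with CSS.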

Supported Programming Languages

The extension uses the free jEdit Syntax Package as its tokenizer engine, which accepts the following programming language names:

Batch, CC, C, Eiffel, HTML, IDL, JavaScript, Java, Makefile, PHP, Patch, Perl, Props, Python, SQL, ShellScript, TSQL, TeX, XML, XPath

For more information about the tokenizer, have a look at the homepage of the jEdit Syntax Package. If you know a better tokenizer engine, please let me know.

Example

Consider the following Java code fragment:

// Print a hello message
public void sayHello() {
  System.out.println("Hello!");
}

If you pass this fragment to the syn:tokenize extension function with the value Java as the second argument, the function returns a node-set that begins as follows:

<line><token class='comment'>// Print a hello message</token></line>
<line><token class='keyword'>void</token> sayHello<token class='operator'>(</token>

For further details, have a look at test/test.xsl.