This extension for Saxon 6.5 provides an XSLT extension function for tokenizing programming code fragments. The extension function can be used to perform on-the-fly syntax highlighting during the transformation process.
How does it work? The function returns an interim node-set containing a tokenized representation of the code. The stylesheet can then process this node-set, e.g. to generate formatted, syntax-highlighted output.
Download the saxon6-tokenizer.zip distribution. After downloading, unzip the file and make sure that saxon6tokenizer.jar is on your classpath the next time you run Saxon.
Note: The extension works only with Saxon 6.5 (and maybe some other versions of the 6.x series). It does not work with Saxon 8.
To use the extension in Saxon, you have to associate the extension’s namespace java:com.oliverh.xsltext.tokenizer.Saxon6Tokenizer (that’s the Java class defining the extension) with a namespace prefix of your choice. I will use the prefix syn in the following examples.
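As a minimal sketch, the binding described above could look like this in an otherwise ordinary XSLT 1.0 stylesheet. The exclude-result-prefixes attribute is my addition and not required by the extension; it only keeps the syn declaration out of the result document.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:syn="java:com.oliverh.xsltext.tokenizer.Saxon6Tokenizer"
    exclude-result-prefixes="syn">

  <!-- Templates that call the syn: extension functions go here -->

</xsl:stylesheet>
```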
If the namespace is properly bound, the following two extension functions are available in XPath expressions:
syn:tokenize(program,language)
: Returns a node-set containing the tokenized representation of the given program (for more details, see below).
syn:supportsSyntax(language)
: Returns a boolean value indicating whether the given programming language is supported by the tokenizer.
All arguments of both functions are interpreted as string values.
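The two functions can be combined so that the stylesheet only tokenizes languages the engine knows. The following is a sketch under assumptions, not a tested recipe: the listing element, its language attribute and the tokens wrapper element are hypothetical names chosen for illustration. It simply dumps the raw tokenized representation into the output, which can be handy for inspecting what the tokenizer produces.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:syn="java:com.oliverh.xsltext.tokenizer.Saxon6Tokenizer"
    exclude-result-prefixes="syn">

  <!-- "listing" and its "language" attribute are hypothetical input markup -->
  <xsl:template match="listing">
    <xsl:choose>
      <!-- Tokenize only if the tokenizer reports support for the language -->
      <xsl:when test="syn:supportsSyntax(string(@language))">
        <tokens>
          <xsl:copy-of select="syn:tokenize(string(.), string(@language))"/>
        </tokens>
      </xsl:when>
      <!-- Otherwise fall back to the plain, unhighlighted text -->
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```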
The syn:tokenize function splits the given program fragment into lines and tokens and returns them as a node-set. The node-set consists of <line> elements, which in turn contain nested <token> elements and text nodes.
The <token> elements have an attribute class describing the type of the token. Possible values are keyword, literal, operator and comment.
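To illustrate how a stylesheet might turn this structure into highlighted HTML, here is a sketch. It makes some assumptions: the listing source element and the highlight mode name are hypothetical, and I assume that the <line> and <token> elements are in no namespace and that <line> elements do not carry their own line terminator. Each token becomes a span whose class is taken from the token's class attribute, so the colours can be defined in CSS.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:syn="java:com.oliverh.xsltext.tokenizer.Saxon6Tokenizer"
    exclude-result-prefixes="syn">

  <xsl:output method="html"/>

  <!-- "listing" is a hypothetical source element holding Java code -->
  <xsl:template match="listing">
    <pre class="code">
      <xsl:apply-templates select="syn:tokenize(string(.), 'Java')" mode="highlight"/>
    </pre>
  </xsl:template>

  <!-- One <line> per source line: emit its content, then a newline -->
  <xsl:template match="line" mode="highlight">
    <xsl:apply-templates mode="highlight"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Wrap each token in a span carrying its class
       (keyword, literal, operator or comment) -->
  <xsl:template match="token" mode="highlight">
    <span class="{@class}">
      <xsl:value-of select="."/>
    </span>
  </xsl:template>

  <!-- Text nodes between tokens are copied through by the
       built-in template rule, which applies in every mode -->

</xsl:stylesheet>
```

A stylesheet rule such as span.keyword { font-weight: bold } in the surrounding HTML page would then style the keywords.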
The extension uses the free jEdit Syntax Package as its tokenizer engine, which accepts the following programming language names:
Batch, CC, C, Eiffel, HTML, IDL, JavaScript, Java, Makefile, PHP, Patch, Perl, Props, Python, SQL, ShellScript, TSQL, TeX, XML, XPath
For more information about the tokenizer, have a look at the homepage of the jEdit Syntax Package. If you know a better tokenizer engine, please let me know.
Consider the following Java code fragment:
    // Print a hello message
    public void sayHello() {
        System.out.println("Hello!");
    }
If you provide this fragment to the syn:tokenize extension function with the value Java as the second argument, the function will return a node-set which corresponds to:
    <line><token class='comment'>// Print a hello message</token></line>
    <line><token class='keyword'>void</token> sayHello<token class='operator'>(</token>
For further details, have a look at test/test.xsl.