Convert JSP pages to JSP documents (JSPX) with Jsp2X

Jsp2X is a command line utility for batch conversion of JSP pages to JSP documents, i.e. JSPs in well-formed XML syntax (aka JSPX, see chapter 5 of the JavaServer Pages™ 1.2 Specification and chapter 6 of the JavaServer Pages™ 2.0 Specification). It is written in Java and incorporates a parser derived from a combined JSP+XHTML grammar using the ANTLR parser generator. It tries very hard to create JSPX output that is portable across engines. Jsp2X was designed to be used in an iterative fashion in which it alerts the user to potential problems in the input.

Introduction

Version 1.2 of the JSP standard introduced the notion of JSP documents, which are simply JSP files in well-formed XML syntax. Files in traditional JSP format, also known as JSP pages, contain a more or less free-form tag soup for which parsers are difficult to write and which are therefore hard to digest in an automated manner. It took a long time until the various JSP engine vendors agreed on what was valid JSP and what wasn't. I usually prefer the Jetty servlet container for testing a web application during development because it starts up quickly, which reduces the time it takes to switch between coding and testing. When I later deploy that application to Resin, I am bewildered to see Resin reject JSPs that worked flawlessly in Jetty. Upgrading to Resin 3.0.23 fixed many discrepancies but I still end up tweaking my JSP pages to make them work in both containers.

JSP documents are well-formed XML. XML has a strict and precise (albeit verbose) syntax. There are plenty of parsers and other tools available for XML. Making your JSP files XML-compliant therefore opens a world of possibilities for further processing. For example, I haven't found a single JSP editor that correctly formats and highlights anything but the simplest pages. With JSP documents this problem has a trivial solution: use your favorite XML editor.

Another annoying trait of JSP pages is that the JSP engine preserves insignificant whitespace. A JSP parser only parses what looks like a JSP tag or a directive, even if the text in between is well-formed XML. For that reason it can't detect and remove whitespace that would be considered insignificant by XML or HTML standards. This unnecessarily increases the size of the emitted HTML. The more JSP code is factored out into tag files or included JSP fragments, the more insignificant whitespace is generated and sent to the browser. In JSP documents, on the other hand, it is very easy to detect and drop insignificant whitespace. In fact, if the JSP engine uses an XML parser to read the input, the parser will take care of whitespace on behalf of the engine. To give you a rough idea about the potential savings: after I converted all 70+ JSP pages and tag files of a well-factored 100k SLOC web application to JSP documents, the average size of the HTML output decreased by 50% to 75%!

Taking into account that the template text in most JSP pages is in fact XHTML or HTML, the JSP committee realized that it isn't a very long road from a JSP page to a well-formed XML document. They only had to get rid of the leniency in the JSP parser and come up with alternatives for crazy constructs like <a href="<c:url …>"> . This thought process led to the definition of JSP documents in the JSP standard at a time when millions of JSP pages had already been written and deployed. This is where Jsp2X comes in. It is a tool that assists in the conversion of JSP pages to JSP documents, a process that is generally straightforward but tends to be tedious and has the potential to introduce subtle errors when executed by hand.

To understand what Jsp2X does you need to keep in mind that, unlike a JSP engine, Jsp2X parses both the JSP tags and the template text in between those tags. In that respect Jsp2X incorporates a more complex parser than what you'd find in a typical JSP engine (luckily, I had a very powerful and yet easy-to-use tool at hand: ANTLR, a robust LL(*) parser generator). More importantly, Jsp2X can successfully parse the template text in your JSP pages only if it is reasonably correct XHTML. Jsp2X doesn't expect fully well-formed XML template text. It only requires that all tags are nested properly and that empty tags are closed correctly. There is no need for a single root element: Jsp2X will create one on the fly if necessary.

Where can I get it?

The latest binary and source distributions can be downloaded from this page. To compile the sources you need Maven 2.0.7 and JDK 1.6.0_02. Older Maven 2.0 releases >= 2.0.4 may work as well and a recent 1.5 JDK should be fine, too. Jsp2X is released under the LGPL.

The usage of the binary distribution is described in section Usage.

The source code repository is hosted at Google Code.

What exactly does it do?

A conversion of a single JSP page requires a number of different transformations. The following is a hopefully complete list:

Jsp2X writes the converted input to an output file whose name is derived from the input file. The extension of the output file name is mapped according to what the JSP standard lists as standard extensions for JSP pages/documents, tag files and fragments (also see Usage).
Jsp2X adds four very short utility tag files to the converted project. They have the jspx: prefix and contain functionality that would otherwise clutter the converted JSP document.
Jsp2X wraps the JSP page in a <jsp:root> tag.
Jsp2X wraps JSP fragments into a <jspx:fragment> tag. <jsp:root> tags in fragments are disallowed so I had to come up with another tag that is transparent with respect to the generated output and that can be used to collect the potentially many top-level elements of a fragment underneath a single top-level element (a requirement of XML well-formedness).
Jsp2X converts all taglib declarations to namespace references on the new root element (<jsp:root> or <jspx:fragment>). Unused taglibs are omitted. Jsp2X even detects taglibs that are declared in a fragment that is included by the JSP page to be converted. JSP page authors often move their taglib declarations to a separate file that is then included at the top of every JSP page.
Jsp2X escapes special XML characters in the input. Keep in mind that a JSP document is parsed twice, once by the JSP engine's XML parser and once on the client side by the browser's HTML/XHTML parser. If you wanted to display a literal < on a JSP page, it was sufficient to put the HTML entity &lt; into the page because the entity had no special meaning to the JSP parser. A JSP document has to read &amp;lt; to get the same effect: the engine's XML parser substitutes &amp; with & such that the browser gets the intended &lt; and renders it as < . Jsp2X does the necessary escaping for you.
Jsp2X wraps template text in <jsp:text> tags, excluding insignificant whitespace.
Jsp2X escapes HTML comments and converts JSP comments to XML comments with the intended effect that HTML comments will end up in the output whereas JSP comments do not.
Jsp2X wraps scriptlets and expressions in <jsp:scriptlet> and <jsp:expression> tags respectively.
Jsp2X inserts escaped HTML comments into the body of elements with empty bodies to prevent them from being collapsed into empty elements: <td></td> becomes <td>&lt;!----&gt;</td> . This is definitely noisy but I found no other way to prevent the JSP engine's XML parser from collapsing empty element bodies. One of the goals for Jsp2X was to preserve the intent of a JSP page as much as possible. Luckily, a typical HTML page doesn't contain that many empty elements so the added syntactic noise will be minimal.

Jsp2X tries to detect and convert dynamic attribute constructs. The detection of these constructs is not bullet-proof because Jsp2X does not have a full-blown EL expression parser. Instead it uses regexes to detect the most common cases. The table below lists the supported cases (with additional whitespace and indentation for clarity).

JSP page	JSP document
<foo x="<bar …>"> … </foo>	<jspx:element name="foo"> <jspx:attribute name="x"><bar …></jspx:attribute> <jspx:body>…</jspx:body> </jspx:element>
<foo <c:if test="…">x="…"</c:if>> … </foo>	<jspx:element name="foo"> <c:if test="…"> <jspx:attribute name="x">…</jspx:attribute> </c:if> <jspx:body>…</jspx:body> </jspx:element>
<foo ${condition ? 'x="…"' : ''}> … </foo>	<jspx:element name="foo"> <c:if test="${condition}"> <jspx:attribute name="x">…</jspx:attribute> </c:if> <jspx:body>…</jspx:body> </jspx:element>
<foo ${condition ? '' : 'x="…"'}> … </foo>	<jspx:element name="foo"> <c:if test="${!(condition)}"> <jspx:attribute name="x">…</jspx:attribute> </c:if> <jspx:body>…</jspx:body> </jspx:element>
<foo ${condition ? 'x="…"' : 'y="…"'}> … </foo>	<jspx:element name="foo"> <c:choose> <c:when test="${condition}"> <jspx:attribute name="x">…</jspx:attribute> </c:when> <c:otherwise> <jspx:attribute name="y">…</jspx:attribute> </c:otherwise> </c:choose> <jspx:body>…</jspx:body> </jspx:element>
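To give a feel for what regex-based detection of such a construct can look like, here is a minimal Java sketch that recognizes only the "conditional attribute with an empty alternative" form from the table. The class name, the pattern and the shape of the generated markup are assumptions made for illustration; they are not taken from the Jsp2X sources, and the attribute value is copied without any further escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsp2X: detects ${cond ? 'name="value"' : ''}.
final class ConditionalAttributeSketch {

    // Group 1: the condition, group 2: the attribute name, group 3: its value.
    private static final Pattern CONDITIONAL_ATTRIBUTE = Pattern.compile(
        "\\$\\{\\s*(.+?)\\s*\\?\\s*'([\\w:-]+)=\"([^\"]*)\"'\\s*:\\s*''\\s*\\}");

    // Returns the <c:if>/<jspx:attribute> replacement, or null if the
    // expression is not of the supported form.
    static String rewrite(String elExpression) {
        Matcher m = CONDITIONAL_ATTRIBUTE.matcher(elExpression);
        if (!m.matches()) {
            return null;
        }
        return "<c:if test=\"${" + m.group(1) + "}\">"
             + "<jspx:attribute name=\"" + m.group(2) + "\">" + m.group(3)
             + "</jspx:attribute>"
             + "</c:if>";
    }
}
```

A pattern like this is easy to fool, for instance by a condition that itself contains a question mark followed by a quoted string, which illustrates why detection without a real EL parser cannot be bullet-proof.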

Jsp2X rewrites the file extension in references to an included file as long as the included file is also listed as an input file. This is why you should convert all JSP files in a single invocation of Jsp2X. If you don't, Jsp2X will not be able to rewrite references to converted files.
Jsp2X converts DOCTYPE declarations to <jsp:output> elements.
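
The escaping step from the list above boils down to replacing the characters that are special to an XML parser. The following Java sketch shows the idea under the assumption that only &, < and > need to be handled in template text; the class and method names are hypothetical and the actual Jsp2X implementation may differ.

```java
// Hypothetical helper, not part of Jsp2X: escapes template text for use in a JSP document.
final class XmlEscapeSketch {

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break; // must survive the engine's XML parser
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

Applied to an entity such as &lt; this yields &amp;lt; , and applied to an HTML comment it yields an escaped comment that the engine's XML parser passes on to the browser as text.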

You might notice the use of <jspx:element> and <jspx:attribute> tags where you'd expect JSP's built-in <jsp:element> and <jsp:attribute> tags. The reason is that the built-in mechanism doesn't work for conditional attributes (something I consider a blatant oversight in the standard). For example,

<jsp:element …><c:if …><jsp:attribute …>…</jsp:attribute></c:if></jsp:element>

doesn't work because the attribute element applies to the <c:if> tag, not the <jsp:element> tag. It is in accordance with the standard but the standard should have been written to accommodate this very common use case. Jsp2X creates several tag files with custom tags that have similar functionality to <jsp:element> , <jsp:attribute> and <jsp:body> but work for conditional attributes:

<jspx:element name="foo"><c:if …><jspx:attribute name="bar">…</jspx:attribute></c:if></jspx:element> .

Another difference is that <jspx:element> distinguishes between empty tags and tags with empty bodies. For example, a JSP document with

<jspx:element name="foo"><jspx:body/></jspx:element>

will emit <foo></foo> and

<jspx:element name="foo"></jspx:element> or <jspx:element name="foo"/>

will emit <foo/> . The jsp: variant would have emitted <foo/> in either case. This is XML-compliant but violates HTML (not XHTML), in which <div></div> and <div/> are treated differently. The latter is actually disallowed and its effect differs from browser to browser. Firefox treats it like an opening <div> and implicitly closes it at the end of the parent tag, e.g.

<td><div class="a"/><div>foo</div></td> is treated like

<td><div class="a"><div>foo</div></div></td> .

IE7 simply ignores everything after the <div/> .

The use of Jsp2X's custom <jspx:element> instead of the built-in <jsp:element> assists in creating output that is more likely to preserve the JSP page author's intent. It also enables the use of HTML (albeit a somewhat stricter dialect of it) as opposed to restricting the template text to pure XHTML.

Requirements

mandatory: JDK 5 or higher
recommended: JSP files named with standardized extensions ( .tag , .jsp and .jspf ).
recommended: Access to the complete set of all JSP files that comprise the web application (i.e. everything underneath the WEB-INF directory).
recommended: The include directives in every input JSP page should use context-relative URIs to refer to other JSP files (as in /WEB-INF/jsp/taglibs.jspf ).

Usage

Jsp2X is distributed as an executable JAR file. It is invoked as follows:

# java -jar <path to distribution jar> …

Invoking it with --help shows the command line options.

# java -jar jsp2x-VERSION-bin.jar --help
Usage:
Jsp2X [--help] [-c|--clobber] [(-o|--output) <output>] file1 file2 … fileN

Converts JSP pages to JSP documents (well-formed XML files with JSP tags).


[--help]
Prints this help message.

[-c|--clobber]
Overwrite output files even if they already exist.

[(-o|--output) <output>]
The path to the output folder. By default output files and logs are
created in the same directory as the input file.

file1 file2 … fileN
One or more paths to JSP files. Should not be absolute paths.

Unless you specify --clobber , Jsp2X will never overwrite existing files. For every input file it will create a converted output file and possibly a log file in the same directory as the input file unless the --output switch is specified. With --output <path> , output files are written to a directory structure underneath the directory specified by <path>. The directory structure will mimic that of the input files, and non-existing directories will be created on the fly as required. The base name of the output file will be derived from the input file using the following mapping between standard JSP page extensions and standard JSP document extensions:

Input extension	Output extension
jsp	jspx
tag	tagx
jspf	jspx

If the input file's extension doesn't match any of the ones listed in the table above, Jsp2X will generate the output file name simply by appending .xml to the input file name.
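
The mapping described above is small enough to sketch in a few lines of Java. The class and method names below are hypothetical and only illustrate the rule from the table plus the .xml fallback; they are not taken from the Jsp2X sources.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Jsp2X: derives the output file name from the input file name.
final class OutputNameSketch {

    private static final Map<String, String> EXTENSIONS = new HashMap<String, String>();
    static {
        EXTENSIONS.put("jsp", "jspx");
        EXTENSIONS.put("tag", "tagx");
        EXTENSIONS.put("jspf", "jspx");
    }

    static String outputName(String inputName) {
        int dot = inputName.lastIndexOf('.');
        if (dot >= 0) {
            String mapped = EXTENSIONS.get(inputName.substring(dot + 1));
            if (mapped != null) {
                // Known extension: swap it for its JSP document counterpart.
                return inputName.substring(0, dot + 1) + mapped;
            }
        }
        // Unknown or missing extension: append .xml to the full name.
        return inputName + ".xml";
    }
}
```

Note that .jsp and .jspf both map to .jspx , so two inputs that differ only in those extensions would map to the same output name.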

Input file paths should always be relative paths. They must be relative paths if --output is specified. Relative paths may start with './' but they don't need to, e.g. ./foo/bar.jsp is treated as equivalent to foo/bar.jsp . JSP pages may include other JSP fragments. Jsp2X can handle this as long as the value of every include directive's file attribute points to the included file when the current working directory is prepended to it. In other words, you should

run Jsp2X from within the webapp directory of your source tree (usually src/main/webapp ) and
make sure that your JSP pages use context-relative URIs to refer to the included fragments, e.g. /WEB-INF/jsp/taglibs.jspf .

In all other cases Jsp2X will emit a warning and the conversion result might be incomplete.
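
The path rules above can be illustrated with java.nio.file . The sketch below is an assumption about how such rules could be implemented, not a description of Jsp2X's code: the class and method names are hypothetical, and it requires a newer JDK than the one Jsp2X itself targets.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Jsp2X: illustrates the path rules described above.
final class PathSketch {

    // Mirrors a relative input path underneath the output directory.
    // "./foo/bar.jsp" and "foo/bar.jsp" normalize to the same path.
    static Path outputPath(Path outputDir, String input) {
        Path relative = Paths.get(input).normalize();
        if (relative.isAbsolute()) {
            throw new IllegalArgumentException("input path must be relative: " + input);
        }
        return outputDir.resolve(relative);
    }

    // Resolves a context-relative include URI against the working directory,
    // which is expected to be the webapp directory.
    static Path includedFile(Path workingDir, String contextRelativeUri) {
        // Strip the leading slash so that resolve() treats the URI as relative.
        String relative = contextRelativeUri.startsWith("/")
            ? contextRelativeUri.substring(1)
            : contextRelativeUri;
        return workingDir.resolve(relative).normalize();
    }
}
```

With the working directory set to src/main/webapp , the include URI /WEB-INF/jsp/taglibs.jspf resolves to src/main/webapp/WEB-INF/jsp/taglibs.jspf , which is why running the tool from any other directory breaks the lookup of included fragments.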

A typical conversion session might look like this:

# cd src/main/webapp
# find -name "*.tag" -or -name "*.jsp" -or -name "*.jspf" | 
  xargs java -jar jsp2x-VERSION-bin.jar --clobber
# cd ../../..

Jsp2X will print the total number of input files and the number of successfully converted input files. You will find as many log files as there are input files for which the conversion was unsuccessful. Read the log files and tweak the input pages or come running to me if you think you found a bug.

When converting the JSP pages in Provider Portal, I used a slightly more elaborate approach that yielded better diffs in SVN. The key to that approach is that I first renamed the JSP pages to their JSP document counterparts in one commit and then replaced the content of each renamed file with its converted form in a second commit. The diff of the second commit lists all modifications made by Jsp2X, allowing you to later go back and see what exactly it did. Here's a transcript of my conversion session (before you copy and paste it, make sure you understand what's going on with all those find commands):

Convert all JSP files into a separate temporary directory:

# cd src/main/webapp
# find -name "*.tag" -or -name "*.jsp" -or -name "*.jspf" | 
  xargs java -jar jsp2x-VERSION-bin.jar --clobber --output temp

Use find to generate a script that renames all JSP files:

# find \( -name "*.jsp" -or -name "*.jspf" \) -and -printf "svn rename %p %p\\n" | 
  sed -r "s/jspf?\$/jspx/" | bash
# find -name "*.tag" -and -printf "svn rename %p %p\\n" | sed -r "s/tag\$/tagx/" | bash
# svn commit -m "…"

Use find to generate another script that copies the converted files from the temporary directory to the real one:

# cd temp/WEB-INF
# find \( -name "*.tagx" -or -name "*.jspx" \) -and -printf  | sed s/\\/\\.\\//\\// | bash
# cd ../..
# rm -r temp
# svn commit -m "…"

How it works

Jsp2X is split into four main parts: the parser, the transformer, the dumper and the main class with some glue code for command line and file management. The parser was the hardest to get right because, unlike a true JSP page parser, it can't just scan the template text for JSP constructs. The transformer needs a complete tree structure of the input including the tags in the template text. So the parser has to scan for markup in the template text and JSP constructs at the same time. The input is not just simple markup with elements, attributes and some text. JSP constructs can literally occur anywhere in the document. The parser needs to accept input like this:

<a href="<c:url value="foo"/>" ${isBold ? 'class="bold"' : ''}>

This is an <a> element with an href attribute whose value is a <c:url> tag which has more attributes. Next to the href attribute there is an EL expression with a conditional class attribute. I refer to these constructs as being recursive because tags are allowed within tags (this is different from elements occurring in the body of other elements). Also note the nesting of the quotation marks. As you can see, parsing this is not trivial. Luckily, I had a very powerful tool at hand: ANTLR. Given the grammar of an input language, ANTLR generates the Java source code of a class that can parse the input language and turn it into an in-memory tree representing the input. So as long as you can come up with a grammar for the desired input, ANTLR generates a program that parses the input for you. ANTLR can generate source code for Java, C#, C and other languages. It supports complex LL(*) grammars (covering a large class of context-free languages, if you know who Chomsky is) in which the decision about which grammar rule to apply cannot be made by just looking a constant number of tokens ahead (it uses backtracking in conjunction with memoization to alleviate the exponential cost of backtracking). I am an ANTLR newbie so I expect my JSP grammar to have deficiencies.

The transformer is a simple recursive tree walker that can change, delete and add nodes during the walk. It makes three passes over the tree. Most of the work is done in the first pass, which also detects and converts the aforementioned recursion in attributes and tags. The second pass combines consecutive PCDATA (i.e. text) nodes and escapes XML entities. The third pass attempts to detect insignificant whitespace. For example, it converts

<td>
    Foo
</td>

into

<td>
    <jsp:text>Foo</jsp:text>
</td>

The difference between the two fragments is that the first one would cause the JSP engine to emit HTML output that includes the whitespace:

<td>
    Foo
</td>

The second fragment on the other hand would emit

<td>Foo</td>

This is because the whitespace around "Foo" became whitespace-only text between tags and can be safely eliminated by the JSP engine. The text child of the <td> element in the first fragment contains both whitespace and non-whitespace. The JSP standard says that in JSP documents only text that exclusively consists of whitespace can be eliminated.
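
The effect of that rule on a single text node can be sketched as follows. This is a simplified illustration, not the actual third pass: the class and method names are hypothetical, it assumes that all leading and trailing whitespace of a text node is insignificant (which need not hold for inline content), and it leaves XML escaping to a separate step.

```java
// Hypothetical helper, not part of Jsp2X: isolates the non-whitespace core of a text node.
final class WhitespaceSketch {

    static String wrapText(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // Whitespace-only text is left alone; the JSP engine may drop it.
            return text;
        }
        // The core starts at the first non-whitespace character, so the first
        // occurrence of the trimmed string marks where the leading whitespace ends.
        int start = text.indexOf(trimmed);
        String leading = text.substring(0, start);
        String trailing = text.substring(start + trimmed.length());
        // After wrapping, the surrounding whitespace forms whitespace-only text nodes.
        return leading + "<jsp:text>" + trimmed + "</jsp:text>" + trailing;
    }
}
```

Wrapping the core in <jsp:text> turns the text before and after it into whitespace-only text between tags, which is exactly the kind of text the engine is allowed to eliminate.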

The dumper is a very simple XML serializer. After the transformer has done its work, the tree is basically in XML form and serializing it is a trivial task. ANTLR supports tree parsing to some extent so I used that mechanism for the dumper.

There's not much to say about the main class, except maybe that it uses a neat little command line parser called JSAP.

Attachment	Size
jsp2x-0.9.1-SNAPSHOT-bin.jar	530.26 KB
