Convert JSP pages to JSP documents (JSPX) with Jsp2x
Jsp2X is a command line utility for batch conversion of JSP pages to JSP documents, i.e. JSPs in well-formed XML syntax (aka JSPX, see chapter 5 of the JavaServer Pages™ 1.2 Specification and chapter 6 of the JavaServer Pages™ 2.0 Specification). It is written in Java and incorporates a parser derived from a combined JSP+XHTML grammar using the ANTLR parser generator. It tries very hard to create JSPX output that is portable across engines. Jsp2X was designed to be used in an iterative fashion in which it alerts the user to potential problems in the input.
Introduction
Version 1.2 of the JSP standard introduces the notion of JSP documents, which are simply JSP files in well-formed XML syntax. Files in traditional JSP format, also known as JSP pages, contain a more or less free-form tag soup for which parsers are difficult to write and which are therefore hard to digest in an automated manner. It took a long time until the various JSP engine vendors agreed on what was valid JSP and what wasn't. I usually prefer the Jetty servlet container for testing a web application during development because it starts up quickly, which reduces the time it takes to switch between coding and testing an application. When I later deploy that application to Resin, I am bewildered to see Resin reject the JSPs that worked flawlessly in Jetty. An upgrade to Resin 3.0.23 fixes many discrepancies but I still end up tweaking my JSP pages to make them work in both containers.
JSP documents are well-formed XML. XML has a strict and precise (albeit verbose) syntax. There are plenty of parsers and other tools available for XML. Making your JSP files XML-compliant therefore opens a world of possibilities for further processing. For example, I haven't found a single JSP editor that correctly formats and highlights anything but the simplest pages. With JSP documents these problems have a trivial solution: use your favorite XML editor.
Another annoying trait of JSP pages is that the JSP engine preserves insignificant whitespace. A JSP parser only parses what looks like a JSP tag or a directive even if the text in between is well-formed XML. For that reason it can't detect and remove whitespace that would be considered insignificant by XML or HTML standards. This unnecessarily increases the size of the emitted HTML. The more JSP code is factored out into tag files or included JSP fragments, the more insignificant whitespace is generated and sent to the browser. In JSP documents, on the other hand, it is very easy to detect and drop insignificant whitespace. In fact, if the JSP engine uses an XML parser to read the input, the parser will take care of whitespace on behalf of the engine. To give you a rough idea about the potential savings: after I converted all 70+ JSP pages and tag files of a well-factored 100k SLOC web application to JSP documents, the average size of the HTML output decreased by 50% to 75%!
Taking into account that the template text in most JSP pages is in fact XHTML or HTML, the JSP committee realized that it isn't a very long road from a JSP page to a well-formed XML document. They only had to get rid of the leniency in the JSP parser and come up with alternatives for crazy constructs like <a href="<c:url …>">. This thought process led to the definition of JSP documents in the JSP standard at a time when millions of JSP pages had already been written and deployed. This is where Jsp2X comes in. It is a tool that assists in the conversion of JSP pages to JSP documents, a process that is generally straightforward but tends to be tedious and has the potential to introduce subtle errors when executed by hand.
To understand what Jsp2X does you need to keep in mind that, unlike a JSP engine, Jsp2X parses both the JSP tags and the template text in between those tags. In that respect Jsp2X incorporates a more complex parser than what you'd find in a typical JSP engine (luckily, I had a very powerful and yet easy-to-use tool at hand: ANTLR, a robust LL(*) parser generator). More importantly, Jsp2X can successfully parse the template text in your JSP pages only if it is reasonably correct XHTML. Jsp2X doesn't expect fully well-formed XML template text. It requires that all tags are nested properly and that empty tags are closed correctly. There is no need for a single root element: Jsp2X will create one on the fly if necessary.
Where can I get it?
The latest binary and source distributions can be downloaded from this page. To compile the sources you need Maven version 2.0.7 and JDK 1.6.0_02. Older Maven 2.0 releases >= 2.0.4 may work as well and a recent 1.5 JDK should be fine, too. Jsp2X is released under the LGPL.
The usage of the binary distribution is described in section Usage.
The source code repository is hosted at Google Code.
What exactly does it do?
A conversion of a single JSP page requires a number of different transformations. The following is a hopefully complete list:
- Jsp2X writes the converted input to an output file whose name is derived from the input file. The extension of the output file name is mapped according to what the JSP standard lists as standard extensions for JSP pages/documents, tag files and fragments (also see Usage).
- Jsp2X adds four very short utility tag files to the converted project. They have the jspx: prefix and contain functionality that would otherwise clutter the converted JSP document.
- Jsp2X wraps the JSP page in a <jsp:root> tag.
- Jsp2X wraps JSP fragments into a <jspx:fragment> tag. <jsp:root> tags in fragments are disallowed so I had to come up with another tag that is transparent with respect to the generated output and that can be used to collect the potentially many top-level elements of a fragment underneath a single top-level element (a requirement of XML well-formedness).
- Jsp2X converts all taglib declarations to namespace references on the new root element (<jsp:root> or <jspx:fragment>). Unused taglibs are omitted. Jsp2X even detects taglibs that are declared in a fragment that is included by the JSP page to be converted. JSP page authors often move their taglib declarations to a separate file that is then included at the top of every JSP page.
- Jsp2X escapes special XML characters in the input. Keep in mind that a JSP document is parsed twice, once by the JSP engine's XML parser and once on the client side by the browser's HTML/XHTML parser. If you wanted to display a literal < on a page, it was sufficient to put the HTML entity &lt; into the JSP page because the entity had no special meaning to the JSP parser. A JSP document would have to read &amp;lt; to get the desired effect. The JSP engine's XML parser will substitute &amp; with & such that the browser gets the intended &lt; and renders that as <. Jsp2X does the necessary escaping for you.
- Jsp2X wraps template text in <jsp:text> tags, excluding insignificant whitespace.
- Jsp2X escapes HTML comments and converts JSP comments to XML comments with the intended effect that HTML comments will end up in the output whereas JSP comments do not.
- Jsp2X wraps scriptlets and expressions in <jsp:scriptlet> and <jsp:expression> tags respectively.
- Jsp2X inserts escaped HTML comments into the body of elements with empty bodies to prevent them from being collapsed into empty elements: <td></td> becomes <td><!----></td>. This is definitely noisy but I found no other way to prevent the JSP engine's XML parser from collapsing empty element bodies. One of the goals for Jsp2X was to preserve the intent of a JSP page as much as possible. Luckily, a typical HTML page doesn't contain that many empty elements so the added syntactic noise will be minimal.
- Jsp2X tries to detect and convert dynamic attribute constructs. The detection of these constructs is not bullet-proof because Jsp2X does not have a full-blown EL expression parser. Instead it uses regexes to detect the most common cases. The table below lists the supported cases (with additional whitespace and indentation for clarity).
JSP page | JSP document |
---|---|
<foo x="<bar …>"> … </foo> | <jspx:element name="foo"> <jspx:attribute name="x"><bar …></jspx:attribute> <jspx:body>…</jspx:body> </jspx:element> |
<foo <c:if test="…">x="…"</c:if>> … </foo> | <jspx:element name="foo"> <c:if test="…"> <jspx:attribute name="x">…</jspx:attribute> </c:if> <jspx:body>…</jspx:body> </jspx:element> |
<foo ${condition ? 'x="…"' : ''}> … </foo> | <jspx:element name="foo"> <c:if test="${condition}"> <jspx:attribute name="x">…</jspx:attribute> </c:if> <jspx:body>…</jspx:body> </jspx:element> |
<foo ${condition ? '' : 'x="…"'}> … </foo> | <jspx:element name="foo"> <c:if test="${!(condition)}"> <jspx:attribute name="x">…</jspx:attribute> </c:if> <jspx:body>…</jspx:body> </jspx:element> |
<foo ${condition ? 'x="…"' : 'y="…"'}> … </foo> | <jspx:element name="foo"> <c:choose> <c:when test="${condition}"> <jspx:attribute name="x">…</jspx:attribute> </c:when> <c:otherwise> <jspx:attribute name="y">…</jspx:attribute> </c:otherwise> </c:choose> <jspx:body>…</jspx:body> </jspx:element> |
- Jsp2X rewrites the file extension in references to an included file as long as the included file is also listed as an input file. This is why you should convert all JSP files in a single invocation of Jsp2X. If you don't, Jsp2X will not be able to rewrite references to converted files.
- Jsp2X converts DOCTYPE declarations to <jsp:output> elements.
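The two parsing steps described in the escaping bullet above can be traced with a few lines of Python. This is only an illustration built on the standard library's xml.sax.saxutils module, not Jsp2X's own code (Jsp2X is written in Java); the function name escape_for_jspx is made up for this sketch.

```python
from xml.sax.saxutils import escape, unescape


def escape_for_jspx(template_text: str) -> str:
    """Escape XML special characters so the text survives the engine's XML parser.

    Hypothetical helper for illustration; it is not part of Jsp2X.
    """
    return escape(template_text)


# Template text from a JSP page that wants the browser to render a literal "<".
jsp_page_text = "a &lt; b"

# What the JSP document has to contain instead.
jspx_text = escape_for_jspx(jsp_page_text)
print(jspx_text)  # a &amp;lt; b

# First parse: the JSP engine's XML parser resolves one level of escaping.
sent_to_browser = unescape(jspx_text)
print(sent_to_browser)  # a &lt; b

# Second parse: the browser resolves the remaining entity.
rendered = unescape(sent_to_browser)
print(rendered)  # a < b
```

Skipping the first escaping step would leave the XML parser to consume the entity itself, and the browser would receive a raw < instead of the entity.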
You might notice the use of <jspx:element> and <jspx:attribute> tags where you'd expect JSP's built-in <jsp:element> and <jsp:attribute> tags. The reason is that the built-in mechanism doesn't work for conditional attributes (something I consider a blatant oversight in the standard). For example, <jsp:element …><c:if …><jsp:attribute …>…</jsp:attribute></c:if></jsp:element> doesn't work because the attribute element applies to the <c:if> tag, not the <jsp:element> tag. This is in accordance with the standard, but the standard should have been written to accommodate this very common use case. Jsp2X creates several tag files with custom tags that have similar functionality to <jsp:element>, <jsp:attribute> and <jsp:body> but work for conditional attributes: <jspx:element name="foo"><c:if …><jspx:attribute name="bar">…</jspx:attribute></c:if></jspx:element>.
Another difference is that <jspx:element> distinguishes between empty tags and tags with empty bodies. For example, a JSP document with <jspx:element name="foo"><jsp:body/></jspx:element> will emit <foo></foo>, whereas <jspx:element name="foo"></jspx:element> or <jspx:element name="foo"/> will emit <foo/>. The jsp: variant would have emitted <foo/> in either case. This is XML-compliant but violates HTML (not XHTML), in which <div></div> and <div/> are treated differently. The latter is actually disallowed and its effect differs from browser to browser. Firefox treats it like an opening <div> and implicitly closes it at the end of the parent tag, e.g. <td><div class="a"/><div>foo</div></td> is treated like <td><div class="a"><div>foo</div></div></td>. IE7 simply ignores everything after the <div/>.
The use of Jsp2X's custom <jspx:element> instead of the built-in <jsp:element> assists in creating output that is more likely to preserve the JSP page author's intent. It also enables the use of HTML (albeit a somewhat stricter dialect of it) as opposed to restricting the template text to pure XHTML.
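To make the regex-based detection of conditional attributes more concrete, here is a minimal Python sketch of the simplest table row, an EL ternary whose second branch is empty. The pattern and the function name convert_conditional_attribute are assumptions made for this illustration; Jsp2X's actual regexes are not shown in this article and may well differ.

```python
import re
from typing import Optional

# Deliberately narrow pattern for ${condition ? 'name="value"' : ''}.
# A condition that itself contains "?" is not supported by this sketch.
CONDITIONAL_ATTR = re.compile(
    r"""
    \$\{ \s* (?P<cond>[^?]+?) \s* \? \s*      # ${ condition ?
    '(?P<name>[\w:-]+)="(?P<value>[^"]*)"'    # 'name="value"'
    \s* : \s* '' \s* \}                       # : '' }
    """,
    re.VERBOSE,
)


def convert_conditional_attribute(expr: str) -> Optional[str]:
    """Rewrite an EL ternary attribute as a <c:if> around <jspx:attribute>.

    Returns None when the expression does not match the supported shape.
    """
    match = CONDITIONAL_ATTR.fullmatch(expr.strip())
    if match is None:
        return None
    return (
        f'<c:if test="${{{match["cond"]}}}">'
        f'<jspx:attribute name="{match["name"]}">{match["value"]}</jspx:attribute>'
        f"</c:if>"
    )


print(convert_conditional_attribute("${isBold ? 'class=\"bold\"' : ''}"))
# <c:if test="${isBold}"><jspx:attribute name="class">bold</jspx:attribute></c:if>
```

Anything the pattern does not recognise is returned as None, which mirrors the point made above: without a full EL parser only the most common shapes can be handled.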
Requirements
- mandatory: JDK 5 or higher
- recommended: JSP files named with standardized extensions (.tag, .jsp and .jspf).
- recommended: Access to the complete set of all JSP files that comprise the web application (i.e. everything underneath the WEB-INF directory).
- recommended: The include directives in every input JSP page should use context-relative URIs to refer to other JSP files (as in /WEB-INF/jsp/taglibs.jspf).
Usage
Jsp2X is distributed as an executable JAR file. It is invoked as follows:
# java -jar <path to distribution jar> …
Invoking it with --help shows the command line options.
# java -jar jsp2x-VERSION-bin.jar --help
Usage: Jsp2X [--help] [-c|--clobber] [(-o|--output) <output>] file1 file2 … fileN

Converts JSP pages to JSP documents (well-formed XML files with JSP tags).

  [--help]
        Prints this help message.

  [-c|--clobber]
        Overwrite output files even if they already exist.

  [(-o|--output) <output>]
        The path to the output folder. By default output files and logs are created in the same directory as the input file.

  file1 file2 … fileN
        One or more paths to JSP files. Should not be absolute paths.
Unless you specify --clobber, Jsp2X will never overwrite existing files. For every input file it will create a converted output file and possibly a log file in the same directory as the input file unless the --output switch is specified. With --output <path>, output files are written to a directory structure underneath the directory specified by <path>. The directory structure will mimic the one of the input files and non-existing directories will be created on the fly as required. The base name of the output file will be derived from the input file using the following mapping between standard JSP page extensions and standard JSP document extensions:
Input extension | Output extension |
---|---|
jsp | jspx |
tag | tagx |
jspf | jspx |
If the input file's extension doesn't match any of the ones listed in the above table, Jsp2X will generate the output file name simply by appending .xml to the input file name.
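The naming rules above fit into a few lines of code. The following Python sketch is an illustration under stated assumptions, not Jsp2X's Java implementation: the function name output_path and its parameters are hypothetical, and only the extension table, the .xml fallback and the --output behaviour described above are taken from the text.

```python
from pathlib import PurePosixPath
from typing import Optional

# Mapping between standard JSP page and JSP document extensions (see table above).
EXTENSION_MAP = {".jsp": ".jspx", ".tag": ".tagx", ".jspf": ".jspx"}


def output_path(input_path: str, output_dir: Optional[str] = None) -> str:
    """Derive the output file path for one (relative) input file path."""
    path = PurePosixPath(input_path)  # also normalizes a leading "./"
    new_suffix = EXTENSION_MAP.get(path.suffix)
    if new_suffix is not None:
        converted = path.with_suffix(new_suffix)
    else:
        # Unknown extension: append .xml to the full input file name.
        converted = path.with_name(path.name + ".xml")
    if output_dir is not None:
        # With --output, mirror the input directory structure below output_dir.
        converted = PurePosixPath(output_dir) / converted
    return str(converted)


print(output_path("./WEB-INF/jsp/home.jsp"))             # WEB-INF/jsp/home.jspx
print(output_path("WEB-INF/tags/menu.tag", "temp"))      # temp/WEB-INF/tags/menu.tagx
print(output_path("WEB-INF/jsp/legacy.inc"))             # WEB-INF/jsp/legacy.inc.xml
```

The sketch also hints at why input paths must be relative when --output is used: joining an absolute input path onto the output folder would discard the folder.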
Input file paths should always be relative paths. They must be relative paths if --output is specified. Relative paths may start with './' but they don't need to, e.g. ./foo/bar.jsp is treated as equivalent to foo/bar.jsp. JSP pages may include other JSP fragments. Jsp2X can handle this as long as the value of every include directive's file attribute points to the included file when the current working directory is prepended to it. In other words, you should
- run Jsp2X from within the webapp directory of your source tree (usually src/main/webapp) and
- have your JSP pages use context-relative URIs to refer to the included fragment, e.g. /WEB-INF/jsp/taglibs.jspf.
In all other cases Jsp2X will emit a warning and the conversion result might be incomplete.
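The lookup rule for included files can be pictured as follows. This Python sketch only illustrates the rule stated above (prepend the current working directory to a context-relative path); the function name resolve_include is hypothetical, and raising an exception is a simplification of this sketch, since Jsp2X itself emits a warning.

```python
from pathlib import PurePosixPath


def resolve_include(include_path: str, working_dir: str = ".") -> str:
    """Map a context-relative include path onto the file system.

    working_dir stands for the directory Jsp2X is run from, ideally the
    webapp directory of the source tree.
    """
    if not include_path.startswith("/"):
        # Simplification: the real tool warns instead of failing.
        raise ValueError(f"not a context-relative path: {include_path}")
    return str(PurePosixPath(working_dir) / include_path.lstrip("/"))


# Run from within src/main/webapp, the include resolves relative to ".":
print(resolve_include("/WEB-INF/jsp/taglibs.jspf"))
# WEB-INF/jsp/taglibs.jspf

# The same include seen from the project root:
print(resolve_include("/WEB-INF/jsp/taglibs.jspf", "src/main/webapp"))
# src/main/webapp/WEB-INF/jsp/taglibs.jspf
```

When the tool is started from any other directory, the resolved path points at a file that does not exist, which is the situation the warning is about.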
A typical conversion session might look like this:
# cd src/main/webapp
# find -name "*.tag" -or -name "*.jsp" -or -name | xargs java -jar jsp2x-VERSION-bin.jar --clobber
# cd ../../..
Jsp2X will print the total number of input files and the number of successfully converted input files. You will find as many log files as there are input files for which the conversion was unsuccessful. Read the log files and tweak the input pages or come running to me if you think you found a bug.
When converting the JSP pages in Provider Portal, I used a slightly more elaborate approach that yielded better diffs in SVN. The key to that approach is that I first renamed the JSP pages to their JSP document counterparts in one commit and then replaced the content of each renamed file with its converted form in a second commit. The diff of the second commit lists all modifications made by Jsp2X, allowing you to later go back and see what exactly it did. Here's a transcript of my conversion session (before you copy and paste it, make sure you understand what's going on with all those find commands):
- Convert all JSP files into a separate temporary directory:
  # cd src/main/webapp
  # find -name "*.tag" -or -name "*.jsp" -or -name | xargs java -jar jsp2x-VERSION-bin.jar --clobber --output temp
- Use find to generate a script that renames all JSP files:
  # find \( -name -or -name "*.jspf" \) -and -printf | sed -r "s/jspf?\$/jspx/" | bash
  # find -name "*.tag" -and -printf "svn rename %p %p\\n" | sed -r | bash
  # svn commit -m "…"
- Use find to generate another script that copies the converted files from the temporary directory to the real one:
  # cd temp/WEB-INF
  # find \( -name "*.tagx" -or -name "*.jspx" \) -and -printf | sed s/\\/\\.\\//\\// | bash
  # cd ../..
  # rm -r temp
  # svn commit -m "…"
How it works
Jsp2X is split into four main parts: the parser, the transformer, the dumper and the main class with some glue code for command line and file management. The parser was the hardest to get right because, unlike a true JSP page parser, it can't just scan the template text for JSP constructs. The transformer needs a complete tree structure of the input including the tags in the template text. So the parser has to scan for markup in the template text and JSP constructs at the same time. The input is not just simple markup with elements, attributes and some text. JSP constructs can literally occur anywhere in the document. The parser needs to accept input like this:
<a href="<c:url value="foo"/>" ${isBold ? 'class="bold"' : ''}>
This is an <a> element with an href attribute whose value is a <c:url> tag which has more attributes. Next to the href attribute there is an EL expression with a conditional class attribute. I refer to these constructs as being recursive because tags are allowed within tags (this is different to elements occurring in the body of other elements). Also note the nesting of the quotation marks. As you can see, parsing this is not trivial. Luckily, I had a very powerful tool at hand: ANTLR. Given the grammar of an input language ANTLR generates the Java source code of a class that can parse the input language and turn it into an in-memory tree representing the input. So as long as you can come up with a grammar for the desired input, ANTLR generates a program that parses the input for you. ANTLR can generate source code for Java, C#, C and other languages. It supports complex LL(*) grammars (any context-free language if you know who Chomsky is) in which the decision about which grammar rule to apply can not be made by just looking a constant number of tokens ahead (it uses backtracking in conjunction with memoization to alleviate the exponential cost of backtracking). I am an ANTLR newbie so I expect my JSP grammar to have deficiencies.
The transformer is a simple recursive tree walker that can change, delete and add nodes during the walk. Most of the work is done in the first pass, which also detects and converts the aforementioned recursion in attributes and tags. The second pass combines consecutive PCDATA (i.e. text) nodes and escapes XML entities. The third pass attempts to detect insignificant whitespace. For example, it converts
<td> Foo </td>
to
<td> <jsp:text>Foo</jsp:text> </td>
The difference between the two fragments is that the first one would cause the JSP engine to emit HTML output that includes the whitespace:
<td> Foo </td>
The second fragment on the other hand would emit
<td>Foo</td>
This is because the whitespace around "Foo" became whitespace-only text between tags and can be safely eliminated by the JSP engine. The text child of the <td> element in the first fragment contains both whitespace and non-whitespace. The JSP standard says that in JSP documents only text that exclusively consists of whitespace can be eliminated.
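The effect of that third pass on a single text node can be sketched in Python. This is a simplified stand-in for what the transformer does on its tree, assuming a text node is available as a plain string; the function name wrap_text is hypothetical.

```python
import re


def wrap_text(text: str) -> str:
    """Wrap the non-whitespace core of a text node in <jsp:text>.

    Leading and trailing whitespace stays outside the wrapper, where it forms
    whitespace-only text that the JSP engine is allowed to drop.
    """
    # The pattern always matches: optional whitespace, a lazy core, optional whitespace.
    leading, core, trailing = re.fullmatch(
        r"(\s*)(.*?)(\s*)", text, re.DOTALL
    ).groups()
    if not core:
        return text  # whitespace-only text needs no wrapper
    return f"{leading}<jsp:text>{core}</jsp:text>{trailing}"


print("<td>" + wrap_text(" Foo ") + "</td>")
# <td> <jsp:text>Foo</jsp:text> </td>
```

Whitespace inside the core, as in "Foo bar", is kept as-is because it sits between non-whitespace characters and is therefore significant.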
The dumper is a very simple XML serializer. After the transformer has done its work, the tree is basically in XML form and serializing it is a trivial task. ANTLR supports tree parsing to some extent so I used that mechanism for the dumper.
There's not much to say about the main class, except maybe that it uses a neat little command line parser called JSAP.
Attachment | Size |
---|---|
jsp2x-0.9.1-SNAPSHOT-bin.jar | 530.26 KB |