Why Bother?
Posted 2007-01.
That's simple: say your organization has a large library of PDF documents, each with hundreds or thousands of pages and many embedded images. It would be nice to give users the option to preview a table of contents before they download that 50MB file...
If the PDF was created with bookmarks, iTextSharp / iText makes it a trivial three-step task to get an XML document that we can turn into an HTML table of contents:
- Read the PDF file using a PdfReader
- Dump the bookmarks into a collection with SimpleBookmark.GetBookmark()
- Generate an XML version of the bookmarks using SimpleBookmark.ExportToXML()
On to the Code
Simple stuff, as always with inline comments:
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Xml;
    using iTextSharp.text.pdf;

    public string build_toc() {
        using (MemoryStream ms = new MemoryStream()) {
            // the three steps above rolled into one call:
            // XML document generated by iText
            SimpleBookmark.ExportToXML(
                // get all bookmarks
                SimpleBookmark.GetBookmark(
                    // read PDF & optimize memory usage
                    new PdfReader(new RandomAccessFileOrArray("svn-book.pdf"), null)
                ),
                ms,
                "UTF-8",
                false
            );
            // rewind the stream so the XmlReader starts at the beginning
            ms.Position = 0;
            StringBuilder sb = new StringBuilder();
            using (XmlReader xr = XmlReader.Create(ms)) {
                xr.MoveToContent();
                string page = null; // save page number for link
                string text = null; // link text from PDF bookmark
                // see notes below for actual link format
                string format = @"<li><a href='#page={0}'>{1}</a></li>";
                // extract leading page number from the 'Page' attribute
                Regex re = new Regex(@"^\d+");
                while (xr.Read()) {
                    if (xr.NodeType == XmlNodeType.Element
                        && xr.Name == "Title"
                        && xr.IsStartElement()
                    ) {
                        // open a new (HTML) list for this bookmark
                        sb.Append("<ul>");
                        // in a production app, do this in separate steps:
                        // GetAttribute() returns null if the attribute is not
                        // found, which makes Regex.Match throw
                        page = re.Match(xr.GetAttribute("Page")).Captures[0].Value;
                        xr.Read();
                        // hyperlink text
                        if (xr.NodeType == XmlNodeType.Text) {
                            text = xr.Value.Trim();
                            // in a production app, verify page & text
                            // aren't empty before appending
                            sb.AppendFormat(format, page, text);
                        }
                    }
                    // close current (HTML) list
                    if (xr.NodeType == XmlNodeType.EndElement
                        && xr.Name == "Title"
                    ) {
                        sb.Append(@"</ul>");
                    }
                }
                return sb.ToString();
            }
        }
    }
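The inline comments above warn that the page lookup should be split into separate steps in a production app. Here is a minimal sketch of one way to do that. BookmarkPage and TryGetPage are hypothetical names, not part of iTextSharp, and the sketch only assumes that the Page attribute value starts with the page number, which is what the regular expression in the code above also relies on.

```csharp
using System;
using System.Text.RegularExpressions;

static class BookmarkPage
{
    // The leading digits of the attribute value are taken as the page number.
    private static readonly Regex LeadingDigits = new Regex(@"^\d+");

    // Hypothetical helper: returns false instead of throwing when the
    // attribute is missing or does not start with a number.
    public static bool TryGetPage(string pageAttribute, out string page)
    {
        page = null;
        if (string.IsNullOrEmpty(pageAttribute)) return false;

        Match m = LeadingDigits.Match(pageAttribute);
        if (!m.Success) return false;

        page = m.Value;
        return true;
    }
}
```

Inside the read loop this would replace the one-line Regex call, for example: if (BookmarkPage.TryGetPage(xr.GetAttribute("Page"), out page)) { ... }, so that a bookmark without a usable page number is skipped rather than aborting the whole table of contents.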
Notes
- Don't forget to rewind the MemoryStream by setting its Position property to 0; otherwise the XmlReader starts at the end of the stream and has nothing to read.
- Since we're reading large PDF files, we pass a RandomAccessFileOrArray to the PdfReader constructor, as in the code above, to save some memory overhead.
- It should be obvious, but after we get the XML document from iText's SimpleBookmark.ExportToXML(), all we're doing is walking it with a forward-only XmlReader (in the spirit of a SAX parser) to build one big HTML string. The first thing that came to mind was XSLT, but we need to extract the page number from the Title element's Page attribute to create the link; take a look at the XML document (again, generated by iText's ExportToXML()) to see why.
- The example PDF document used is the Subversion book. The resultant table of contents is here. The hyperlinks are INTENTIONALLY NOT FUNCTIONAL; in a real page they should look like:
<a href='PDF_URI#page=PAGE_NO'>Link</a>
(if that doesn't make sense, see Linking to Pages or Destinations Within PDFs)
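To make the links functional, the format string only needs the PDF's URI in front of the #page fragment. The sketch below shows one way to build such a list item. TocLink.Build is a hypothetical helper, and the URI passed in is whatever address your site serves the PDF from. It also HTML-encodes the bookmark text, on the assumption that titles may contain characters such as & or <.

```csharp
using System;
using System.Net;

static class TocLink
{
    // Hypothetical helper: builds one list item that opens pdfUri at the given page.
    public static string Build(string pdfUri, string page, string text)
    {
        // Encode both values so quotes, ampersands, and angle brackets
        // cannot break out of the attribute or the link text.
        string href = WebUtility.HtmlEncode(pdfUri + "#page=" + page);
        string label = WebUtility.HtmlEncode(text);
        return string.Format("<li><a href='{0}'>{1}</a></li>", href, label);
    }
}
```

For example, TocLink.Build("svn-book.pdf", "12", "Branching & Merging") returns <li><a href='svn-book.pdf#page=12'>Branching &amp; Merging</a></li>. Whether the viewer actually jumps to the page depends on the browser's PDF plug-in honoring the page parameter.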