Extracting CCITTFaxDecode images from PDF
Posted 2011-11-27
Updated 2011-12-03: Wouldn't you know, the same day I posted this, iTextSharp 5.1.3 was released. The first code example uses the changes in 5.1.3. The second code example uses the version 5.0 - 5.1.2 API (it will also work with 5.1.3).
Wow, it's been a while since I've done anything new on the site; changed jobs in May of this year (2011) and I can't say it's been for the better. The good news is I'm still overseas, but the bad news is that the job is a developer black hole; I'm stuck doing everything but development. Well, enough of my bitching - all things considered I'm VERY grateful to be gainfully employed the way the world economy is at the moment. And I shouldn't have gotten my hopes too high anyway - working for the DOD is always an "adventure", and I should know that by now...
So to get to the point, last week I was browsing the iTextSharp tagged questions on Stack Overflow and found something interesting. Usually I don't bother with questions from new users; it's sad, but pretty obvious, that a good number of those questions come from people looking for a free handout, hoping someone else will do their work. In many other instances a user has a low accept rate and creates a new account in the hope that people won't notice they asked the exact same question a few days ago and ignore them. On the bright side, Stack Overflow doesn't have an answer spamming problem like the ASP.NET site :)
In this case the user was too lazy to do a simple search, and as a result the question was closed by a moderator as a duplicate. The interesting part was that the prior question it linked to claimed that iTextSharp could not extract images from PDFs with a /CCITTFaxDecode filter. I've never needed to do anything like that, but checking the iText mailing list archives, I found that the claim was substantiated here, here, here, here, here, and here.
I can't remember exactly when (again, I haven't had the opportunity to use iText for seven or eight months), but PDF parsing support was added sometime after iText version 5 was released. So I figured there was a good chance it could be done, especially since the examples referenced above all tried to access the raw bytes of the PDF directly via the page content stream instead of using the new API.
The magic to parse and interpret PDF content can be found in the iTextSharp.text.pdf.parser namespace (the com.itextpdf.text.pdf.parser package in Java). Although the development team admits that it's not perfect yet, in most cases you can now extract both text and images.
The basic step-by-step process:
- Create a listener class that implements the IRenderListener interface. Since we're only extracting images, only the RenderImage method needs a real body; the other interface methods can be left empty.
- Create a PdfReader for the PDF document you're trying to parse.
- Create a PdfReaderContentParser from the PdfReader.
- Create an instance of the listener class written in the first step.
- Iterate over all pages in the PDF document; for each page call the PdfReaderContentParser's ProcessContent method, passing the page number and the listener.
Caveat: Only two PDF test cases were used, so you may find other edge-case PDFs out in the wild. With that said, the inline comments should explain the rest:
5.1.3 Code Example
<%@ WebHandler Language="C#" Class="CCITTFaxDecodeExtract513" %>
using System;
using System.Collections.Generic;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class CCITTFaxDecodeExtract513 : IHttpHandler {
  public void ProcessRequest (HttpContext context) {
    HttpServerUtility Server = context.Server;
    HttpResponse Response = context.Response;
    string[] pdfs = {
      "CCITTFaxDecode.pdf", "CCITTFaxDecode-01.pdf"
    };
    foreach (string pdf in pdfs) {
      string file = Server.MapPath("~/app_data/" + pdf);
      PdfReader reader = new PdfReader(file);
      MyImageRenderListener listener = new MyImageRenderListener();
      try {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        // page numbers are 1-based
        for (int i = 1; i <= reader.NumberOfPages; i++) {
          parser.ProcessContent(i, listener);
        }
      }
      finally {
        // release the file handle even if parsing fails
        reader.Close();
      }
      // write every extracted image next to the source PDFs
      for (int i = 0; i < listener.Images.Count; ++i) {
        string path = Server.MapPath("~/app_data/" + listener.ImageNames[i]);
        using (FileStream fs = new FileStream(
          path, FileMode.Create, FileAccess.Write
        ))
        {
          fs.Write(listener.Images[i], 0, listener.Images[i].Length);
        }
      }
    }
  }
  public bool IsReusable { get { return false; } }
/*
 * see: TextRenderInfo & RenderListener classes here:
 * http://api.itextpdf.com/itext/
 *
 * and Google "itextsharp extract images"
 */
  public class MyImageRenderListener : IRenderListener {
    // text callbacks are required by the interface but unused here
    public void RenderText(TextRenderInfo renderInfo) { }
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public List<byte[]> Images = new List<byte[]>();
    public List<string> ImageNames = new List<string>();
    public void RenderImage(ImageRenderInfo renderInfo) {
      try {
        // GetImage() itself can throw, so it has to be called inside the try
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;
        // decode first; only record the name once the bytes are available
        // so the two lists always stay the same length
        byte[] bytes = image.GetImageAsBytes();
        ImageNames.Add(string.Format(
          "Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
        ));
        Images.Add(bytes);
      }
      catch (IOException) {
/*
 * pass-through; image type not supported by iText[Sharp]; e.g. jbig2
 */
      }
    }
  }
}
5.0 - 5.1.2 Code Example
<%@ WebHandler Language='C#' Class='CCITTFaxDecodeExtract' %>
using System;
using System.Collections.Generic;
using System.Drawing.Imaging;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Dotnet = System.Drawing.Image;

public class CCITTFaxDecodeExtract : IHttpHandler {
  public void ProcessRequest (HttpContext context) {
    HttpServerUtility Server = context.Server;
    HttpResponse Response = context.Response;
/*
 * sanity check skipped for this example; you __need__
 * to add something
 */
    string fileParam = context.Request.QueryString[0];
    string file = Server.MapPath(string.Format(
      "~/app_data/{0}", fileParam
    ));
    PdfReader reader = new PdfReader(file);
    MyImageRenderListener listener = new MyImageRenderListener();
    try {
      PdfReaderContentParser parser = new PdfReaderContentParser(reader);
      // page numbers are 1-based
      for (int i = 1; i <= reader.NumberOfPages; i++) {
        parser.ProcessContent(i, listener);
      }
    }
    finally {
      // release the file handle even if parsing fails
      reader.Close();
    }
    // write every extracted image next to the source PDF
    for (int i = 0; i < listener.Images.Count; ++i) {
      string path = Server.MapPath("~/app_data/" + listener.ImageNames[i]);
      using (FileStream fs = new FileStream(
        path, FileMode.Create, FileAccess.Write
      ))
      {
        fs.Write(listener.Images[i], 0, listener.Images[i].Length);
      }
    }
  }
  public bool IsReusable { get { return false; } }
/*
 * image extraction code
 */
  public class MyImageRenderListener : IRenderListener {
    // text callbacks are required by the interface but unused here
    public void RenderText(TextRenderInfo renderInfo) { }
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public List<byte[]> Images = new List<byte[]>();
    public List<string> ImageNames = new List<string>();
    public void RenderImage(ImageRenderInfo renderInfo) {
      PdfImageObject image = renderInfo.GetImage();
      PdfName filter = image.Get(PdfName.FILTER) as PdfName;
/*
 * typically filter will __NOT__ be null for "normal" images like
 * jpg/gif/tiff/png/bmp/other, and also in some PDFs where the
 * dictionary's Filter is /CCITTFaxDecode
 */
      if (filter == null) {
/*
 * this tip comes directly from the mailing list;
 * __sometimes__ the filter entry in the dictionary is an array!
 * http://www.mail-archive.com/itext-questions@lists.sourceforge.net/msg58314.html
 */
        // "as" instead of a cast: a missing or unexpected Filter entry
        // leaves pa null instead of throwing
        PdfArray pa = image.Get(PdfName.FILTER) as PdfArray;
        if (pa != null) {
          for (int i = 0; i < pa.Size; ++i) {
            filter = pa[i] as PdfName;
/*
 * for this example we're making the assumption there's only __one__
 * image of this type on the page
 */
            if (filter != null && PdfName.CCITTFAXDECODE.Equals(filter)) {
              break;
            }
          }
        }
      }
      if (PdfName.CCITTFAXDECODE.Equals(filter)) {
        using (Dotnet dotnetImg = image.GetDrawingImage()) {
          if (dotnetImg != null) {
            using (MemoryStream ms = new MemoryStream()) {
              dotnetImg.Save(ms, ImageFormat.Tiff);
              // record the name only after Save succeeds so the two
              // lists always stay the same length
              ImageNames.Add(string.Format(
                "{0}.tiff", renderInfo.GetRef().Number
              ));
              Images.Add(ms.ToArray());
            }
          }
        }
      }
    }
  }
}
Notes Specific to the 5.0 - 5.1.2 Code Example
As mentioned above, only two PDF test cases were used. As noted in the inline comments, the Filter entry in one PDF was a PdfArray, and in the second PDF it was a PdfName.
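The comment at the top of ProcessRequest in this example says the sanity check on the query-string value was skipped. Here is a minimal sketch of one way to do it, not necessarily the right check for your setup. The PdfFileParam class and its TryGetSafeName method are hypothetical names made up for illustration; they are not part of iTextSharp or ASP.NET. The assumption is that only bare "*.pdf" file names inside app_data should ever be accepted.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of iTextSharp or ASP.NET).
public static class PdfFileParam {
  // Returns true and sets safeName only when raw looks like a bare
  // "*.pdf" file name with no directory component.
  public static bool TryGetSafeName(string raw, out string safeName) {
    safeName = null;
    if (string.IsNullOrEmpty(raw)) return false;
    // reject separators and drive markers explicitly so the check
    // behaves the same regardless of the host OS
    if (raw.IndexOfAny(new char[] { '/', '\\', ':' }) >= 0) return false;
    // reject any other character the OS does not allow in a file name
    if (raw.IndexOfAny(Path.GetInvalidFileNameChars()) >= 0) return false;
    // only PDFs, and there must be something before the extension
    if (!raw.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)) return false;
    if (raw.Length == ".pdf".Length) return false;
    safeName = raw;
    return true;
  }
}
```

In the handler you could call TryGetSafeName on the query-string value before passing it to Server.MapPath, and bail out with an error response when it returns false. Checking that the mapped file actually exists before handing it to PdfReader is a sensible second step.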