Ned Batchelder : Blog
Extracting JPGs from PDFs
Thursday 6 December 2007

I was trying to diagnose a problem with a PDF file we generated yesterday, and suspected that the images were corrupted. To see, I wrote this quick script to extract JPGs from PDF files. It is quick and dirty, with the absolute minimum understanding of PDF files, which can be quite opaque.

# Extract jpg's from pdf's. Quick and dirty.

This script works for my PDF files. Maybe it doesn't work for all, I don't know. PDF files are complex beasts. Your mileage may vary.

What I'd really like is a tool for exploring inside PDF files, so that I could see exactly what's going on in there. pyPdf is a start, but only scratches the surface of the kind of stuff I'd like to see...
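Only the opening comment of the script appears above. For reference, here is a minimal sketch of how a quick-and-dirty extractor along these lines could work. It is not necessarily the original code. It assumes the JPG data sits unmodified inside a PDF stream, so that the JPEG start marker appears shortly after the `stream` keyword and the end marker appears before `endstream`. The function names, the 20-byte search window, and the output file names are all assumptions made for illustration.

```python
START_MARK = b"\xff\xd8"  # JPEG start-of-image marker
END_MARK = b"\xff\xd9"    # JPEG end-of-image marker
WINDOW = 20               # assumed: how far past "stream" to look for a JPG start


def extract_jpgs(pdf):
    """Return a list of byte strings that look like JPGs embedded in pdf (bytes)."""
    jpgs = []
    i = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        # A JPG stream should begin just after the "stream" keyword.
        istart = pdf.find(START_MARK, istream, istream + WINDOW)
        if istart < 0:
            # Not a JPG stream (or this was the tail of "endstream"); keep scanning.
            i = istream + len(b"stream")
            continue
        iendstream = pdf.find(b"endstream", istart)
        if iendstream < 0:
            raise ValueError("Didn't find end of stream!")
        # Take the last end marker that occurs before "endstream".
        iend = pdf.rfind(END_MARK, istart, iendstream)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += len(END_MARK)
        jpgs.append(pdf[istart:iend])
        i = iendstream + len(b"endstream")
    return jpgs


def save_jpgs(pdf_path):
    """Write each extracted JPG to jpg0.jpg, jpg1.jpg, ... in the current directory."""
    with open(pdf_path, "rb") as f:
        pdf = f.read()
    for n, jpg in enumerate(extract_jpgs(pdf)):
        with open("jpg%d.jpg" % n, "wb") as out:
            out.write(jpg)
```

Calling `save_jpgs("some.pdf")` would write one file per image found. A sketch like this does no real PDF parsing, so it will miss images stored with other encodings and could be fooled by unusual files.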
Comments
I decided to try and do something similar with pdftools:
http://www.boddie.org.uk/david/Projects/Python/pdftools/
However, the code is also quite long and assumes certain things about the file it is given. It doesn't look like I can post code here, either. :-(
PDFPeek, The New Obsession
Nice. I needed to do this just the other day. I tried pdftk, which
normally has all of the pdf tools that I need, but I could not see
how to make it extract the images. I went back to the pdf just
now to test out your script and it worked perfectly.
PDF = Pretty Darned Fun
JPG = Just PlayinG
In the end, I ran my code through a syntax highlighter and put it here:
http://chaos.troll.no/~dboddie/Python/pdf_extract_jpegs.html
It even tries to decode images that it thinks are PNGs, but only in a restricted set of circumstances.
I've not messed with the PDF specification at all, but some of my friends had an assignment to write a PDF parser in C, and complained about it loudly. It was enough to make me never want to try :)
Hi Ned,
I'm the author of pyPdf. I've often wished for a program to explore inside PDF files myself. But, typically I need it for files that pyPdf can't open, so I've never developed one using pyPdf. ;-)
I think that pyPdf has the necessary core functionality that would be needed for a general purpose PDF library, but it clearly lacks in the implementation of useful functions based upon that core. For example, it can read and decode stream objects, but there's no function that just says "Get me all the images on this page". Someday...
Thanks for the link and exposure.
Hi and thanks for your code.
I have a question about other graphic formats: how can I find PNG, GIF, TIFF, etc., as they have only a "start" magic number, but no end mark?
Bye
Saw this post only now, but PDFEdit may work.
http://pdfedit.petricek.net/en/index.html
Acrobat 9 has a really nice tool to see the objects inside a PDF - there is an article explaining it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects
Cool, man. It worked on mine. Awesome.
Could you explain the program?
I mean, what are those marks, "\xff\xd8" and "\xff\xd9"?
Is that a PDF header or a JPG header? A hexdump of a JPG shows something different.
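On the marks: `\xff\xd8` and `\xff\xd9` are the JPEG start-of-image and end-of-image markers, so they belong to the JPG data rather than to the PDF. A hexdump shows the same two bytes written in hex, `ff d8` at the start of the file and normally `ff d9` at the end. A small sketch of that check follows; the helper name and the sample bytes are made up for illustration.

```python
def looks_like_jpg(data):
    """Rough check: starts with the JPEG start marker and ends with the end marker."""
    return data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")


# Made-up bytes standing in for a tiny JPG: start marker, filler, end marker.
sample = b"\xff\xd8\xff\xe0" + b"\x00" * 4 + b"\xff\xd9"

print(looks_like_jpg(sample))       # True
print(sample[:2].hex())             # ffd8, the same bytes a hexdump shows
print(looks_like_jpg(b"%PDF-1.4"))  # False
```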
The program works.
Thanks a lot. I need to extract JPGs from PDFs for a workflow, and your code is really fast.
Can we convert the first page of a PDF file into a JPEG with ImageMagick, or does ImageMagick depend on Ghostscript for that?
I think converting a PDF file into a JPEG depends on Ghostscript, but I am not sure. How can we do this with the above code? Is this code useful when we use ImageMagick?