Extracting JPGs from PDFs

Thursday 6 December 2007

I was trying to diagnose a problem with a PDF file we generated yesterday, and suspected that the images were corrupted. To check, I wrote this quick script to extract JPGs from PDF files. It is quick and dirty, written with the absolute minimum understanding of PDF files, which can be quite opaque.

# Extract jpg's from pdf's. Quick and dirty.
import sys

pdf = file(sys.argv[1], "rb").read()

startmark = "\xff\xd8"
startfix = 0
endmark = "\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    # Find the next "stream" keyword; stop when there are no more.
    istream = pdf.find("stream", i)
    if istream < 0:
        break
    # A JPG starts within a few bytes of the keyword; otherwise skip ahead.
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find("endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The JPG end mark should sit just before "endstream".
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print "JPG %d from %d to %d" % (njpg, istart, iend)
    jpg = pdf[istart:iend]
    jpgfile = file("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()

    njpg += 1
    i = iend

This script works for my PDF files. Maybe it doesn't work for all, I don't know. PDF files are complex beasts. Your mileage may vary.
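
For experimenting without a real PDF at hand, the same scan can be wrapped in a function that works on bytes already in memory. This is only a minimal sketch, assuming Python 3; the name extract_jpgs and the synthetic test bytes are hypothetical and not part of the original script, and it inherits all of the script's assumptions about how JPGs sit inside streams.

```python
def extract_jpgs(pdf):
    """Return the byte strings inside `pdf` that look like embedded JPGs.

    Hypothetical helper: same heuristic as the script, but it returns
    the data instead of writing files.
    """
    startmark = b"\xff\xd8"  # JPG start mark
    endmark = b"\xff\xd9"    # JPG end mark
    jpgs = []
    i = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        # Only accept a start mark right after the "stream" keyword.
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise ValueError("Didn't find end of stream!")
        # Look for the end mark just before "endstream".
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += len(endmark)
        jpgs.append(pdf[istart:iend])
        i = iend
    return jpgs


# Synthetic bytes standing in for a PDF: one JPG-like stream, one text stream.
fake_jpg = b"\xff\xd8" + b"image data" + b"\xff\xd9"
fake_pdf = (
    b"1 0 obj\nstream\n" + fake_jpg + b"\nendstream\nendobj\n"
    b"2 0 obj\nstream\nplain text\nendstream\nendobj\n"
)
found = extract_jpgs(fake_pdf)
print(len(found), found[0] == fake_jpg)
```

Returning the bytes rather than writing files makes the heuristic easy to poke at interactively; writing each result to disk is then a one-line loop.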

What I'd really like is a tool for exploring inside PDF files, so that I could see exactly what's going on in there. pyPdf is a start, but only scratches the surface of the kind of stuff I'd like to see...


David Boddie 12:58 PM on 6 Dec 2007

I decided to try and do something similar with pdftools:


However, the code is also quite long and assumes certain things about the file it is given. It doesn't look like I can post code here, either. :-(

andrew 3:41 PM on 6 Dec 2007

PDFPeek, The New Obsession

Lee Harr 5:22 PM on 6 Dec 2007

Nice. I needed to do this just the other day. I tried pdftk, which
normally has all of the pdf tools that I need, but I could not see
how to make it extract the images. I went back to the pdf just
now to test out your script and it worked perfectly.

susan 7:27 PM on 6 Dec 2007

PDF = Pretty Darned Fun
JPG = Just PlayinG

David Boddie 8:23 PM on 6 Dec 2007

In the end, I ran my code through a syntax highlighter and put it here:


It even tries to decode images that it thinks are PNGs, but only in a restricted set of circumstances.

Eric Florenzano 1:59 AM on 7 Dec 2007

I've not messed with the PDF specification at all, but some of my friends had an assignment to write a PDF parser in C, and complained about it loudly. It was enough to make me never want to try :)

Mathieu Fenniak 6:08 PM on 7 Dec 2007

Hi Ned,

I'm the author of pyPdf. I've often wished for a program to explore inside PDF files myself. But, typically I need it for files that pyPdf can't open, so I've never developed one using pyPdf. ;-)

I think that pyPdf has the necessary core functionality that would be needed for a general purpose PDF library, but it clearly lacks in the implementation of useful functions based upon that core. For example, it can read and decode stream objects, but there's no function that just says "Get me all the images on this page". Someday...

Thanks for the link and exposure.

abdul 9:12 AM on 3 Feb 2008

Hi and thanks for your code.
I have a question about other graphic formats: how can I find PNG, GIF, TIFF, etc., given that they have only a "start" magic number but no end mark?

Vasudev Ram 10:35 AM on 3 Feb 2010

Saw this post only now, but PDFEdit may work.


mark stephens 9:31 AM on 25 Apr 2010

Acrobat 9 has a really nice tool to see the objects inside a PDF - there is an article explaining it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects

Samuel Komfi 8:22 AM on 6 Oct 2010

Cool, man. It worked on mine. Awesome.

j 1:59 PM on 4 Feb 2011

Could you explain the program?

I mean, what are those marks, "\xff\xd8" and "\xff\xd9"?

Is it a PDF header or a JPG header? A hexdump of a JPG shows something different.

The program works.

Francesc Via 10:44 AM on 9 Jun 2012

Thanks a lot. I need to extract JPGs from PDFs for a workflow, and your code is really fast.

Aparna 6:39 AM on 2 Oct 2012

Can we convert the first page of a PDF file into a JPEG with ImageMagick, or does ImageMagick depend on Ghostscript for that?

Aparna 6:41 AM on 2 Oct 2012

I think converting a PDF file into a JPEG depends on Ghostscript, but I am not sure. How can we do this with the code above? Is this code useful when using ImageMagick?

Roma 12:39 PM on 23 May 2013

They are the starting and ending bytes of a JPEG image. There is some slight variation of this so that might explain the appearance of your data. http://en.wikipedia.org/wiki/JPEG#Syntax_and_structure
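
To make the two marks concrete, here is a minimal sketch, assuming Python 3. The helper name looks_like_jpeg is hypothetical, and the check is only the crude one the script relies on: the data begins with the start-of-image bytes and ends with the end-of-image bytes.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_like_jpeg(data):
    """Crude check: does `data` start with SOI and end with EOI?"""
    return data.startswith(SOI) and data.endswith(EOI)


sample = SOI + b"pretend image payload" + EOI
print(looks_like_jpeg(sample))
print(looks_like_jpeg(b"%PDF-1.4"))
```

Passing this check does not prove the bytes are a valid image; it only shows where the script decides a JPG begins and ends.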

Brice 7:12 PM on 19 May 2014

Good good good, I can now extract images from PDFs because of you. Thanks for sharing, man!

Russell 4:10 AM on 20 Jul 2014

WoW!!! Holy cow, I just put that script together and ran it against a pdf and it was amazing how quickly it worked. Thank you so much.

Chris Jones 7:06 AM on 30 Apr 2015

To get this working with Python 3, I had to change the file() functions to open() and change all the pdf.find strings to bytes objects by putting b on the front of each literal, for example b"stream".
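
The reason for the b prefix can be seen in a few lines. This is a small illustration, assuming Python 3, where data read from a file opened in "rb" mode is a bytes object and its find method refuses a str argument; the sample data is made up.

```python
data = b"1 0 obj\nstream\n"  # made-up bytes, as read from a file opened in "rb" mode

# A bytes needle works and returns the offset of the match.
offset = data.find(b"stream")
print(offset)

# A str needle is rejected with a TypeError in Python 3.
try:
    data.find("stream")
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```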

Peter Muller 10:17 AM on 16 May 2015

Great Stuff! Thanks, Peter

mandy 12:54 PM on 9 Jun 2015

I tried using this code to extract JPEGs from a PDF. It extracts the images, but it reverses the colors of the document and makes them black and white.

Vanya 10:04 PM on 15 Jun 2015

Hi all,

I have a scanned document in PDF format and I tried the suggested code; unfortunately, it did not work. I also tried the suggestion of changing the strings to bytes, with no luck there either. Is there anything else I should be aware of with newer versions of Python?

Thank you in advance,

amit 9:21 PM on 27 Aug 2015

Works well. To work with Python 3, change every string used in find and comparisons to bytes: startmark = b'\xff\xd8' and istream = pdf.find(b'stream', i). Don't forget to change file() to open() and add parentheses to print.

JonyGreen 8:24 AM on 8 Oct 2015

I'm not a developer, so I always use this free online PDF image extractor (http://www.online-code.net/pdf-to-image.html)

Adele 1:29 PM on 9 Nov 2015

I just tried an online PDF-to-image converter to extract JPEG images from a PDF file.
Check out: http://www.pqscan.com/pdf-to-image/.

Ryan 4:36 AM on 10 Aug 2016

Python3 code:

# coding=utf-8
# Extract jpg's from pdf's. Quick and dirty.

import sys

with open(sys.argv[1], "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    # Find the next "stream" keyword; stop when there are no more.
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # A JPG starts within a few bytes of the keyword; otherwise skip ahead.
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The JPG end mark should sit just before "endstream".
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend

Andrzej AJO 11:40 AM on 23 Jan 2017

Created a small windows and linux standalone application, using your code.

Gerhard Scheidl 11:31 AM on 20 Feb 2017

Awesome ... thanks! This code works for me as well!
Thank you for sharing!
