Extracting JPGs from PDFs

Thursday 6 December 2007

This is nearly 17 years old. Be careful.

I was trying to diagnose a problem with a PDF file we generated yesterday, and suspected that the images were corrupted. To check, I wrote this quick script to extract JPGs from PDF files. It is quick and dirty, written with the absolute minimum understanding of PDF files, which can be quite opaque.

# Extract jpg's from pdf's. Quick and dirty.
import sys

pdf = file(sys.argv[1], "rb").read()

startmark = "\xff\xd8"
startfix = 0
endmark = "\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find("stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find("endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")
    
    istart += startfix
    iend += endfix
    print "JPG %d from %d to %d" % (njpg, istart, iend)
    jpg = pdf[istart:iend]
    jpgfile = file("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()
    
    njpg += 1
    i = iend

This script works for my PDF files. Maybe it doesn’t work for all, I don’t know. PDF files are complex beasts. Your mileage may vary.
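
For anyone who wants to poke at where the approach breaks, here is a sketch of the same scan wrapped in a function, written for Python 3. The name `find_jpgs` and the synthetic test bytes are hypothetical and not part of the original script. The 20-byte search windows are the same guesses the script makes.

```python
def find_jpgs(pdf):
    """Yield (start, end) byte offsets of JPEG data found in PDF streams.

    Same heuristic as the script above: look for a JPEG start marker
    within 20 bytes of a "stream" keyword, then for the end marker
    near the following "endstream".
    """
    startmark = b"\xff\xd8"
    endmark = b"\xff\xd9"
    i = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            return
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            # Not a JPEG stream (or the marker is further away); move on.
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise ValueError("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += len(endmark)  # include the end marker itself
        yield istart, iend
        i = iend


# A made-up fragment that only imitates a PDF image stream.
fake_jpg = b"\xff\xd8" + b"image bytes" + b"\xff\xd9"
fake_pdf = (
    b"5 0 obj\n<< /Filter /DCTDecode >>\nstream\n"
    + fake_jpg
    + b"\nendstream\nendobj\n"
)

for start, end in find_jpgs(fake_pdf):
    print(start, end, fake_pdf[start:end] == fake_jpg)
```

Working on a bytes object rather than a file makes it easy to feed in small hand-built inputs like `fake_pdf` and see which assumptions fail.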

What I’d really like is a tool for exploring inside PDF files, so that I could see exactly what’s going on in there. pyPdf is a start, but only scratches the surface of the kind of stuff I’d like to see...

Comments

[gravatar]
I decided to try and do something similar with pdftools:

http://www.boddie.org.uk/david/Projects/Python/pdftools/

However, the code is also quite long and assumes certain things about the file it is given. It doesn't look like I can post code here, either. :-(
[gravatar]
PDFPeek, The New Obsession
[gravatar]
Nice. I needed to do this just the other day. I tried pdftk, which
normally has all of the pdf tools that I need, but I could not see
how to make it extract the images. I went back to the pdf just
now to test out your script and it worked perfectly.
[gravatar]
PDF = Pretty Darned Fun
JPG = Just PlayinG
[gravatar]
In the end, I ran my code through a syntax highlighter and put it here:

http://chaos.troll.no/~dboddie/Python/pdf_extract_jpegs.html

It even tries to decode images that it thinks are PNGs, but only in a restricted set of circumstances.
[gravatar]
I've not messed with the PDF specification at all, but some of my friends had an assignment to write a PDF parser in C, and complained about it loudly. It was enough to make me never want to try :)
[gravatar]
Hi Ned,

I'm the author of pyPdf. I've often wished for a program to explore inside PDF files myself. But, typically I need it for files that pyPdf can't open, so I've never developed one using pyPdf. ;-)

I think that pyPdf has the necessary core functionality that would be needed for a general purpose PDF library, but it clearly lacks in the implementation of useful functions based upon that core. For example, it can read and decode stream objects, but there's no function that just says "Get me all the images on this page". Someday...

Thanks for the link and exposure.
[gravatar]
Hi and thanks for your code.
I have a question about other graphic formats: how can I find png,gif,tiff etc etc as they have only a "start" magic number, but no end mark?
Bye
[gravatar]
Acrobat 9 has a really nice tool to see the objects inside a PDF - there is an article explaining it at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects
[gravatar]
Cool men. it worked on mine. awesome.
[gravatar]
Could you explain the program ?

I mean, what are those marks ? "\xff\xd8" and "\xff\xd9"

Is a PDF header or JPG header ? an hexdump of a jpg shows something different.

The program works.
[gravatar]
Thanks a lot. I need to extract JPGs from PDFs for a workflow, and your code is really fast.
[gravatar]
Can we convert the first page of a PDF file into a JPEG with ImageMagick, or does ImageMagick depend on Ghostscript for that?
[gravatar]
I think converting a PDF file into a JPEG depends on Ghostscript, but I am not sure. How can we do this with the code above? Is this code useful when we use ImageMagick?
[gravatar]
They are the starting and ending bytes of a JPEG image. There is some slight variation of this so that might explain the appearance of your data. http://en.wikipedia.org/wiki/JPEG#Syntax_and_structure
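
A quick way to see those markers for yourself, as a sketch: the helper name `looks_like_jpeg` is made up for illustration, and the sample bytes are not a real image. A real file may carry extra bytes after the end marker, which this simple check would reject.

```python
def looks_like_jpeg(data):
    """Return True if data starts with the JPEG start-of-image marker
    (FF D8) and ends with the end-of-image marker (FF D9)."""
    return data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")


# Made-up bytes standing in for the contents of a real file.
sample = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9"

# Show the first and last two bytes in hex, as a hexdump would.
print(sample[:2].hex(), sample[-2:].hex())
print(looks_like_jpeg(sample))
```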
[gravatar]
good good good, I can now extract images from pdf because of you, thank for sharing man!
[gravatar]
WoW!!! Holy cow, I just put that script together and ran it against a pdf and it was amazing how quickly it worked. Thank you so much.
[gravatar]
To get this working with Python 3, I had to change the file() functions to open() and change all the pdf.find strings to bytes objects by putting b on the front of each literal, for example b"stream".
[gravatar]
Great Stuff! Thanks, Peter
[gravatar]
I tried using this code to extract JPEGs from a PDF. It extracts the images, but it inverts the colors of the document and makes them black and white.
[gravatar]
Hi all,

I have a scanned document in PDF format and I tried the suggested code. Unfortunately it did not work. I also tried the suggestion of changing the strings to bytes, with no luck there either. Is there anything else I should be aware of with newer versions of Python?

Thank you in advance,
Vanya
[gravatar]
Works well. To work with Python 3, change every string literal, including the arguments to find, to bytes: startmark = b'\xff\xd8' and istream = pdf.find(b'stream', i). Don't forget to change file() to open() and to add parentheses to print.
[gravatar]
I'm not a developer, i always use this free online pdf image extractor(http://www.online-code.net/pdf-to-image.html)
[gravatar]
Just ever tried an online PDF to image converter to extract jpeg images from PDF file.
Check out: http://www.pqscan.com/pdf-to-image/.
[gravatar]
Python3 code:
# coding=utf-8
# Extract jpg's from pdf's. Quick and dirty.

import sys

with open(sys.argv[1], "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
[gravatar]
Awesome ... thanks! This code works for me as well!
Thank you for sharing!
[gravatar]
Max A H Hartvigsen 2:44 AM on 6 Jun 2017
Been trying to use this script on my test.pdf file. However I get the following error which stops me in my tracks.

, line 9, in
with open(sys.argv[1], "test.pdf") as file:
IndexError: list index out of range

I suspect I'm not getting something fundamental here. Can anybody help?

import sys

with open(sys.argv[1], "test.pdf") as file:
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
[gravatar]
Max A H Hartvigsen 12:56 AM on 7 Jun 2017
I got it now. It turns out I did not understand sys.argv[1] (which is pretty basic stuff). Anyway, I've learned something new. Great script, it worked perfectly on my PDFs.
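
For anyone else hitting the same IndexError: `sys.argv[1]` is the first argument given on the command line, and the second argument to `open()` is the mode, not a file name. Here is a small sketch of a friendlier check. The function name `pdf_path_from_args` and the script name `extract_jpgs.py` are placeholders, not anything from the original post.

```python
def pdf_path_from_args(argv):
    """Return the PDF path from an argument list, or exit with a hint.

    argv[0] is the script name, so the first real argument is argv[1].
    """
    if len(argv) < 2:
        raise SystemExit("usage: python extract_jpgs.py somefile.pdf")
    return argv[1]


# Example with a hypothetical argument list, as if the script were run as:
#   python extract_jpgs.py test.pdf
print(pdf_path_from_args(["extract_jpgs.py", "test.pdf"]))

# In the script itself this would be used as:
#   import sys
#   path = pdf_path_from_args(sys.argv)
#   with open(path, "rb") as f:   # "rb" is the mode: read raw bytes
#       pdf = f.read()
```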
[gravatar]
Try PDFXplorer for exploring PDF structure. It works, as opposed to all other tools I tried.
[gravatar]
In case if we need to find out other formats like png, jpg. How do we do that? Please help me with the code on hayatt.143@gmail.com
[gravatar]
I am trying to implement the Python3 code posted here. The code executes without errors, however I do not see any generated output files. How could I solve that?
[gravatar]
All images should be named jpg0.jpg, jpg1.jpg, and so on, in the directory you ran the script from. If there are none, the script didn't find any JPEG images in the PDF.
[gravatar]
Could anyone help me, or send me sample running code or a solution?
I'm getting the following exception:

pdf = file(sys.argv[1], "document-page1.pdf").read()
IndexError: list index out of range

import sys

pdf = file(sys.argv[1], "document-page1.pdf").read()

startmark = "\xff\xd8"
startfix = 0
endmark = "\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find("stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find("endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print "JPG %d from %d to %d" % (njpg, istart, iend)
    jpg = pdf[istart:iend]
    jpgfile = file("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()

    njpg += 1
    i = iend
[gravatar]
Hi. I am a blind person and needed a script to save scanned (unstructured) PDFs as JPGs to use with image processing. The options like wand, pdf2jpg, and pdf2image all need ImageMagick or Poppler, and when I build a standalone app it does not bundle the dependencies they need. All I need to do is turn the scanned pages, which are most probably images, into JPGs, so I only need to change the .pdf to .jpg without any processing, because the scans have no text or images to read. If I use PyPDF2, it tries to read the file, and an unstructured scan has nothing to read. How can I apply this to a multi-page scanned PDF to save each page as a JPG? I need this for an accessibility application. Lots of us cannot use online services because we work with company documentation, and in SA people still just scan without making a searchable PDF. Thank you for this script. Once I have a JPG I can process it and run Tesseract on it.
[gravatar]
Hi, I am getting an error when using:
istream = pdf.find("stream", i)

File "class_type.py", line 15, in <module>
istream = pdf.find("stream", i)
TypeError: argument should be integer or bytes-like object, not 'str'

Please help
[gravatar]
It needs a bytes object as the parameter. Solved.
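
A minimal illustration of that error, assuming Python 3 and a made-up bytes value standing in for the file contents read with mode "rb":

```python
# Made-up bytes standing in for a PDF read in binary mode.
data = b"1 0 obj\nstream\n...\nendstream\n"

# Searching bytes for bytes works and returns an offset.
print(data.find(b"stream"))

# Searching bytes for a str raises TypeError in Python 3.
try:
    data.find("stream")
except TypeError as exc:
    print("TypeError:", exc)
```

Prefixing each literal with b, as in b"stream", is enough to fix it.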
[gravatar]
You saved my project!

I see this is years later and I'm using python3.
Only getting the hang of python as the pandemic has cancelled my summer.

But I managed to make some changes that make it run for me.
The file() functions were a problem and the "strings" needed to be b"byte strings"

# Extract jpg's from pdf's. Quick and dirty.
import sys

file_name = 'Tom_Foley 5_11_38_GP.pdf'

##pdf = file(sys.argv[1], "rb").read()
handle = open(file_name, "rb")
pdf = handle.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    jpgfile = open("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()

    njpg += 1
    i = iend
