Thursday 6 December 2007 — This is nearly 17 years old. Be careful.
I was trying to diagnose a problem with a PDF file we generated yesterday, and suspected that the images were corrupted. To see, I wrote this quick script to extract JPGs from PDF files. It is quick and dirty, with the absolute minimum understanding of PDF files, which can be quite opaque.
# Extract jpg's from pdf's. Quick and dirty.
import sys
pdf = file(sys.argv[1], "rb").read()
startmark = "\xff\xd8"
startfix = 0
endmark = "\xff\xd9"
endfix = 2
i = 0
njpg = 0
while True:
    istream = pdf.find("stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find("endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")
    istart += startfix
    iend += endfix
    print "JPG %d from %d to %d" % (njpg, istart, iend)
    jpg = pdf[istart:iend]
    jpgfile = file("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()
    njpg += 1
    i = iend
This script works for my PDF files. Maybe it doesn’t work for all, I don’t know. PDF files are complex beasts. Your mileage may vary.
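One place where mileage can vary: the script looks for the end marker starting 20 bytes before "endstream" and with no upper limit, so if the marker is not in that window the search can run on into a later stream. Here is a minimal Python 3 sketch of the same byte-scanning idea that searches backwards from "endstream" instead. The function name extract_jpegs is hypothetical, chosen only for illustration, and this is still a heuristic over raw bytes, not a real PDF parser.

```python
def extract_jpegs(pdf: bytes) -> list:
    """Return the JPEG byte strings found just inside PDF stream objects."""
    soi = b"\xff\xd8"  # JPEG start-of-image marker
    eoi = b"\xff\xd9"  # JPEG end-of-image marker
    end_keyword = b"endstream"
    jpegs = []
    pos = 0
    while True:
        istream = pdf.find(b"stream", pos)
        if istream < 0:
            break
        # Image data is expected to begin just after the "stream" keyword.
        istart = pdf.find(soi, istream, istream + 20)
        if istart < 0:
            pos = istream + 20
            continue
        iend_stream = pdf.find(end_keyword, istart)
        if iend_stream < 0:
            raise ValueError("Didn't find end of stream!")
        # Search backwards so the match cannot fall outside this stream.
        ieoi = pdf.rfind(eoi, istart, iend_stream)
        if ieoi < 0:
            raise ValueError("Didn't find end of JPG!")
        jpegs.append(pdf[istart:ieoi + len(eoi)])
        pos = iend_stream + len(end_keyword)
    return jpegs
```

Each returned item can be written out with open("jpg%d.jpg" % n, "wb"), as in the script above.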
What I’d really like is a tool for exploring inside PDF files, so that I could see exactly what’s going on in there. pyPdf is a start, but only scratches the surface of the kind of stuff I’d like to see...
Comments
http://www.boddie.org.uk/david/Projects/Python/pdftools/
However, the code is also quite long and assumes certain things about the file it is given. It doesn't look like I can post code here, either. :-(
normally has all of the pdf tools that I need, but I could not see
how to make it extract the images. I went back to the pdf just
now to test out your script and it worked perfectly.
JPG = Just PlayinG
http://chaos.troll.no/~dboddie/Python/pdf_extract_jpegs.html
It even tries to decode images that it thinks are PNGs, but only in a restricted set of circumstances.
I'm the author of pyPdf. I've often wished for a program to explore inside PDF files myself. But, typically I need it for files that pyPdf can't open, so I've never developed one using pyPdf. ;-)
I think that pyPdf has the necessary core functionality that would be needed for a general purpose PDF library, but it clearly lacks in the implementation of useful functions based upon that core. For example, it can read and decode stream objects, but there's no function that just says "Get me all the images on this page". Someday...
Thanks for the link and exposure.
I have a question about other graphic formats: how can I find PNG, GIF, TIFF, etc., since they have only a "start" magic number but no end mark?
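On the question of end marks: a PNG file does end in a recognizable way, with a zero-length IEND chunk followed by its fixed CRC. GIF ends with a single 0x3B byte, which is too common to search for reliably, and TIFF has no end mark at all. Note also that PDFs usually store non-JPEG images as filtered pixel data (for example with FlateDecode) rather than as complete PNG or GIF files, so scanning a PDF this way will often find nothing. With those caveats, here is a minimal sketch for carving PNGs out of an arbitrary byte string. The name find_pngs is hypothetical.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# IEND chunk: four zero length bytes, the type "IEND", then its fixed CRC.
PNG_IEND = b"\x00\x00\x00\x00IEND\xae\x42\x60\x82"


def find_pngs(data: bytes) -> list:
    """Return every span that starts with the PNG signature and ends at IEND."""
    pngs = []
    pos = 0
    while True:
        start = data.find(PNG_SIGNATURE, pos)
        if start < 0:
            break
        end = data.find(PNG_IEND, start)
        if end < 0:
            break  # truncated PNG: no end chunk found
        end += len(PNG_IEND)
        pngs.append(data[start:end])
        pos = end
    return pngs
```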
Bye
http://pdfedit.petricek.net/en/index.html
I mean, what are those marks? "\xff\xd8" and "\xff\xd9"
Is it a PDF header or a JPG header? A hexdump of a JPG shows something different.
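For reference, those two byte pairs belong to JPEG, not to PDF: FF D8 is the JPEG start-of-image (SOI) marker and FF D9 is the end-of-image (EOI) marker. A hexdump of a JPEG file should show ff d8 as its first two bytes. Here is a small sketch for checking a file's bytes. The function names are hypothetical, and some real files carry extra bytes after the EOI marker, so a failed end check is not proof of corruption.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_like_jpeg(data: bytes) -> bool:
    """True if data starts with the SOI marker and ends with the EOI marker."""
    return data.startswith(SOI) and data.endswith(EOI)


def first_bytes_hex(data: bytes, count: int = 2) -> str:
    """Hex text of the first few bytes, for comparison with a hexdump."""
    return data[:count].hex()
```

To check a file on disk, read it with open(path, "rb").read() and pass the result to either function.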
The program works.
I have a scanned document in PDF format and I tried the suggested code, but unfortunately it did not work. I also tried the suggestion of changing the strings to bytes, with no luck there either. Is there anything else I should be aware of with newer versions of Python?
Thank you in advance,
Vanya
Check out: http://www.pqscan.com/pdf-to-image/.
http://pythonapplications.blogspot.com/2017/01/extract-all-jpgs-from-pdf-file.html
Thank you for sharing!
, line 9, in
with open(sys.argv[1], "test.pdf") as file:
IndexError: list index out of range
I suspect I'm not getting something fundamental here.. can anybody help?
import sys
with open(sys.argv[1], "test.pdf") as file:
    pdf = file.read()
startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0
njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")
    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)
    njpg += 1
    i = iend
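The IndexError above most likely means the script was run without a file name on the command line, so sys.argv holds only the script name and has no element 1. Separately, the second argument to open() is the mode, not a file name. For this script it should be "rb". Here is a minimal sketch of a guarded version of that first step. The function name read_pdf_bytes and the script name extract_jpgs.py are hypothetical.

```python
import sys


def read_pdf_bytes(argv):
    """Return the bytes of the PDF named by the first command-line argument."""
    if len(argv) < 2:
        # No file name was given, so argv[1] would raise IndexError.
        raise SystemExit("usage: python %s file.pdf" % argv[0])
    # "rb" is the mode: read, binary. The file name comes from argv[1].
    with open(argv[1], "rb") as pdf_file:
        return pdf_file.read()
```

In the script, this would be called as pdf = read_pdf_bytes(sys.argv), and the script run as: python extract_jpgs.py test.pdf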
I'm getting the following exception:
pdf = file(sys.argv[1], "document-page1.pdf").read()
IndexError: list index out of range
import sys
pdf = file(sys.argv[1], "document-page1.pdf").read()
startmark = "\xff\xd8"
startfix = 0
endmark = "\xff\xd9"
endfix = 2
i = 0
njpg = 0
while True:
    istream = pdf.find("stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find("endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")
    istart += startfix
    iend += endfix
    print "JPG %d from %d to %d" % (njpg, istart, iend)
    jpg = pdf[istart:iend]
    jpgfile = file("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()
    njpg += 1
    i = iend
istream = pdf.find("stream", i)
File "class_type.py", line 15, in
istream = pdf.find("stream", i)
TypeError: argument should be integer or bytes-like object, not 'str'
Please help
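That TypeError is what Python 3 raises when a bytes object is searched with a str argument. A file opened in "rb" mode yields bytes, so every search term needs the b prefix. Here is a minimal sketch of the difference, using made-up sample data and a hypothetical helper name.

```python
# Made-up sample data standing in for the contents of a PDF read in "rb" mode.
data = b"1 0 obj\n<< /Length 4 >>\nstream\nabcd\nendstream\n"


def find_stream(pdf: bytes, start: int = 0) -> int:
    """Return the offset of the next "stream" keyword, or -1 if none."""
    # The search term carries the b prefix so its type matches pdf.
    return pdf.find(b"stream", start)


def find_stream_with_str(pdf: bytes) -> str:
    """Show what happens when the b prefix is missing."""
    try:
        pdf.find("stream")  # str argument on a bytes object
    except TypeError as exc:
        return "TypeError: %s" % exc
    return "no error"
```

The same applies to the start and end markers and to "endstream": all of them need to be bytes literals in Python 3.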
I see this is years later and I'm using python3.
Only getting the hang of python as the pandemic has cancelled my summer.
But I managed to make some changes that make it run for me.
The file() functions were a problem and the "strings" needed to be b"byte strings"
# Extract jpg's from pdf's. Quick and dirty.
import sys
file_name = 'Tom_Foley 5_11_38_GP.pdf'
##pdf = file(sys.argv[1], "rb").read()
handle = open(file_name, "rb")
pdf = handle.read()
startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0
njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")
    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    jpgfile = open("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()
    njpg += 1
    i = iend