I was trying to diagnose a problem with a PDF file we generated yesterday, and suspected that the images were corrupted. To find out, I wrote this quick script to extract JPGs from PDF files. It is quick and dirty, written with the absolute minimum understanding of PDF files, which can be quite opaque.
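# Extract jpg's from pdf's. Quick and dirty.
import sys

# Read the whole PDF named on the command line as raw bytes.
pdf = open(sys.argv[1], "rb").read()

startmark = b"\xff\xd8"   # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"     # JPEG end-of-image marker
endfix = 2                # include the two end-marker bytes in the slice
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # Only treat the stream as a JPG if the start marker is right at its head.
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The end marker should sit just before "endstream".
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    jpgfile = open("jpg%d.jpg" % njpg, "wb")
    jpgfile.write(jpg)
    jpgfile.close()

    njpg += 1
    i = iend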
This script works for my PDF files. It may not work for all of them: it only finds a JPG when the start marker shows up within 20 bytes of the word “stream”, and PDF files are complex beasts. Your mileage may vary.
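One way to see how the scan behaves without a real PDF on hand is to wrap the same logic in a function and feed it some made-up bytes. This is only a sketch, assuming Python 3: the function name `extract_jpgs`, the `window` parameter, and the fake PDF bytes are mine for illustration, and the fake data is nowhere near a valid PDF.

```python
def extract_jpgs(pdf, window=20):
    """Return the JPEG byte strings found in raw PDF bytes.

    Same heuristic as the script: a stream counts as a JPG only if the
    start marker appears within `window` bytes of the word "stream".
    """
    startmark = b"\xff\xd8"
    endmark = b"\xff\xd9"
    jpgs = []
    i = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + window)
        if istart < 0:
            # Not a JPG stream (or just the tail of "endstream"); move on.
            i = istream + window
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise ValueError("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - window)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += len(endmark)
        jpgs.append(pdf[istart:iend])
        i = iend
    return jpgs


# Synthetic stand-in for a PDF: one JPG-looking stream and one text stream.
fake_jpg = b"\xff\xd8" + b"image data" + b"\xff\xd9"
fake_pdf = (
    b"1 0 obj\n<< /Filter /DCTDecode >>\nstream\n" + fake_jpg + b"\nendstream\nendobj\n"
    b"2 0 obj\n<< >>\nstream\nplain text\nendstream\nendobj\n"
)
found = extract_jpgs(fake_pdf)
print(len(found), found[0] == fake_jpg)
```

Returning the bytes instead of writing files makes it easy to poke at the edge cases, such as a stream whose image data starts more than `window` bytes after "stream", which this approach skips without complaint.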
What I’d really like is a tool for exploring inside PDF files, so that I could see exactly what’s going on in there. pyPdf is a start, but only scratches the surface of the kind of stuff I’d like to see...