Problem description:

I'm looking to extract texts from PDFs for a data-mining task.

The PDFs I'm looking at contain multiple reports; each report has its own first-level entry in the document's table of contents. There is also a written table of contents at the beginning of the PDF, which gives the page range of each report ("from page - to page").

I'm looking for a way to either:

  • Split the PDF into the individual reports, in order to dump each of those into a .txt file.

  • Dump each section of the PDF into a .txt directly.
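As a rough sketch of the second option, assuming the text of each page is already available as a list and the page ranges have been read off the written table of contents (both are assumptions, and `split_by_ranges` is a hypothetical helper name), the grouping step could look like this:

```python
def split_by_ranges(page_texts, ranges):
    """Group per-page text into one string per report.

    page_texts: list of strings, one per page (index 0 holds page 1).
    ranges: list of (title, first_page, last_page), 1-based and inclusive,
            in the "from page - to page" form of the written table of contents.
    """
    sections = {}
    for title, first_page, last_page in ranges:
        # Convert the 1-based inclusive range into a Python slice.
        sections[title] = "\n".join(page_texts[first_page - 1:last_page])
    return sections


# Made-up data for illustration only.
pages = ["p1", "p2", "p3", "p4", "p5"]
toc = [("Report A", 1, 2), ("Report B", 3, 5)]
sections = split_by_ranges(pages, toc)
```

Each value in `sections` could then be written to its own .txt file.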

So far, I have been able to dump the entire file into a .txt using PDFMiner (Python), as follows:

# Not all imports are needed for this task
import sys
from io import BytesIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams

def myparse(data):
    rsrcmgr = PDFResourceManager()
    # The converter writes UTF-8 encoded bytes into this in-memory buffer.
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    fp = open(data, 'rb')
    # Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    # Read the buffer before closing anything that writes to it.
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

t1 = myparse("part2.pdf")
# The extracted text is bytes, so the output file is opened in binary mode.
text_file = open("part2.txt", "wb")
text_file.write(t1)
text_file.close()

Also, this returns the entire structure of the table of contents:

# Open a PDF document.
fp = open('solar.pdf', 'rb')
parser = PDFParser(fp)
password = ""
document = PDFDocument(parser, password)
# Get the outlines of the document.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print(level, title, a)
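To narrow that listing down to the first outline level only, a filter along these lines might work. This is a sketch: `first_level_entries` is a hypothetical helper, the sample tuples are made up, and it assumes top-level entries are reported with level 1.

```python
def first_level_entries(outlines):
    """Return the titles of all level-1 outline entries, in document order."""
    return [title for (level, title, dest, a, se) in outlines if level == 1]


# Stand-in tuples shaped like the ones printed above (values are made up).
sample = [
    (1, "Report A", None, None, None),
    (2, "Summary", None, None, None),
    (1, "Report B", None, None, None),
]
titles = first_level_entries(sample)
```

With a real document, `document.get_outlines()` would be passed in place of `sample`.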

Any idea how to proceed from here? Tools using Python, R or bash would be easiest for me personally, but any solution that enables batch splitting based on the first outline level of the document would be great.

Thank you,

Matthias

Answer:

I've found a straightforward solution for this using sejda-console:

import os
import subprocess

pdfname = "example.pdf"

# One output directory per input PDF
outdir = "C:\\out\\%s" % pdfname
if not os.path.exists(outdir):
    os.makedirs(outdir)

sejda = 'C:\\sejda\\bin\\sejda-console.bat'
# Build the command line (named cmd so it does not shadow subprocess.call)
cmd = sejda
cmd += ' splitbybookmarks'
cmd += ' --bookmarkLevel 1'
cmd += ' -f "%s"' % pdfname
cmd += ' -o "%s"' % outdir
print('\n' + cmd)
subprocess.call(cmd)
print("PDFs have been written to out-directory")

Obviously this requires the sejda programme: http://www.sejda.org/
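For batch splitting, the same call could be built once per file. The following is only a sketch: `build_sejda_command` is a hypothetical helper, the file names are placeholders, and it uses only the task name and flags from the snippet above, passed as an argument list instead of a single string.

```python
def build_sejda_command(sejda, pdfname, outdir, level=1):
    """Build the sejda-console call as an argument list."""
    return [
        sejda,
        "splitbybookmarks",
        "--bookmarkLevel", str(level),
        "-f", pdfname,
        "-o", outdir,
    ]


# Placeholder input files; one output directory per PDF.
pdfs = ["example.pdf", "part2.pdf"]
commands = [
    build_sejda_command("C:\\sejda\\bin\\sejda-console.bat", name, "C:\\out\\" + name)
    for name in pdfs
]
```

Each entry in `commands` could then be handed to `subprocess.call`, after creating its output directory as shown above.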
