Question:

Using the pyPdf Python module, how can I read the following PDF file? http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf

# -*- coding: utf-8 -*-
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

The above prints only binary data.

Also, how can I print the contents using the code below?

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))

# Print the title of the document
print "title = %s" % (input1.getDocumentInfo().title)

Answer:

Note that most of the "text" in the PDF document you refer to isn't real text at all: it's mostly images. The actual text seems to be extracted correctly when I try it (although I must admit that, apart from some snippets on the front page and the page numbers, I can't read it ;-)).
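
One way to check which pages actually carry a text layer, rather than only scanned images, is to run the extraction page by page and see where anything comes back. The following is a minimal sketch, not the code I ran: it assumes Python 3 and the modern `pypdf` package (the successor to pyPdf), and the helper names `pages_with_text` and `report` are made up for illustration.

```python
def pages_with_text(pages, min_chars=1):
    """Return the 1-based numbers of pages whose extracted text contains
    at least `min_chars` non-whitespace characters.

    `pages` can be any iterable of objects with an `extract_text()` method,
    for example `PdfReader(path).pages` from pypdf.
    """
    found = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""
        # Drop all whitespace so a page of blank lines does not count as text.
        if len("".join(text.split())) >= min_chars:
            found.append(number)
    return found


def report(path):
    """Print which pages of the PDF at `path` yield extractable text."""
    # Assumption: the modern `pypdf` package is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(path)
    with_text = pages_with_text(reader.pages)
    print("%d of %d pages contain extractable text: %s"
          % (len(with_text), len(reader.pages), with_text))
```

Calling `report("/home/tom/Desktop/Hindi_Book.pdf")` would list the pages that return any text. Pages that consist only of images will not appear in that list, and getting their content would need OCR rather than a PDF text extractor.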

As for the second question: I'm not sure what you're asking there.
