Question:

I have thousands of PDF documents that are 11-15 MB each. When I parse one, my program fails with an error saying that the document contains more than 100,000 characters.

Error output:

    Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException:
    Your document contained more than 100000 characters, and so your
    requested limit has been reached. To receive the full text of the
    document, increase your limit.

How can I increase the limit to 10-15 MB?

I found a solution that uses the Tika facade class, but I could not find a way to integrate it with my code:

    Tika tika = new Tika();
    tika.setMaxStringLength(10*1024*1024);

Here is my code:

    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    String location = "C:\\Users\\Laptop\\Dropbox\\MainTextbookTrappe2ndEd.pdf";
    FileInputStream inputstream = new FileInputStream(location);
    ParseContext pcontext = new ParseContext();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(inputstream, handler, metadata, pcontext);

I print the result with:

    System.out.println("Content of the PDF :" + handler.toString());

Answer:

Use

BodyContentHandler handler = new BodyContentHandler(-1);

to disable the limit. From the Javadoc:

The internal string buffer is bounded at the given number of characters. If this write limit is reached, then a SAXException is thrown.
Parameters: writeLimit - maximum number of characters to include in the string, or -1 to disable the write limit
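
To make the Javadoc wording concrete, here is a minimal, self-contained sketch of how a write-limited SAX handler can behave. It is not Tika's implementation: `BoundedTextHandler` and `keptAfterFeeding` are hypothetical names made up for this illustration, the exception message is my own, and only the JDK's `org.xml.sax` classes are used. As an assumption of this sketch, any negative limit (conventionally -1) is treated as "no limit".

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical stand-in for a write-limited text handler (NOT Tika's code).
 * It only mirrors the documented contract of the writeLimit parameter.
 */
public class BoundedTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit; // maximum characters to keep; negative means no limit

    public BoundedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // No limit, or the new chunk still fits: append everything.
        if (writeLimit < 0 || buffer.length() + length <= writeLimit) {
            buffer.append(ch, start, length);
            return;
        }
        // Otherwise keep only what still fits, then signal that the limit was hit.
        int room = writeLimit - buffer.length();
        buffer.append(ch, start, room);
        throw new SAXException("Write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    /** Feeds inputLength characters to a fresh handler and returns how many were kept. */
    public static int keptAfterFeeding(int writeLimit, int inputLength) {
        char[] chars = new char[inputLength];
        Arrays.fill(chars, 'x');
        BoundedTextHandler handler = new BoundedTextHandler(writeLimit);
        try {
            handler.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            System.out.println("Limit reached: " + e.getMessage());
        }
        return handler.toString().length();
    }

    public static void main(String[] args) {
        System.out.println(keptAfterFeeding(100_000, 150_000)); // limited: text is cut off
        System.out.println(keptAfterFeeding(-1, 150_000));      // unlimited: all text kept
    }
}
```

In the question's code the equivalent change is only the constructor call, `new BodyContentHandler(-1)`; the rest of the parsing code stays the same, and the extracted text is read from `handler.toString()` after `parse` returns.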
