jPDFText Developer Guide

Contents

Introduction
Getting Started
Extracting Text
Extracting Text Page by Page
Extracting Words as a Vector of Strings
Extracting Words Page by Page
Getting Basic Document Information
Distribution and JAR Files
Javadoc API

Introduction

jPDFText is a Java library that integrates seamlessly into your application or applet to extract words from PDF documents. jPDFText provides the following functions:

  • Load PDF documents from files, network drives, URLs or input streams
  • Get basic information from the PDF document such as title, author, keywords, page count, etc.
  • Extract words from PDF documents as a vector of Strings
  • Extract words page by page

Like all of our libraries, jPDFText is built on top of Qoppa's proprietary format and doesn't require any third-party programs or drivers.

Getting Started

The starting point for using jPDFText is the com.qoppa.pdfText.PDFText class. This class is used to load a PDF document and extract its text. It provides three constructors, which load a PDF file from the file system, a URL or an InputStream. All constructors take an additional parameter, an object that implements IPasswordHandler, which is queried if the PDF file requires a password to open. For PDF files that are not encrypted, this second parameter can be null:

PDFText pdfText = new PDFText (new URL("http://www.qoppa.com/content.pdf"), null);


Extracting Text

Once a PDFText object has been created, the host application simply needs to call the getText method to get the text from the loaded PDF document. The text is returned as a String.

// get text as a String
String text = pdfText.getText();

// print the text
System.out.println(text);
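
Because getText returns an ordinary String, the result can be handled with standard Java I/O. The following is a minimal sketch, not part of jPDFText, that writes extracted text to a UTF-8 file. The class name TextSaver, the method saveText and the temporary output file are hypothetical, and the string literal stands in for the value returned by pdfText.getText().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the jPDFText API.
class TextSaver {

    // Writes the given text to a file as UTF-8, replacing any existing content.
    static void saveText(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder for: String text = pdfText.getText();
        String text = "Sample text extracted from a PDF document.";

        // A temporary file is used here only to keep the sketch self-contained.
        Path target = Files.createTempFile("extracted", ".txt");
        saveText(text, target);
        System.out.println("Saved text to " + target);
    }
}
```

Naming the character set explicitly keeps the output independent of the platform default encoding.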

 

Extracting Text Page by Page

To extract the text page by page, use the getText method that takes a page number as a parameter. You can get the number of pages from the PDFText object through the getPageCount method.

// get page count
int pageCount = pdfText.getPageCount();

for(int count = 0; count < pageCount; count++)
{
    // get text for page (count+1)
    String text = pdfText.getText(count+1);

    // print the text
    System.out.println(" PAGE " + (count + 1));
    System.out.println(text);
}
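
A per-page loop like the one above is useful for locating content. As a rough sketch, the helper below takes the text of each page, collected in order, and returns the 1-based numbers of the pages that contain a search term, ignoring case. PageSearch and pagesContaining are hypothetical names, not part of jPDFText, and the list in main stands in for the strings returned by getText for each page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the jPDFText API.
class PageSearch {

    // Returns the 1-based numbers of the pages whose text contains the term,
    // ignoring case. pageTexts.get(0) is taken to be the text of page 1.
    static List<Integer> pagesContaining(List<String> pageTexts, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String pageText = pageTexts.get(i);
            if (pageText != null && pageText.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(i + 1);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Placeholder for text collected page by page from a PDFText object.
        List<String> pageTexts = Arrays.asList(
                "Invoice summary", "Terms and conditions", "Invoice details");
        System.out.println(pagesContaining(pageTexts, "invoice")); // prints [1, 3]
    }
}
```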


Extracting Words as a Vector of Strings

Once a PDFText object has been created, the host application simply needs to call the getWords method to get the list of words from the loaded PDF document.

// get list of words as a vector of strings
Vector words = pdfText.getWords();

// loop through the words and print them
for(int count = 0; count < words.size(); count++)
{
    System.out.println("[" + words.get(count) + "] ");
}
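
Once the words are available as a list, they can be processed with the standard collections classes. The sketch below, offered only as an illustration, counts how often each word occurs, ignoring case. WordFrequency and countWords are hypothetical names, not part of jPDFText, and the list in main stands in for the Vector returned by getWords.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the jPDFText API.
class WordFrequency {

    // Counts how often each word occurs, ignoring case. Elements are converted
    // with String.valueOf so that an untyped Vector can be passed in as well.
    static Map<String, Integer> countWords(List<?> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Object word : words) {
            String key = String.valueOf(word).toLowerCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder for: Vector words = pdfText.getWords();
        List<String> words = Arrays.asList("PDF", "text", "pdf", "library");
        System.out.println(countWords(words)); // prints {library=1, pdf=2, text=1}
    }
}
```

A TreeMap is used so that the result is printed in alphabetical order; a HashMap would work equally well when ordering does not matter.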

Extracting Words Page by Page

To extract words page by page, use the getWords method that takes a page number as a parameter. You can get the number of pages from the PDFText object through the getPageCount method.

// get page count
int pageCount = pdfText.getPageCount();

for(int count = 0; count < pageCount; count++)
{
    // get words for page (count+1)
    Vector words = pdfText.getWords(count+1);

    // loop through the words and print them
    System.out.println(" PAGE " + (count + 1));

    for(int wordCount = 0; wordCount < words.size(); wordCount++)
    {
        System.out.println("[" + words.get(wordCount) + "] ");
    }
}


Getting Basic Information about the PDF Document (Title, Author, etc.)

To get basic information about the loaded PDF document, get the DocumentInfo object through PDFText.getDocumentInfo. From this object, you can read information about the document such as title, author, subject, keywords, etc.

System.out.println(pdfText.getDocumentInfo().getTitle());
System.out.println(pdfText.getDocumentInfo().getAuthor());
System.out.println(pdfText.getDocumentInfo().getKeywords());
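
Title, author and similar fields are optional in a PDF document, so an application that displays them may want a fallback. This guide does not say whether jPDFText returns null or an empty string for a missing field, so the sketch below handles both as an assumption. InfoFormatter and orFallback are hypothetical names, not part of jPDFText, and the variables in main stand in for values returned by getTitle and getAuthor.

```java
// Hypothetical helper, not part of the jPDFText API.
class InfoFormatter {

    // Returns the value, or the fallback when the value is null or blank.
    static String orFallback(String value, String fallback) {
        if (value == null || value.trim().isEmpty()) {
            return fallback;
        }
        return value;
    }

    public static void main(String[] args) {
        // Placeholders for pdfText.getDocumentInfo().getTitle() and getAuthor().
        String title = "Annual Report";
        String author = null;

        System.out.println("Title:  " + orFallback(title, "(untitled)"));
        System.out.println("Author: " + orFallback(author, "(unknown)"));
    }
}
```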


Distribution and JAR Files

jPDFText is packaged in a single jar file, jPDFText.jar, which is installed with the evaluation sample. When distributing an application that uses jPDFText, distribute jPDFText.jar along with it and include the jar in the class path when running the application.
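
If the jar is left off the class path, the problem typically surfaces at run time as a ClassNotFoundException or NoClassDefFoundError. As a small sketch, an application could check at startup whether the library class is visible before trying to load a document. ClasspathCheck and isAvailable are hypothetical names, not part of jPDFText; the class name passed in is the one given in the Getting Started section.

```java
// Hypothetical startup check, not part of the jPDFText API.
class ClasspathCheck {

    // Returns true if the named class can be found on the class path.
    // The class is located but not initialized.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isAvailable("com.qoppa.pdfText.PDFText")) {
            System.out.println("jPDFText found on the class path.");
        } else {
            System.out.println("jPDFText.jar is missing from the class path.");
        }
    }
}
```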

 
 


Copyright © 2002-Present Qoppa Software. All rights reserved.

Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.