jPDFText Developer Guide

Contents

Introduction
Getting Started
Extracting Text
Extracting Text Page by Page
Extracting Words as a Vector of Strings
Extracting Words Page by Page
Getting Basic Document Information
Distribution and JAR files

Javadoc API
Source Code Samples

Introduction

jPDFText is a Java library that integrates seamlessly into your application or applet to extract words from PDF documents. jPDFText provides the following functions:

  • Load PDF documents from files, network drives, URLs or input streams
  • Get basic information from the PDF document such as title, author, keywords, page count, etc.
  • Extract words from PDF documents as a vector of Strings
  • Extract words page by page

Like all of our libraries, jPDFText is built on top of Qoppa’s proprietary format and doesn’t require any third-party programs or drivers.

Getting Started

The starting point for using jPDFText is the com.qoppa.pdfText.PDFText class. This class is used to load a PDF document and extract its text. The class provides three constructors to load PDF files from the file system, a URL or an InputStream. All constructors take an additional parameter, an object that implements IPasswordHandler, which is queried if the PDF file requires a password to open. For PDF files that are not encrypted, this second parameter can be null:

PDFText pdfText = new PDFText (new URL("http://www.mysite.com/content.pdf"), null);

Extracting Text

Once a PDFText object has been created, the host application simply needs to call the getText method to get the text from the loaded PDF document. The text is returned as a String.

// get text as a String
String text = pdfText.getText();
 
// print the text
System.out.println(text);
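
Because getText returns an ordinary String, standard Java string handling applies to the result. The sketch below is not part of the jPDFText API: the class TextSearch and its method linesContaining are hypothetical names, and it assumes the extracted text separates lines with newline characters. It shows one way to find the lines of extracted text that mention a keyword.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of jPDFText: works on any String,
// such as the value returned by pdfText.getText().
final class TextSearch {

    // Returns the trimmed lines of text that contain keyword, ignoring case.
    static List<String> linesContaining(String text, String keyword) {
        List<String> matches = new ArrayList<String>();
        String needle = keyword.toLowerCase(Locale.ROOT);

        // Split on Unix, Windows, or old Mac line endings.
        for (String line : text.split("\\r?\\n|\\r")) {
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(line.trim());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample text standing in for the result of pdfText.getText().
        String text = "Invoice 2041\nTotal due: 120.00\nThank you";
        for (String line : linesContaining(text, "total")) {
            System.out.println(line);
        }
    }
}
```

In an application, the sample string would be replaced by the text obtained from the PDFText object.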

Extracting Text Page by Page

To extract the text page by page, use the getText method that takes a page number as a parameter. You can get the number of pages from the PDFText object through the getPageCount method.

// get page count
int pageCount = pdfText.getPageCount();
 
for(int count = 0; count < pageCount; count++) {
  // get text for page (count+1)
  String text = pdfText.getText(count+1);
 
  // print the text
  System.out.println(" PAGE " + (count + 1));
  System.out.println(text);
}
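
Per-page text makes it possible to spot pages that yield no text at all, for example pages that contain only scanned images. The following is a minimal sketch under stated assumptions: BlankPageFinder and blankPages are hypothetical names, not part of jPDFText, and the input list is assumed to hold one extracted String per page, collected in page order by a loop like the one above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jPDFText.
final class BlankPageFinder {

    // Returns the 1-based numbers of pages whose text is null or only whitespace.
    static List<Integer> blankPages(List<String> pageTexts) {
        List<Integer> blank = new ArrayList<Integer>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text == null || text.trim().isEmpty()) {
                blank.add(i + 1);
            }
        }
        return blank;
    }

    public static void main(String[] args) {
        // Sample data standing in for text collected page by page.
        List<String> pageTexts = Arrays.asList("Introduction", "   ", "Summary");
        System.out.println("Pages without text: " + blankPages(pageTexts));
    }
}
```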

Extracting Words as a Vector of Strings

Once a PDFText object has been created, the host application simply needs to call the getWords method to get the list of words from the loaded PDF document.

// get list of words as a vector of strings
Vector words = pdfText.getWords();
 
// loop through the words and print them
for(int count = 0; count < words.size(); count++) {
  System.out.println("[" + words.get(count) + "] ");
}
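
A word list lends itself to simple analysis such as counting how often each word occurs. The sketch below is an illustration only: WordFrequency and count are hypothetical names, not part of jPDFText. The method accepts any List, so a Vector like the one above could be passed directly, and it converts each element with String.valueOf rather than assuming a particular element type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of jPDFText.
final class WordFrequency {

    // Counts occurrences of each word, ignoring case; keys are sorted alphabetically.
    static Map<String, Integer> count(List<?> words) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Object word : words) {
            String key = String.valueOf(word).toLowerCase(Locale.ROOT);
            Integer previous = counts.get(key);
            counts.put(key, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample data standing in for the result of pdfText.getWords().
        List<String> words = Arrays.asList("PDF", "text", "pdf", "library");
        for (Map.Entry<String, Integer> entry : count(words).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```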

Extracting Words Page by Page

To extract words page by page, use the getWords method that takes a page number as a parameter. You can get the number of pages from the PDFText object through the getPageCount method.

// get page count
int pageCount = pdfText.getPageCount();
 
for(int count = 0; count < pageCount; count++) {
  // get words for page (count+1)
  Vector words = pdfText.getWords(count+1);
 
  // loop through the words and print them
  System.out.println(" PAGE " + (count + 1));
 
  for(int wordCount = 0; wordCount < words.size(); wordCount++) {
    System.out.println("[" + words.get(wordCount) + "] ");
  }
}
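
Per-page word lists can be turned into a simple index that records on which pages each word appears. This is a sketch under stated assumptions rather than a jPDFText feature: PageIndex and build are hypothetical names, and the input is assumed to be one word list per page, gathered in page order by a loop like the one above. Words are indexed exactly as extracted, with no case folding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of jPDFText.
final class PageIndex {

    // Maps each word to the sorted set of 1-based page numbers it appears on.
    static Map<String, SortedSet<Integer>> build(List<? extends List<?>> pageWords) {
        Map<String, SortedSet<Integer>> index = new TreeMap<String, SortedSet<Integer>>();
        for (int page = 0; page < pageWords.size(); page++) {
            for (Object word : pageWords.get(page)) {
                String key = String.valueOf(word);
                SortedSet<Integer> pages = index.get(key);
                if (pages == null) {
                    pages = new TreeSet<Integer>();
                    index.put(key, pages);
                }
                pages.add(page + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Sample data standing in for the word lists of a two-page document.
        List<List<String>> pageWords = Arrays.asList(
            Arrays.asList("alpha", "beta"),
            Arrays.asList("beta", "gamma"));
        System.out.println(build(pageWords));
    }
}
```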

Getting Basic Information about the PDF Document (Title, Author, etc.)

To get basic information about the loaded PDF document, get the DocumentInfo object returned by PDFText.getDocumentInfo. From this object, you can read properties of the document such as title, author, subject and keywords.

System.out.println(pdfText.getDocumentInfo().getTitle());
System.out.println(pdfText.getDocumentInfo().getAuthor());
System.out.println(pdfText.getDocumentInfo().getKeywords());
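
Many PDF files do not define every metadata field. This guide does not say what the getters return in that case, so the sketch below assumes a value may be null or empty and substitutes a placeholder before printing. MetadataFormat, orPlaceholder and describe are hypothetical names, not part of jPDFText.

```java
// Hypothetical helper, not part of jPDFText.
final class MetadataFormat {

    // Returns the trimmed value, or the placeholder if the value is null or blank.
    static String orPlaceholder(String value, String placeholder) {
        if (value == null || value.trim().isEmpty()) {
            return placeholder;
        }
        return value.trim();
    }

    // Builds a one-line description from a title and an author.
    static String describe(String title, String author) {
        return orPlaceholder(title, "(untitled)")
            + " by " + orPlaceholder(author, "(unknown author)");
    }

    public static void main(String[] args) {
        // Sample values standing in for getTitle() and getAuthor() results.
        System.out.println(describe("Annual Report", null));
    }
}
```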

Distribution and JAR Files

Only the jPDFText.jar file is always required for deployment. The remaining jar files are needed only to work with specific features in some PDF documents:

jPDFText.jar – This is the main jar file for the component. It is always required.

cmaps.jar – This jar file contains CMaps, used to read and display character encodings used with CJK (Chinese, Japanese, Korean) content.
