PdfTextExtractor (openpdf 1.2.7 API)

java.lang.Object
- com.lowagie.text.pdf.parser.PdfTextExtractor

public class PdfTextExtractor
extends Object

Extracts text from a PDF file.

Since: 2.1.4

Constructor Summary

Constructors
Constructor and Description
`PdfTextExtractor(PdfReader reader)` Creates a new Text Extractor object, using a `TextAssembler` as the render listener.
`PdfTextExtractor(PdfReader reader, boolean usePdfMarkupElements)` Creates a new Text Extractor object, using a `TextAssembler` as the render listener and optionally using higher-level tags for PDF markup entities.
`PdfTextExtractor(PdfReader reader, TextAssembler renderListener)` Creates a new Text Extractor object.

Method Summary

Modifier and Type	Method and Description
`String`	`getTextFromPage(int page)` Gets the text from a page.
`String`	`getTextFromPage(int page, boolean useContainerMarkup)` Gets the text from a page, optionally with tags for PDF markup container elements.
`void`	`processContent(byte[] contentBytes, PdfDictionary resources, PdfContentStreamHandler handler)` Processes PDF syntax.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - PdfTextExtractor
```
public PdfTextExtractor(PdfReader reader)
```
    Creates a new Text Extractor object, using a TextAssembler as the render listener.
    
    Parameters:
    
    reader - the reader with the PDF
  - PdfTextExtractor
```
public PdfTextExtractor(PdfReader reader,
                        boolean usePdfMarkupElements)
```
    Creates a new Text Extractor object, using a TextAssembler as the render listener.
    
    Parameters:
    
    reader - the reader with the PDF
    
    usePdfMarkupElements - whether to use higher-level tags for PDF markup entities
  - PdfTextExtractor
```
public PdfTextExtractor(PdfReader reader,
                        TextAssembler renderListener)
```
    Creates a new Text Extractor object.
    
    Parameters:
    
    reader - the reader with the PDF
    
    renderListener - the render listener that will be used to analyze renderText operations and provide resultant text
- Method Detail
  - getTextFromPage
```
public String getTextFromPage(int page)
                       throws IOException
```
    Gets the text from a page.
    
    Parameters:
    
    page - the 1-based page number
    
    Returns:
    
    a String with the content as plain text (without PDF syntax)
    
    Throws:
    
    IOException - on error
  - getTextFromPage
```
public String getTextFromPage(int page,
                              boolean useContainerMarkup)
                       throws IOException
```
    Gets the text from a page, optionally with tags for PDF markup container elements.
    
    Parameters:
    
    page - the 1-based number of the page to extract
    
    useContainerMarkup - whether to insert tags for PDF markup container elements (the tags are not really HTML at the moment)
    
    Returns:
    
    the extracted text, with tags if requested
    
    Throws:
    
    IOException - on error
  - processContent
```
public void processContent(byte[] contentBytes,
                           PdfDictionary resources,
                           PdfContentStreamHandler handler)
```
    Processes PDF syntax.
    
    Parameters:
    
    contentBytes - the bytes of a content stream
    
    resources - the resources that come with the content stream
    
    handler - interprets events caused by recognition of operations in a content stream.
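
Because `getTextFromPage(int page)` takes a 1-based page number, extracting a whole document means looping from 1 up to and including the page count. The following is a minimal sketch of that loop, not part of the OpenPDF API. `PageTextCollector`, `PageTextSource`, and `collect` are hypothetical names introduced here so the example runs without the library on the classpath. The commented lines in `main` show how a real `PdfTextExtractor` might be wired in, and the file name `input.pdf` is a placeholder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class PageTextCollector {

    /** Hypothetical stand-in with the same shape as PdfTextExtractor.getTextFromPage(int). */
    @FunctionalInterface
    public interface PageTextSource {
        String getTextFromPage(int page) throws IOException;
    }

    /**
     * Collects the text of pages 1..pageCount in order.
     * An IOException from the source is rethrown as UncheckedIOException
     * (a choice made for this sketch, not something the library does).
     */
    public static List<String> collect(PageTextSource source, int pageCount) {
        if (pageCount < 0) {
            throw new IllegalArgumentException("pageCount must not be negative: " + pageCount);
        }
        List<String> pages = new ArrayList<>(pageCount);
        // Page numbers are 1-based, so the loop starts at 1 and includes pageCount.
        for (int page = 1; page <= pageCount; page++) {
            try {
                pages.add(source.getTextFromPage(page));
            } catch (IOException e) {
                throw new UncheckedIOException("Failed to extract page " + page, e);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // With OpenPDF on the classpath, the wiring might look like this:
        //   PdfReader reader = new PdfReader("input.pdf");
        //   try {
        //       PdfTextExtractor extractor = new PdfTextExtractor(reader);
        //       List<String> pages = collect(extractor::getTextFromPage, reader.getNumberOfPages());
        //   } finally {
        //       reader.close();
        //   }
        // A stub source stands in here so the sketch runs on its own.
        List<String> pages = collect(page -> "text of page " + page, 3);
        for (String text : pages) {
            System.out.println(text);
        }
    }
}
```

Keeping one list entry per page preserves page boundaries, which a caller can later join or index as needed. Index `i` in the list corresponds to page `i + 1`.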
