public class TextExtractor extends AbstractTextExtractor
Copyright (c) 2020 xsx All Rights Reserved. x-easypdf-pdfbox is licensed under Mulan PSL v2. You can use this software according to the terms and conditions of the Mulan PSL v2. You may obtain a copy of Mulan PSL v2 at: http://license.coscl.org.cn/MulanPSL2 THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY OR FIT FOR A PARTICULAR PURPOSE. See the Mulan PSL v2 for more details.
AbstractTextExtractor.Function<R>
TABLE_PATTERN
document, log
Constructor and Description |
---|
TextExtractor(Document document): Constructor that takes the document as its parameter. |
Modifier and Type | Method and Description |
---|---|
Map<Integer,List<String>> | extractByRegex(String regex, int... pageIndexes): Extracts text using a regular expression. |
Map<Integer,Map<String,String>> | extractByRegionArea(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes): Extracts text from the given regions. |
Map<Integer,Map<String,List<List<String>>>> | extractByTable(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes): Extracts text from tables. |
extractText, processTextByRegex, processTextByRegionArea, processTextByTable
getDocument
public TextExtractor(Document document)
Parameters:
document - the document
public Map<Integer,List<String>> extractByRegex(String regex, int... pageIndexes)
extractByRegex
in class AbstractTextExtractor
Parameters:
regex - the regular expression
pageIndexes - the page indexes
Returns:
key = page index, value = extracted text
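The return shape of extractByRegex (page index mapped to a list of matches) can be illustrated with a minimal sketch built only on java.util.regex. This is not the library's implementation: the class RegexExtractionSketch and the pageTexts list are hypothetical stand-ins for text that has already been pulled from each page, and the behavior when no page indexes are passed is not stated in this page, so the sketch simply returns an empty map in that case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not part of x-easypdf.
class RegexExtractionSketch {

    // pageTexts is an assumption: plain text per page, indexed from 0.
    static Map<Integer, List<String>> extractByRegex(List<String> pageTexts,
                                                     String regex,
                                                     int... pageIndexes) {
        Pattern pattern = Pattern.compile(regex);
        // LinkedHashMap keeps pages in the order they were requested.
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (int index : pageIndexes) {
            List<String> matches = new ArrayList<>();
            Matcher matcher = pattern.matcher(pageTexts.get(index));
            // Collect every non-overlapping match on this page.
            while (matcher.find()) {
                matches.add(matcher.group());
            }
            result.put(index, matches);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
                "Invoice INV-001 and INV-002",
                "No ids here",
                "Refund INV-003");
        // prints {0=[INV-001, INV-002], 2=[INV-003]}
        System.out.println(extractByRegex(pages, "INV-\\d+", 0, 2));
    }
}
```

With the real class, the equivalent call would presumably be made on a TextExtractor built from a Document, passing only the regex and the page indexes.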
public Map<Integer,Map<String,String>> extractByRegionArea(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes)
extractByRegionArea
in class AbstractTextExtractor
Parameters:
wordSeparator - the word separator
regionArea - the regions
pageIndexes - the page indexes
Returns:
key = page index, value = extracted text
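To make the roles of wordSeparator and regionArea concrete, here is a rough sketch for a single page. It rests on several assumptions that this page does not confirm: that Rectangle is java.awt.Rectangle, that the map keys are caller-chosen region names, and that words falling inside a region are joined with the separator. The Word class and the pageWords list are hypothetical; the real extractor works on positions read from the PDF.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of x-easypdf.
class RegionAreaSketch {

    // Hypothetical positioned word (text plus the point where it starts).
    static final class Word {
        final String text;
        final int x;
        final int y;

        Word(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns region name -> text found inside that region, for one page.
    static Map<String, String> extractByRegionArea(String wordSeparator,
                                                   Map<String, Rectangle> regionArea,
                                                   List<Word> pageWords) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle> region : regionArea.entrySet()) {
            List<String> inside = new ArrayList<>();
            for (Word word : pageWords) {
                // Keep only words whose start point lies inside the rectangle.
                if (region.getValue().contains(word.x, word.y)) {
                    inside.add(word.text);
                }
            }
            result.put(region.getKey(), String.join(wordSeparator, inside));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Rectangle> regions = new LinkedHashMap<>();
        regions.put("header", new Rectangle(0, 0, 100, 20));
        regions.put("body", new Rectangle(0, 20, 100, 80));
        List<Word> words = List.of(
                new Word("Hello", 10, 5),
                new Word("World", 40, 5),
                new Word("Text", 10, 50));
        System.out.println(extractByRegionArea(" ", regions, words));
    }
}
```

The documented method returns one such inner map per requested page index.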
public Map<Integer,Map<String,List<List<String>>>> extractByTable(String wordSeparator, Map<String,Rectangle> regionArea, int... pageIndexes)
extractByTable
in class AbstractTextExtractor
Parameters:
wordSeparator - the word separator
regionArea - the regions
pageIndexes - the page indexes
Returns:
key = page index, value = extracted text
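The nested return type of extractByTable can be awkward to consume, so the following sketch walks a value of that type and flattens it into delimited lines. It assumes, without confirmation from this page, that the inner map is keyed by region name and that each List<List<String>> is a list of rows made of cells. The class TableResultSketch and the sample data are hypothetical, and the output is a plain join with no CSV quoting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of x-easypdf.
class TableResultSketch {

    // Produces one line per row: pageIndex,regionName,cell1,cell2,...
    static List<String> toLines(Map<Integer, Map<String, List<List<String>>>> result) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, Map<String, List<List<String>>>> page : result.entrySet()) {
            for (Map.Entry<String, List<List<String>>> region : page.getValue().entrySet()) {
                // Assumption: each inner list is one table row of cell texts.
                for (List<String> row : region.getValue()) {
                    lines.add(page.getKey() + "," + region.getKey() + "," + String.join(",", row));
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample data shaped like the documented return type.
        Map<Integer, Map<String, List<List<String>>>> sample = Map.of(
                0, Map.of("prices", List.of(
                        List.of("Item", "Price"),
                        List.of("Pen", "2.50"))));
        toLines(sample).forEach(System.out::println);
    }
}
```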
Copyright © 2024. All rights reserved.