A B C D E F G H I K L M N O P Q R S T U V W X 

A

ABSOLUTE_ADDR_LINK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Marker for absolute-path links.
AbsoluteLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
Filter for absolute-path links.
Handles absolute-path links by joining them into full network URLs.
AbsoluteLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.AbsoluteLinkFilter
 
AbstractContentExtractor - Class in com.yishuifengxiao.common.crawler.extractor.content
Abstract content extractor.
AbstractContentExtractor(ExtractRule) - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.AbstractContentExtractor
 
AbstractExtractorFactory - Class in com.yishuifengxiao.common.crawler.extractor
Abstract extractor factory.
AbstractExtractorFactory() - Constructor for class com.yishuifengxiao.common.crawler.extractor.AbstractExtractorFactory
 
addData(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Adds data.
addData(String, Object) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Adds data.
addData(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Adds output data.
addData(String, Object) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Adds output data.
addExtra(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the extra information carried by the crawler.
This call does not clear the existing extra information; the new data is appended to it.
addExtractRule(ExtractRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds a content extraction rule.
addExtractRule(String, ExtractFieldRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds a field extraction rule to the content extraction rule identified by the given code.
addExtractRules(List<ExtractRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds content extraction rules.
addFieldExtractRules(String, List<ExtractFieldRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds field extraction rules to the content extraction rule identified by the given code.
addHeader(HeaderRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds a set of request header parameters.
addHeaders(List<HeaderRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds a set of request header parameters.
addLinkRule(MatcherRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds a link extraction rule.
addLinkRules(Set<MatcherRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Adds link extraction rules.
addLinks(List<String>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Adds new links to the existing collection of link addresses.
AllStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Raw-text extraction strategy.
Returns the input data as-is, without processing it.
AllStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.AllStrategy
 
ANT_MATCH_ALL - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
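Expression that matches everything in Ant mode.
ArrayStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Array slicing strategy.
Splits the input data into an array on a delimiter, then extracts the Nth element of that array.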
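The split-then-pick behaviour described for ArrayStrategy can be sketched in plain Java. This is only an illustration, not the library's implementation: the class name `ArraySplitSketch`, the method `nthElement`, the zero-based index and the empty-string fallback are all assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the crawler library.
class ArraySplitSketch {

    /**
     * Splits the input on a literal delimiter and returns the element at the
     * given zero-based index, or an empty string when the index is out of range.
     */
    static String nthElement(String input, String delimiter, int index) {
        if (input == null || delimiter == null || delimiter.isEmpty()) {
            return "";
        }
        // Pattern.quote makes the delimiter literal; limit -1 keeps trailing empty parts
        String[] parts = input.split(Pattern.quote(delimiter), -1);
        return (index >= 0 && index < parts.length) ? parts[index] : "";
    }

    public static void main(String[] args) {
        // Picks the month out of a dash-separated date
        System.out.println(nthElement("2020-01-15", "-", 1));
    }
}
```

Whether the real strategy counts from 0 or 1, and what it returns for a missing element, should be checked against the ArrayStrategy class documentation.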
ArrayStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ArrayStrategy
 

B

BaseDownloader - Class in com.yishuifengxiao.common.crawler.downloader
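Base class for Selenium downloaders.
All Selenium-based downloaders should preferably extend this base class.
BaseDownloader(String) - Constructor for class com.yishuifengxiao.common.crawler.downloader.BaseDownloader
Constructor.
The path to the geckodriver browser driver file must be supplied.
geckodriver can be downloaded from https://github.com/mozilla/geckodriver/releases
Configure this parameter to suit the runtime environment.
BaseLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter
Abstract link filter.
Converts extracted link addresses into network URL form.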
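Converting an extracted link into a full URL can be illustrated with the JDK alone. The sketch below uses `java.net.URI.resolve`; it does not reproduce the library's filter classes or how they are chained, and the names `LinkResolveSketch` and `toAbsolute` are hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not part of the crawler library.
class LinkResolveSketch {

    /** Resolves an extracted href against the URL of the page it was found on. */
    static String toAbsolute(String pageUrl, String href) {
        // URI.resolve handles absolute-path, relative, protocol-relative and full links
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/docs/index.html";
        System.out.println(toAbsolute(page, "/about"));
        System.out.println(toAbsolute(page, "page2.html"));
        System.out.println(toAbsolute(page, "https://other.org/x"));
    }
}
```

Links that are not valid URIs (for example ones containing spaces) make `URI.create` throw `IllegalArgumentException`, so a real filter would need to reject or escape them first.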
BaseLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.BaseLinkFilter
 
build() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Builds a crawler rule.

C

CACHE_CONTROL - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Web page caching is controlled by the "Cache-Control" HTTP header; common values include private, no-cache, max-age and must-revalidate, and the default is private. Its effect depends on how the page is revisited.
cacheControl() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the page cache policy.
Defaults to max-age=0.
cacheControl(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the page cache policy.
Defaults to max-age=0.
CHARSET - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.NestConstant
The charset in the page's SEO information.
CharsetContentExtractor - Class in com.yishuifengxiao.common.crawler.extractor.content.impl
Charset extractor.
Extracts the charset information from the meta section of the page.
CharsetContentExtractor() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.impl.CharsetContentExtractor
 
CHILD_PATH_FLAG_COUNT - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Marker for links that carry a sub-path.
ChnStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Chinese-text extraction strategy.
Extracts all Chinese text from the input data.
ChnStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ChnStrategy
 
circularRedirectsAllowed() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets whether circular redirects should be allowed.
circularRedirectsAllowed(boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets whether circular redirects should be allowed.
clear() - Method in class com.yishuifengxiao.common.crawler.Crawler
Clears the data.
clear() - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Clears the data.
clear(Task) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.RedisScheduler
 
clear(Task) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.SimpleScheduler
 
clear(Task) - Method in interface com.yishuifengxiao.common.crawler.scheduler.Scheduler
Clears the tasks.
clear(Task) - Method in class com.yishuifengxiao.common.crawler.scheduler.SchedulerDecorator
 
clear() - Static method in class com.yishuifengxiao.common.crawler.utils.LocalCrawler
Clears the cache.
clearLinks() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Clears the link addresses.
close() - Method in class com.yishuifengxiao.common.crawler.downloader.BaseDownloader
 
close() - Method in interface com.yishuifengxiao.common.crawler.downloader.Downloader
Closes the downloader to release resources.
close() - Method in class com.yishuifengxiao.common.crawler.downloader.impl.SimpleDownloader
 
CN_COM_DOMAIN - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Expression for com.cn domains.
com.yishuifengxiao.common.crawler - package com.yishuifengxiao.common.crawler
 
com.yishuifengxiao.common.crawler.cache - package com.yishuifengxiao.common.crawler.cache
 
com.yishuifengxiao.common.crawler.content - package com.yishuifengxiao.common.crawler.content
 
com.yishuifengxiao.common.crawler.content.impl - package com.yishuifengxiao.common.crawler.content.impl
 
com.yishuifengxiao.common.crawler.content.matcher - package com.yishuifengxiao.common.crawler.content.matcher
 
com.yishuifengxiao.common.crawler.domain.constant - package com.yishuifengxiao.common.crawler.domain.constant
 
com.yishuifengxiao.common.crawler.domain.entity - package com.yishuifengxiao.common.crawler.domain.entity
 
com.yishuifengxiao.common.crawler.domain.eunm - package com.yishuifengxiao.common.crawler.domain.eunm
 
com.yishuifengxiao.common.crawler.domain.model - package com.yishuifengxiao.common.crawler.domain.model
 
com.yishuifengxiao.common.crawler.downloader - package com.yishuifengxiao.common.crawler.downloader
 
com.yishuifengxiao.common.crawler.downloader.impl - package com.yishuifengxiao.common.crawler.downloader.impl
 
com.yishuifengxiao.common.crawler.extractor - package com.yishuifengxiao.common.crawler.extractor
 
com.yishuifengxiao.common.crawler.extractor.content - package com.yishuifengxiao.common.crawler.extractor.content
 
com.yishuifengxiao.common.crawler.extractor.content.impl - package com.yishuifengxiao.common.crawler.extractor.content.impl
 
com.yishuifengxiao.common.crawler.extractor.content.strategy - package com.yishuifengxiao.common.crawler.extractor.content.strategy
 
com.yishuifengxiao.common.crawler.extractor.content.strategy.impl - package com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
 
com.yishuifengxiao.common.crawler.extractor.links - package com.yishuifengxiao.common.crawler.extractor.links
 
com.yishuifengxiao.common.crawler.extractor.links.impl - package com.yishuifengxiao.common.crawler.extractor.links.impl
 
com.yishuifengxiao.common.crawler.link - package com.yishuifengxiao.common.crawler.link
 
com.yishuifengxiao.common.crawler.link.filter - package com.yishuifengxiao.common.crawler.link.filter
 
com.yishuifengxiao.common.crawler.link.filter.impl - package com.yishuifengxiao.common.crawler.link.filter.impl
 
com.yishuifengxiao.common.crawler.listener - package com.yishuifengxiao.common.crawler.listener
 
com.yishuifengxiao.common.crawler.macther - package com.yishuifengxiao.common.crawler.macther
 
com.yishuifengxiao.common.crawler.macther.impl - package com.yishuifengxiao.common.crawler.macther.impl
 
com.yishuifengxiao.common.crawler.monitor - package com.yishuifengxiao.common.crawler.monitor
 
com.yishuifengxiao.common.crawler.pipeline - package com.yishuifengxiao.common.crawler.pipeline
 
com.yishuifengxiao.common.crawler.pool - package com.yishuifengxiao.common.crawler.pool
 
com.yishuifengxiao.common.crawler.scheduler - package com.yishuifengxiao.common.crawler.scheduler
 
com.yishuifengxiao.common.crawler.scheduler.impl - package com.yishuifengxiao.common.crawler.scheduler.impl
 
com.yishuifengxiao.common.crawler.scheduler.remover - package com.yishuifengxiao.common.crawler.scheduler.remover
 
com.yishuifengxiao.common.crawler.scheduler.request - package com.yishuifengxiao.common.crawler.scheduler.request
 
com.yishuifengxiao.common.crawler.simulator - package com.yishuifengxiao.common.crawler.simulator
 
com.yishuifengxiao.common.crawler.utils - package com.yishuifengxiao.common.crawler.utils
 
CONNECTION_TIME_OUT - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
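Default connection timeout: 30000 milliseconds (30 seconds).
connectTimeout() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the timeout, in milliseconds, for establishing a connection.
connectTimeout(int) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the timeout, in milliseconds, for establishing a connection.
ConstantStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Constant extraction strategy.
Outputs the configured constant value regardless of the input data.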
ConstantStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ConstantStrategy
 
contain(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Returns whether the data extracted from this page contains the given key.
content() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the content parsing rule.
content(ContentRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the content parsing rule.
contentCompressionEnabled() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets whether the target server is asked to compress content.
contentCompressionEnabled(boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets whether the target server is asked to compress content.
ContentExtract - Interface in com.yishuifengxiao.common.crawler.content
Content parser.
Extracts the target data from a web page as required.
A core usage example follows:
contentExtract - Variable in class com.yishuifengxiao.common.crawler.content.ContentExtractDecorator
User-defined content parser.
ContentExtractDecorator - Class in com.yishuifengxiao.common.crawler.content
Content parser decorator.
Performs the preliminary steps before content parsing.
Its functions are as follows:
1.
ContentExtractDecorator(ContentExtract) - Constructor for class com.yishuifengxiao.common.crawler.content.ContentExtractDecorator
 
ContentExtractor - Interface in com.yishuifengxiao.common.crawler.extractor.content
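Content extractor.
Extracts all matching data from the input according to the content extraction rule.
contentMatcher - Variable in class com.yishuifengxiao.common.crawler.content.ContentExtractDecorator
Content matcher.
ContentMatcher - Interface in com.yishuifengxiao.common.crawler.content.matcher
Content matcher.
Decides, according to the content parsing rule, whether a downloaded page needs data parsing.
Content extraction proceeds only when both the page address and the page content satisfy the matching rule.
contentMatcher - Variable in class com.yishuifengxiao.common.crawler.simulator.SimpleSimulator
Content matcher.
contentPageRule() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the content page address rule.
Multiple rules are separated by ASCII (half-width) commas.
contentPageRule(MatcherRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the content page address rule.
Multiple rules are separated by ASCII (half-width) commas.
ContentRule - Class in com.yishuifengxiao.common.crawler.domain.model
Content parsing rule.
Defines which pages are content pages.
The content page address rule and the content matching rule together determine which pages are content pages from which data must be extracted.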
ContentRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.ContentRule
 
contentRule - Variable in class com.yishuifengxiao.common.crawler.extractor.content.AbstractContentExtractor
 
cookieSpec() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the name of the cookie specification used for HTTP state management.
cookieSpec(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the name of the cookie specification used for HTTP state management.
cookieValue() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the cookie information sent with requests.
When this value is empty, cookies are handled automatically by the core.
cookieValue(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the cookie information sent with requests.
Crawler - Class in com.yishuifengxiao.common.crawler
The 风铃虫 crawler.
Crawler(CrawlerRule) - Constructor for class com.yishuifengxiao.common.crawler.Crawler
Constructor.
Note that this constructor does not validate the rule definition.
CrawlerBuilder - Class in com.yishuifengxiao.common.crawler
Crawler rule builder.
CrawlerConstant - Class in com.yishuifengxiao.common.crawler.domain.constant
Crawler constants class.
CrawlerConstant() - Constructor for class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
 
CrawlerData - Class in com.yishuifengxiao.common.crawler.domain.entity
Output data holding the parsing results.
CrawlerData(Request) - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
 
CrawlerListener - Interface in com.yishuifengxiao.common.crawler.listener
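Crawler processing event listener.
Listens for the various events raised while the crawler is processing.
Note that this listener is synchronous; do not perform any operation here that could block the crawler.
CrawlerProcessor - Class in com.yishuifengxiao.common.crawler
Crawler processor.
Uses a thread pool to process crawler tasks on multiple threads.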
CrawlerProcessor(Crawler, ThreadPoolExecutor) - Constructor for class com.yishuifengxiao.common.crawler.CrawlerProcessor
 
CrawlerRule - Class in com.yishuifengxiao.common.crawler.domain.entity
Crawler rule.
CrawlerRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.CrawlerRule
 
CrawlerWorker - Class in com.yishuifengxiao.common.crawler
Core of crawler processing.
CrawlerWorker(Request, Downloader, CrawlerProcessor) - Constructor for class com.yishuifengxiao.common.crawler.CrawlerWorker
 
creatCrawler() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Creates a simple crawler instance.
create(CrawlerRule) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Creates a default crawler instance.
create() - Static method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Creates a default crawler rule builder.
create(CrawlerRule) - Static method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Creates a default crawler rule builder from an existing rule.
create(SiteRule, Request) - Method in interface com.yishuifengxiao.common.crawler.scheduler.request.RequestCreater
Completes the request information according to the site rule.
create(SiteRule, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.request.SimpleRequestCreater
 
CssStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
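CSS extraction strategy.
Extracts the data of the matching region (possibly an HTML fragment) from the input using a CSS selector and attribute, in CSS extraction mode.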
CssStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.CssStrategy
 
CssTextStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
CSS text extraction strategy.
In this mode only the inner data is included; the outer HTML is not.
CssTextStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.CssTextStrategy
 

D

DEFAULT_REQUEST_DEPTH - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Default request depth.
DEFAULT_THREAD_NUM - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Default number of threads; the default value is 1.
DESC - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.NestConstant
The description in the page's SEO information.
DescpContentExtractor - Class in com.yishuifengxiao.common.crawler.extractor.content.impl
Description extractor.
Extracts the description information from the meta section of the page.
DescpContentExtractor() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.impl.DescpContentExtractor
 
doFilter(String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.BaseLinkFilter
Converts an extracted link address into network URL form.
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.BaseLinkFilter
Converts an extracted link address into network URL form.
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.impl.AbsoluteLinkFilter
 
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.impl.HashLinkFilter
 
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.impl.HttpLinkFilter
 
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.impl.IllegalLinkFilter
 
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.impl.NothingLinkFilter
 
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.impl.RelativeLinkFilter
 
doFilter(BaseLinkFilter, String, String) - Method in class com.yishuifengxiao.common.crawler.link.filter.impl.ShortLinkFilter
 
DomainStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Domain extraction strategy.
Extracts all domain names from the input data.
DomainStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.DomainStrategy
 
doWhenNoDuplicate(Task, RequestCache, Request) - Method in interface com.yishuifengxiao.common.crawler.scheduler.remover.DuplicateRemover
Action to take when the current request is not a duplicate; in general it is enough to store the request in the request task cache.
doWhenNoDuplicate(Task, RequestCache, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.remover.HostDuplicateRemover
 
doWhenNoDuplicate(Task, RequestCache, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.remover.SimpleDuplicateRemover
 
down(WebDriver, Request) - Method in class com.yishuifengxiao.common.crawler.downloader.BaseDownloader
Performs the actual download.
down(Request) - Method in class com.yishuifengxiao.common.crawler.downloader.BaseDownloader
 
down(Request) - Method in interface com.yishuifengxiao.common.crawler.downloader.Downloader
Downloads the resource for the given request.
down(WebDriver, Request) - Method in class com.yishuifengxiao.common.crawler.downloader.impl.SeleniumDownloader
 
down(Request) - Method in class com.yishuifengxiao.common.crawler.downloader.impl.SimpleDownloader
 
down(String, SiteRule, Downloader) - Method in class com.yishuifengxiao.common.crawler.simulator.SimpleSimulator
 
down(String, SiteRule, Downloader) - Method in interface com.yishuifengxiao.common.crawler.simulator.Simulator
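Tests the page download function.
Downloader - Interface in com.yishuifengxiao.common.crawler.downloader
Web page downloader.
Downloads data from the specified website according to the download request task.
DuplicateRemover - Interface in com.yishuifengxiao.common.crawler.scheduler.remover
Request deduplicator.
Checks, before the scheduler stores a task, whether that task has already been seen.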
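The idea behind a request deduplicator can be sketched with a concurrent set. This is a simplified, in-memory illustration only: the class `DedupSketch` and its `offer` method are hypothetical, and requests are reduced to their URL strings rather than the library's Request and RequestCache types.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the crawler library.
class DedupSketch {

    // URLs that have already been accepted for scheduling
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a URL is offered and false on every later offer. */
    boolean offer(String url) {
        // add() is atomic on this set and returns false when the element is already present
        return seen.add(url);
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.offer("https://example.com/a"));
        System.out.println(dedup.offer("https://example.com/a"));
    }
}
```

Because the check and the insert happen in a single `add` call, two worker threads offering the same URL at the same time cannot both be told it is new.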

E

EmailStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Email extraction strategy.
Extracts all email addresses from the input data.
EmailStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.EmailStrategy
 
equals(Object) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
 
ExcludePathMatcher - Class in com.yishuifengxiao.common.crawler.macther.impl
Exclusion matcher.
The matched target must not contain the specified keyword.
ExcludePathMatcher(String) - Constructor for class com.yishuifengxiao.common.crawler.macther.impl.ExcludePathMatcher
 
exist(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.InMemoryRequestCache
 
exist(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
exist(Task, Request) - Method in interface com.yishuifengxiao.common.crawler.cache.RequestCache
Checks whether the request task already exists in the collection.
exitOnBlock(Task) - Method in interface com.yishuifengxiao.common.crawler.listener.CrawlerListener
The task exited because it was blocked by the target server.
exitOnBlock(Task) - Method in class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 
exitOnFinish(Task) - Method in interface com.yishuifengxiao.common.crawler.listener.CrawlerListener
The task exited because it has finished.
exitOnFinish(Task) - Method in class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 
extis(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Checks whether data exists for the given key.
extract(ContentRule, List<ExtractRule>, Page) - Method in interface com.yishuifengxiao.common.crawler.content.ContentExtract
Parses all matching data from the page content.
extract(ContentRule, List<ExtractRule>, Page) - Method in class com.yishuifengxiao.common.crawler.content.ContentExtractDecorator
 
extract(ContentRule, List<ExtractRule>, Page) - Method in class com.yishuifengxiao.common.crawler.content.impl.SimpleContentExtract
 
extract(Page) - Method in class com.yishuifengxiao.common.crawler.extractor.content.AbstractContentExtractor
Extracts data.
extract(Page, List<ExtractFieldRule>) - Method in class com.yishuifengxiao.common.crawler.extractor.content.AbstractContentExtractor
Extracts from the input data according to the extraction rules.
extract(Page) - Method in interface com.yishuifengxiao.common.crawler.extractor.content.ContentExtractor
Extracts the corresponding data from the page.
extract(Page) - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.CharsetContentExtractor
 
extract(Page) - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.DescpContentExtractor
 
extract(Page) - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.KeywordContentExtractor
 
extract(Page) - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.TitleContentExtractor
 
extract(Page, List<ExtractFieldRule>) - Method in class com.yishuifengxiao.common.crawler.extractor.content.SimpleContentExtractor
Extracts data.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.AllStrategy
Returns the input data as-is, without processing it.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ArrayStrategy
Extracts data from the input string according to the rule.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ChnStrategy
Extracts data from the input string according to the rule.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ConstantStrategy
Extracts data from the input string according to the rule.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.CssStrategy
Extracts the data of the matching region (possibly an HTML fragment) from the input using a CSS selector and attribute, in CSS extraction mode.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.CssTextStrategy
In this mode only the inner data is included; the outer HTML is not.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.DomainStrategy
Extracts all domain names from the input data.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.EmailStrategy
Extracts all email addresses from the input data.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.NumStrategy
Extracts all numbers from the input data.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.RegexStrategy
Extracts everything in the input data that matches the regular expression.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.RemoveStrategy
Removes the specified content from the input data according to the parameter.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ReplaceStrategy
Replaces the original characters in the input data with the target characters according to the parameters.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ScriptStrategy
Extracts data from the input using a JavaScript script.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.SubstrStrategy
Cuts a substring of the specified length from the input data according to the parameters.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.SystemStrategy
Replaces the system placeholder [@<yishui>@] in the input data with the specified characters.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.UrlStrategy
Extracts all URLs from the input data.
extract(String, String, String) - Method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.XpathStrategy
Extracts all matching data from the input using XPath according to the parameter.
extract(String, String, String) - Method in interface com.yishuifengxiao.common.crawler.extractor.content.strategy.Strategy
Extracts data from the input string according to the rule.
extract(Page) - Method in class com.yishuifengxiao.common.crawler.extractor.links.impl.SimpleLinkExtractor
 
extract(Page) - Method in interface com.yishuifengxiao.common.crawler.extractor.links.LinkExtractor
Extracts links.
extract(LinkRule, Page) - Method in interface com.yishuifengxiao.common.crawler.link.LinkExtract
Extracts all links in the page.
extract(LinkRule, Page) - Method in class com.yishuifengxiao.common.crawler.link.LinkExtractDecorator
 
extract(String, SiteRule, ExtractRule, Downloader) - Method in class com.yishuifengxiao.common.crawler.simulator.SimpleSimulator
 
extract(String, SiteRule, ExtractRule, Downloader) - Method in interface com.yishuifengxiao.common.crawler.simulator.Simulator
Extraction test.
extract(String, String) - Static method in class com.yishuifengxiao.common.crawler.utils.RegexFactory
Extracts one group of matching content using a regular expression.
Returns as soon as the first match is found.
extractAll(String, String) - Static method in class com.yishuifengxiao.common.crawler.utils.RegexFactory
Extracts all matching content using a regular expression.
Returns every match.
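The difference between a first-match and an all-matches extraction can be shown with `java.util.regex`. This is a sketch under assumptions, not RegexFactory itself: the class `RegexSketch`, the method names `first` and `all`, the (regex, content) parameter order and the null return for no match are all chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the crawler library.
class RegexSketch {

    /** Returns the first match of regex in content, or null when nothing matches. */
    static String first(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        return m.find() ? m.group() : null;
    }

    /** Returns every non-overlapping match of regex in content, in order. */
    static List<String> all(String regex, String content) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(first("\\d+", "a12b345"));
        System.out.println(all("\\d+", "a12b345"));
    }
}
```

Compiling the pattern on every call keeps the sketch short; code that applies the same expression to many pages would normally cache the compiled `Pattern`.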
extractDomain(String) - Static method in class com.yishuifengxiao.common.crawler.utils.LinkUtils
Extracts the domain name from a URL.
ExtractFieldRule - Class in com.yishuifengxiao.common.crawler.domain.model
Field extraction rule.
Defines how data is extracted.
ExtractFieldRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.ExtractFieldRule
 
ExtractorFactory - Class in com.yishuifengxiao.common.crawler.extractor
Extractor factory.
Creates the appropriate extractor for an extraction rule.
ExtractorFactory() - Constructor for class com.yishuifengxiao.common.crawler.extractor.ExtractorFactory
 
extractProtocol(String) - Static method in class com.yishuifengxiao.common.crawler.utils.LinkUtils
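Extracts the protocol from a URL.
extractProtocolAndHost(String) - Static method in class com.yishuifengxiao.common.crawler.utils.LinkUtils
Extracts the protocol and domain name from a URL.
extractRule(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets a content extraction rule by its code.
ExtractRule - Class in com.yishuifengxiao.common.crawler.domain.model
Content extraction rule.
A content extraction rule contains a set of field extraction rules and defines how one item of data is extracted.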
ExtractRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.ExtractRule
 
extractRules() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets all content extraction rules.
extractTopLevelDomain(String) - Static method in class com.yishuifengxiao.common.crawler.utils.LinkUtils
Extracts the top-level domain from a URL.
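Pulling the protocol and host out of a URL, as the LinkUtils entries describe, can be sketched with `java.net.URI`. The class `UrlPartsSketch` and its method names are hypothetical, and keeping an explicit port in `protocolAndHost` is an assumption of this example. Top-level domain extraction is left out because doing it reliably requires a public-suffix list.

```java
import java.net.URI;

// Hypothetical helper, not part of the crawler library.
class UrlPartsSketch {

    /** Returns the scheme, for example "https". */
    static String protocol(String url) {
        return URI.create(url).getScheme();
    }

    /** Returns the host name without port, for example "www.example.com". */
    static String host(String url) {
        return URI.create(url).getHost();
    }

    /** Returns scheme and host, keeping an explicit port when the URL has one. */
    static String protocolAndHost(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort(); // -1 when the URL does not specify a port
        return uri.getScheme() + "://" + uri.getHost() + (port == -1 ? "" : ":" + port);
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/a/b?x=1";
        System.out.println(protocol(url));
        System.out.println(host(url));
        System.out.println(protocolAndHost(url));
    }
}
```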

F

failCount - Variable in class com.yishuifengxiao.common.crawler.CrawlerProcessor
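Number of tasks this instance has failed to parse.
failureMark() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the failure mark.
If the downloaded content contains this value, the request is treated as blocked by the server. The value is a regular expression; if it is empty, this check is skipped.
failureMark(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the failure mark.
If the downloaded content contains this value, the request is treated as blocked by the server. The value is a regular expression; if it is empty, this check is skipped.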
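The failure-mark check described above (regular expression found in the content means blocked, empty mark means no check) can be sketched as follows. The class `FailureMarkSketch`, the method `isBlocked` and the sample pattern are hypothetical; how the crawler actually applies the mark is not shown in this index.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the crawler library.
class FailureMarkSketch {

    /** Returns true when the failure mark, read as a regex, is found in the content. */
    static boolean isBlocked(String failureMark, String content) {
        if (failureMark == null || failureMark.isEmpty()) {
            return false; // an empty mark disables the check
        }
        // find() looks for the pattern anywhere in the content, not a full match
        return Pattern.compile(failureMark).matcher(content).find();
    }

    public static void main(String[] args) {
        String mark = "(?i)access denied|captcha"; // sample pattern, assumed for the example
        System.out.println(isBlocked(mark, "<h1>Access Denied</h1>"));
        System.out.println(isBlocked(mark, "<p>hello</p>"));
    }
}
```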
fieldExtractRule(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
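Gets the field extraction rules of the content extraction rule identified by the given code.
find(String, String) - Static method in class com.yishuifengxiao.common.crawler.utils.RegexFactory
Checks whether the content contains text matching the regular expression.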

G

get(Rule) - Static method in class com.yishuifengxiao.common.crawler.extractor.content.strategy.StrategyFactory
Creates an extraction strategy for the given rule.
get() - Static method in class com.yishuifengxiao.common.crawler.utils.LocalCrawler
Gets the crawler task information.
getActiveCount() - Method in class com.yishuifengxiao.common.crawler.CrawlerProcessor
Gets the maximum number of active threads in the crawler's thread pool.
getAllData() - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Gets all the data.
getAllHeaders() - Method in class com.yishuifengxiao.common.crawler.domain.model.SiteRule
Gets all request header information.
getAllTaskCount() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the total number of tasks.
Note that this number changes over time and should be read after the task has started.
getAllTaskCount() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the total number of tasks.
Note that this number changes over time and should be read after the task has started.
getAutoUserAgent() - Method in class com.yishuifengxiao.common.crawler.domain.model.SiteRule
Gets the browser identifier (User-Agent).
getBoolean(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Gets data of type Boolean.
getCode() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
getConnectTimeout() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the timeout for establishing a connection.
getContentExtract() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the content parser.
getContentExtractor(ExtractRule) - Method in class com.yishuifengxiao.common.crawler.extractor.AbstractExtractorFactory
Creates a content extractor for the given content extraction rule.
getContentExtractor(ExtractRule) - Method in class com.yishuifengxiao.common.crawler.extractor.ExtractorFactory
Creates a content extractor for the given content extraction rule.
getCookies() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the cookie information sent with the request.
getCookiValues() - Method in class com.yishuifengxiao.common.crawler.domain.model.SiteRule
Gets all cookie information.
getCount(Task) - Method in class com.yishuifengxiao.common.crawler.cache.InMemoryRequestCache
 
getCount(Task) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
getCount(Task) - Method in interface com.yishuifengxiao.common.crawler.cache.RequestCache
Gets the number of request tasks in the specified cache collection.
getCrawlerListener() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the event listener.
getCrawlerRule() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the crawler's rule definition.
getCrawlerRule() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the task's rule definition.
getData(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Gets the parsed value for the given key.
getData() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Gets all data extracted from this page.
getDouble(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Gets data of type Double.
getDownloader() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the web page downloader.
getDuplicateRemover() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the request deduplicator.
getExtra() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the extra data passed in when the crawler instance was created.
getExtra() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the extra data passed in when the crawler instance was created.
getExtractedTaskCount() - Method in class com.yishuifengxiao.common.crawler.Crawler
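Gets the number of pages this instance has parsed successfully.
Note that this number changes over time and should be read after the task has started.
getExtractedTaskCount() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the number of pages this instance has parsed successfully.
Note that this number changes over time and should be read after the task has started.
getExtras() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the extra parameters to send with the request.
getFailTaskCount() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the number of pages this instance has failed to parse.
Note that this number changes over time and should be read after the task has started.
getFailTaskCount() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the number of pages this instance has failed to parse.
Note that this number changes over time and should be read after the task has started.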
getFloat(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Gets data of type Float.
getHeaders() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the headers of the current request.
getInt(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Gets data of type int.
getLinkExtract() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the link parser.
getLinkExtractor() - Method in class com.yishuifengxiao.common.crawler.extractor.AbstractExtractorFactory
Creates a link extractor.
getLinkExtractor() - Method in class com.yishuifengxiao.common.crawler.extractor.ExtractorFactory
Creates a link extractor.
getLinks() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
getMatcher(MatcherRule) - Method in class com.yishuifengxiao.common.crawler.macther.MatcherFactory
Creates the matcher for the given link matching rule.
getMethod() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the request method.
getName() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the name of the crawler instance.
getName() - Method in class com.yishuifengxiao.common.crawler.extractor.content.AbstractContentExtractor
 
getName() - Method in interface com.yishuifengxiao.common.crawler.extractor.content.ContentExtractor
Gets the name of the content extractor.
getName() - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.CharsetContentExtractor
 
getName() - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.DescpContentExtractor
 
getName() - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.KeywordContentExtractor
 
getName() - Method in class com.yishuifengxiao.common.crawler.extractor.content.impl.TitleContentExtractor
 
getName() - Method in class com.yishuifengxiao.common.crawler.pool.SimpleThreadFactory
 
getName() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the name of the crawler instance
getObject(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Gets the value as an Object
getPipeline() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the output pipeline
getPriority() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the priority of the request
getRawTxt() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
getRedirectUrl() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Gets the address reached after redirection when the downloader follows redirects
getRedisTemplate() - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
getRedisTemplate() - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.RedisScheduler
 
getReferrer() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the referrer address of the request
getRequest() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
getRequestCache() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the resource cache
getScheduler() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the resource scheduler
getStartTime() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the start time of the crawler instance
getStartTime() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the start time of the task
getStatu() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the status of the crawler
getStatu() - Method in class com.yishuifengxiao.common.crawler.CrawlerProcessor
Gets the status of the instance
getStatu() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the status of the task
getStatuObserver() - Method in class com.yishuifengxiao.common.crawler.Crawler
Gets the status observer
getString(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Gets the value as a String
getThreadPool() - Method in class com.yishuifengxiao.common.crawler.Crawler
 
getThreadPool() - Method in class com.yishuifengxiao.common.crawler.CrawlerProcessor
Gets the thread pool
getUrl() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the target URL of the request
getUserAgent() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Gets the User-Agent string
getUuid() - Method in class com.yishuifengxiao.common.crawler.Crawler
The unique identifier of this instance
getUuid() - Method in interface com.yishuifengxiao.common.crawler.Task
Gets the unique ID of the crawler instance

H

HASH_ADDR - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Marker for hash (fragment) link addresses
hashCode() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
 
HashLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
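Hash link filter
When an extracted link is a relative address that begins with #, it is probably a hash (fragment) address and is left unprocessed
This filter is best placed before the relative link filter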
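
The check described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the class `HashLinkSketch` and both method names are hypothetical, and the real filter is chained through `BaseLinkFilter`, which this index does not document in detail.

```java
import java.util.ArrayList;
import java.util.List;

final class HashLinkSketch {
    // Hypothetical helper: true when a raw href is a same-page fragment such as "#top".
    static boolean isHashLink(String href) {
        return href != null && href.trim().startsWith("#");
    }

    // Drops fragment-only links so that a later relative-link step never sees them.
    static List<String> dropHashLinks(List<String> hrefs) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            if (!isHashLink(href)) {
                kept.add(href);
            }
        }
        return kept;
    }
}
```

Running this step first matters because a relative-link resolver would otherwise turn "#top" into a full URL that points back at the current page.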
HashLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.HashLinkFilter
 
HeaderRule - Class in com.yishuifengxiao.common.crawler.domain.model
Request header parameter configuration
HeaderRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.HeaderRule
 
headers() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets all request header parameters
HostDuplicateRemover - Class in com.yishuifengxiao.common.crawler.scheduler.remover
Query-agnostic duplicate remover
A simple duplicate remover that strips all query parameters from the URL
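
A rough sketch of deduplicating on the URL without its query string is shown below. The class `QueryStrippingDedupSketch` and its methods are hypothetical, and the sketch keeps its own in-memory set rather than using the `Task`, `RequestCache`, and `Request` arguments that the real `noDuplicate` method takes.

```java
import java.util.HashSet;
import java.util.Set;

final class QueryStrippingDedupSketch {
    private final Set<String> seen = new HashSet<>();

    // Hypothetical key function: everything before the first '?'.
    static String stripQuery(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }

    // Returns true the first time a URL (ignoring its query string) is offered.
    boolean noDuplicate(String url) {
        return seen.add(stripQuery(url));
    }
}
```

With this key, `http://a.com/p?x=1` and `http://a.com/p?x=2` count as the same request, which suits sites where the query only carries tracking parameters and is too aggressive for paginated lists.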
HostDuplicateRemover() - Constructor for class com.yishuifengxiao.common.crawler.scheduler.remover.HostDuplicateRemover
 
HTTP - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
HTTP request
HttpLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
Network address link filter
Handles links that are already network addresses and outputs them unchanged
HttpLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.HttpLinkFilter
 
HTTPS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
HTTPS request

I

ILLEGAL_LINKS_SUFFIX - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Matching rule for illegal link expressions
IllegalLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
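Illegal link filter
Filters out illegal links
Links that do not meet the expected rules take no part in later processing
For example, it filters out empty links and links to images, CSS, JS, and font files that do not need to be downloaded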
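
One way to express such a filter is a suffix pattern, sketched below. The suffix list is an assumption made for illustration: the actual value of `ILLEGAL_LINKS_SUFFIX` is not shown in this index, and `IllegalLinkSketch` is a hypothetical class.

```java
import java.util.regex.Pattern;

final class IllegalLinkSketch {
    // Assumed suffix list; the real ILLEGAL_LINKS_SUFFIX constant may differ.
    private static final Pattern STATIC_RESOURCE = Pattern.compile(
            ".*\\.(png|jpe?g|gif|ico|css|js|woff2?|ttf|eot)(\\?.*)?$",
            Pattern.CASE_INSENSITIVE);

    // True for empty links and for links that point at static resources.
    static boolean isIllegal(String link) {
        if (link == null || link.trim().isEmpty()) {
            return true;
        }
        return STATIC_RESOURCE.matcher(link.trim()).matches();
    }
}
```

The optional `(\?.*)?` group lets cache-busting links such as `/img/x.png?v=2` be rejected as well, while `/data/a.json` still passes because the suffix must end the path.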
IllegalLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.IllegalLinkFilter
 
IMPLICITLY_WAIT_MILLIS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
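Timeout for locating an element. If the element has still not been found when this time elapses, a NoSuchElement exception is thrown. Unit: milliseconds.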
InMemoryRequestCache - Class in com.yishuifengxiao.common.crawler.cache
In-memory implementation of the request task cache
InMemoryRequestCache() - Constructor for class com.yishuifengxiao.common.crawler.cache.InMemoryRequestCache
 
INTERCEPT_RETRY_COUNT - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Number of retries when a failure marker is found in the downloaded content several times in a row; defaults to 5
interceptCount() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
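Gets the interception count threshold
Number of retries when a failure marker is found in the downloaded content several times in a row; once this count is exceeded, the crawler instance is shut down. Defaults to 5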
interceptCount(int) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
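Sets the interception count threshold
Number of retries when a failure marker is found in the downloaded content several times in a row; once this count is exceeded, the crawler instance is shut down. Defaults to 5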
interval() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
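Gets the interval between requests, in milliseconds; the actual interval is a random number between 0 and twice this value
Prevents the server from blocking the crawler for requesting too frequently
Defaults to 10000 milliseconds (10 seconds)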
interval(long) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
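Sets the interval between requests, in milliseconds; the actual interval is a random number between 0 and twice this value
Prevents the server from blocking the crawler for requesting too frequently
Defaults to 10000 milliseconds (10 seconds)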
interval - Variable in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerRule
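Interval between requests, in milliseconds; the actual interval is a random number between 0 and twice this value
Prevents the server from blocking the crawler for requesting too frequently; the value must not be less than 0, and a value of 0 disables this feature
Defaults to 10000 milliseconds (10 seconds)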
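
The randomized pause can be sketched like this. It is an illustration only: `IntervalSketch` and `nextDelayMillis` are hypothetical names, and whether the library includes or excludes the upper bound is not stated in this index.

```java
import java.util.concurrent.ThreadLocalRandom;

final class IntervalSketch {
    // Picks a delay in [0, 2 * interval) milliseconds; an interval of 0 disables the pause.
    static long nextDelayMillis(long interval) {
        if (interval < 0) {
            throw new IllegalArgumentException("interval must not be negative");
        }
        if (interval == 0) {
            return 0L;
        }
        return ThreadLocalRandom.current().nextLong(2 * interval);
    }
}
```

Drawing uniformly from 0 to twice the configured value keeps the mean delay close to the configured interval while avoiding a fixed request rhythm that is easy for a server to detect.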
isActive() - Method in class com.yishuifengxiao.common.crawler.CrawlerProcessor
Determines whether this processor is active
isEmpty() - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Whether the data is empty
isRun() - Method in class com.yishuifengxiao.common.crawler.Crawler
Whether the crawler is currently running
isSkip() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 

K

KEY_WORD - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.NestConstant
Keywords in the page's SEO information
keyword(String) - Static method in class com.yishuifengxiao.common.crawler.utils.LinkUtils
Extracts the short domain name from a URL
For example, www.yishuifengxiao.com yields yishuifengxiao
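
A simple way to get that result is to take the second-to-last label of the host, as sketched below. This is an assumption about the approach, not the code in `LinkUtils`; `DomainKeywordSketch` is a hypothetical class, and the sketch does not handle multi-part suffixes such as `.com.cn` or URLs with user-info.

```java
final class DomainKeywordSketch {
    // Returns the second-to-last host label, e.g. "yishuifengxiao" for "www.yishuifengxiao.com".
    static String keyword(String url) {
        String s = url.trim();
        int scheme = s.indexOf("://");
        if (scheme >= 0) {
            s = s.substring(scheme + 3);
        }
        // The host ends at the first path, query, fragment, or port delimiter.
        int end = s.length();
        for (char c : new char[] {'/', '?', '#', ':'}) {
            int i = s.indexOf(c);
            if (i >= 0 && i < end) {
                end = i;
            }
        }
        String host = s.substring(0, end);
        String[] labels = host.split("\\.");
        return labels.length >= 2 ? labels[labels.length - 2] : host;
    }
}
```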
KeywordContentExtractor - Class in com.yishuifengxiao.common.crawler.extractor.content.impl
Keywords extractor
Extracts the keywords information from the meta section of the page
KeywordContentExtractor() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.impl.KeywordContentExtractor
 
KeywordPathMatcher - Class in com.yishuifengxiao.common.crawler.macther.impl
Keyword matcher
The matched target must contain the specified keyword
KeywordPathMatcher(String) - Constructor for class com.yishuifengxiao.common.crawler.macther.impl.KeywordPathMatcher
 

L

LEFT_SLASH - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Left slash
link() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the link parsing rule
link(LinkRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the link parsing rule
link(String, SiteRule, LinkRule, Downloader) - Method in class com.yishuifengxiao.common.crawler.simulator.SimpleSimulator
 
link(String, SiteRule, LinkRule, Downloader) - Method in interface com.yishuifengxiao.common.crawler.simulator.Simulator
Tests link extraction
LinkExtract - Interface in com.yishuifengxiao.common.crawler.link
Link parser
Extracts all links that satisfy the rules from the raw text of a page
LinkExtractDecorator - Class in com.yishuifengxiao.common.crawler.link
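Simple link parser
It does the following:
1. Converts the links found in the raw page text to a uniform network address form
2. Extracts all links that meet the requirements from the converted addresses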
LinkExtractDecorator(LinkExtract) - Constructor for class com.yishuifengxiao.common.crawler.link.LinkExtractDecorator
 
LinkExtractor - Interface in com.yishuifengxiao.common.crawler.extractor.links
Link extractor
Extracts all links from the raw data
LinkRule - Class in com.yishuifengxiao.common.crawler.domain.model
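Link parsing rule
Determines the start page and which links to extract; that is, starting from the seed links, it yields the links of all subsequent list pages and content pages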
LinkRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.LinkRule
 
linkRules() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the link extraction rules
LinkUtils - Class in com.yishuifengxiao.common.crawler.utils
Link handling utility class
LinkUtils() - Constructor for class com.yishuifengxiao.common.crawler.utils.LinkUtils
 
LocalCrawler - Class in com.yishuifengxiao.common.crawler.utils
Per-thread cache class for crawler task information
LocalCrawler() - Constructor for class com.yishuifengxiao.common.crawler.utils.LocalCrawler
 
lookAndCache(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.InMemoryRequestCache
 
lookAndCache(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
lookAndCache(Task, Request) - Method in interface com.yishuifengxiao.common.crawler.cache.RequestCache
First checks whether the request task exists in the set, then stores the request task in that set

M

main(String[]) - Static method in class com.yishuifengxiao.common.crawler.utils.LinkUtils
 
match(PageRule, String) - Method in interface com.yishuifengxiao.common.crawler.content.matcher.ContentMatcher
Determines whether the page content satisfies the matching rule
match(PageRule, String) - Method in class com.yishuifengxiao.common.crawler.content.matcher.SimpleContentMatcher
 
match(String) - Method in class com.yishuifengxiao.common.crawler.macther.impl.ExcludePathMatcher
 
match(String) - Method in class com.yishuifengxiao.common.crawler.macther.impl.KeywordPathMatcher
 
match(String) - Method in class com.yishuifengxiao.common.crawler.macther.impl.RegexPathMatcher
 
match(String) - Method in class com.yishuifengxiao.common.crawler.macther.impl.SimplePathMatcher
 
match(String) - Method in interface com.yishuifengxiao.common.crawler.macther.PathMatcher
Determines whether the path fully matches the given pattern
match(String, SiteRule, ContentRule, Downloader) - Method in class com.yishuifengxiao.common.crawler.simulator.SimpleSimulator
 
match(String, SiteRule, ContentRule, Downloader) - Method in interface com.yishuifengxiao.common.crawler.simulator.Simulator
Tests content matching
match(String, String) - Static method in class com.yishuifengxiao.common.crawler.utils.RegexFactory
Determines whether the content matches the regular expression
matcherCaseSensitive() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets whether matching is case-sensitive
matcherCaseSensitive(Boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets whether matching is case-sensitive
MatcherFactory - Class in com.yishuifengxiao.common.crawler.macther
Matcher factory
MatcherFactory() - Constructor for class com.yishuifengxiao.common.crawler.macther.MatcherFactory
 
matcherFuzzy() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets whether matching is fuzzy
matcherFuzzy(Boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets whether matching is fuzzy
matcherMode() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the matching mode
matcherMode(Boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the matching mode
matcherPattern() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the content matching parameter
matcherPattern(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the content matching parameter
MatcherRule - Class in com.yishuifengxiao.common.crawler.domain.model
Link filtering rule
MatcherRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.MatcherRule
 
matcherTarget() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the expected match value
matcherTarget(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the expected match value
matcherType() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the content matching type
matcherType(Type) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the content matching type
matchHttpRequest(String) - Static method in class com.yishuifengxiao.common.crawler.utils.LinkUtils
Determines whether the string has the form of a network request address
MAX_REDIRECTS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
The maximum number of redirects to follow. The limit on the number of redirects is intended to prevent infinite loops
Defaults to 50
MAX_REQUEST_DEPTH - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Default maximum request depth
maxDepth() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the maximum request depth
maxDepth(long) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the maximum request depth
maxRedirects() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the maximum number of redirects to follow
maxRedirects(int) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the maximum number of redirects to follow

N

NestConstant - Class in com.yishuifengxiao.common.crawler.domain.constant
Constant definitions for some common built-in attributes
NestConstant() - Constructor for class com.yishuifengxiao.common.crawler.domain.constant.NestConstant
 
NETWORK_ADDR_LINK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Marker for network addresses
newThread(Runnable) - Method in class com.yishuifengxiao.common.crawler.pool.SimpleThreadFactory
 
noDuplicate(Task, RequestCache, Request) - Method in interface com.yishuifengxiao.common.crawler.scheduler.remover.DuplicateRemover
Determines whether the current request is a duplicate
noDuplicate(Task, RequestCache, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.remover.HostDuplicateRemover
 
noDuplicate(Task, RequestCache, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.remover.SimpleDuplicateRemover
 
normalizeUri() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets whether the client should normalize URIs in requests
normalizeUri(boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets whether the client should normalize URIs in requests
NOT_LINK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Expression for values in a tags that are not links
notEmpty() - Method in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerData
Whether the data is not empty
NothingLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
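Placeholder link filter
Does nothing and returns the original URL unchanged; generally used as a placeholder in the last position of the filter chain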
NothingLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.NothingLinkFilter
 
NumStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Number extraction strategy
Extracts all the digits from the input data
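
A minimal sketch of digit extraction follows. How the real `NumStrategy` joins separate runs of digits is not stated in this index, so the concatenation used here is an assumption, and `NumExtractSketch` is a hypothetical class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumExtractSketch {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Concatenates every run of ASCII digits; returns "" for null input or when no digit is found.
    static String extract(String input) {
        if (input == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            out.append(m.group());
        }
        return out.toString();
    }
}
```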
NumStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.NumStrategy
 

O

onDownError(Task, Page, Exception) - Method in interface com.yishuifengxiao.common.crawler.listener.CrawlerListener
Notification that a page download failed
onDownError(Task, Page, Exception) - Method in class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 
onDownSuccess(Task, Page) - Method in interface com.yishuifengxiao.common.crawler.listener.CrawlerListener
Notification that a page download succeeded
onDownSuccess(Task, Page) - Method in class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 
onExtractError(Task, Page, Exception) - Method in interface com.yishuifengxiao.common.crawler.listener.CrawlerListener
Notification that parsing a page failed
onExtractError(Task, Page, Exception) - Method in class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 
onExtractSuccess(Task, Page) - Method in interface com.yishuifengxiao.common.crawler.listener.CrawlerListener
Notification that parsing a page succeeded
onExtractSuccess(Task, Page) - Method in class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 
onNullRquest(Task) - Method in interface com.yishuifengxiao.common.crawler.listener.CrawlerListener
Triggered when the URL of the request obtained from the scheduler is empty
onNullRquest(Task) - Method in class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 

P

Page - Class in com.yishuifengxiao.common.crawler.domain.entity
Crawler page object
Page(Request) - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.Page
 
Page() - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.Page
 
PAGE_LOAD_SCRIPT_TIME_OUT_MILLIS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
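Page load timeout. WebDriver waits for the page to finish loading before continuing, so if the page has still not loaded when the configured time elapses, WebDriver throws an exception. Unit: milliseconds.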
pageRule() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the content matching rule
pageRule(PageRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the content matching rule
PageRule - Class in com.yishuifengxiao.common.crawler.domain.model
Content matching rule for content pages
Indicates, based on the fetched content, whether the page needs content extraction
PageRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.PageRule
 
PathMatcher - Interface in com.yishuifengxiao.common.crawler.macther
Path matcher
Determines whether a path matches the given pattern
Pattern - Enum in com.yishuifengxiao.common.crawler.domain.eunm
Link matching mode
pattern(String) - Static method in class com.yishuifengxiao.common.crawler.utils.RegexFactory
Gets a Pattern object for the given regular expression
Pipeline - Interface in com.yishuifengxiao.common.crawler.pipeline
Output pipeline
Outputs the parsed data
poll(Task) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.RedisScheduler
 
poll(Task) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.SimpleScheduler
 
poll(Task) - Method in interface com.yishuifengxiao.common.crawler.scheduler.Scheduler
Takes one request task from the resource scheduler
poll(Task) - Method in class com.yishuifengxiao.common.crawler.scheduler.SchedulerDecorator
 
preHandle(Request, WebDriver) - Method in class com.yishuifengxiao.common.crawler.downloader.BaseDownloader
Pre-processing step before the actual download
The web browser object can be modified here, for example to set properties
preHandle(Request, WebDriver) - Method in class com.yishuifengxiao.common.crawler.downloader.impl.SeleniumDownloader
 
push(Task, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.RedisScheduler
 
push(Task, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.SimpleScheduler
 
push(Task, Request) - Method in interface com.yishuifengxiao.common.crawler.scheduler.Scheduler
Receives all request tasks and stores them
push(Task, Request) - Method in class com.yishuifengxiao.common.crawler.scheduler.SchedulerDecorator
 
put(Task) - Static method in class com.yishuifengxiao.common.crawler.utils.LocalCrawler
Stores the information of one crawler task

Q

QUERY_SEPARATOR - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Question mark separator

R

recieve(CrawlerData) - Method in interface com.yishuifengxiao.common.crawler.pipeline.Pipeline
Outputs the parsed data
recieve(CrawlerData) - Method in class com.yishuifengxiao.common.crawler.pipeline.SimplePipeline
 
redirectsEnabled() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets whether redirects should be handled automatically
redirectsEnabled(boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets whether redirects should be handled automatically
RedisRequestCache - Class in com.yishuifengxiao.common.crawler.cache
Redis-based request recorder
RedisRequestCache(RedisTemplate<String, Object>) - Constructor for class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
RedisScheduler - Class in com.yishuifengxiao.common.crawler.scheduler.impl
Redis-based resource scheduler
RedisScheduler(RedisTemplate<String, Object>) - Constructor for class com.yishuifengxiao.common.crawler.scheduler.impl.RedisScheduler
 
referrer() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the referrer page of requests
An empty value means the core handles it automatically
referrer(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the referrer page of requests
REGEX_DOMAIN - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Extracts the domain name
REGEX_MATCH_ALL - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Regular expression that matches all URLs
REGEX_PROTOCOL_AND_HOST - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Extracts the protocol and the domain name
REGEX_URL - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Regular expression that matches all URLs
RegexFactory - Class in com.yishuifengxiao.common.crawler.utils
Regular expression factory
RegexFactory() - Constructor for class com.yishuifengxiao.common.crawler.utils.RegexFactory
 
RegexPathMatcher - Class in com.yishuifengxiao.common.crawler.macther.impl
Regex matcher
The matched content must conform to the specified regular expression
RegexPathMatcher(String) - Constructor for class com.yishuifengxiao.common.crawler.macther.impl.RegexPathMatcher
 
RegexStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Regex extraction strategy
Extracts from the input data everything that matches the regular expression
RegexStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.RegexStrategy
 
RelativeLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
Relative address link filter
Converts extracted relative addresses into network address form
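
The conversion can be illustrated with the standard `java.net.URI.resolve` method. This is a sketch under the assumption that resolution follows the usual URI rules; the library's own filter may build the address differently, and `RelativeLinkSketch` is a hypothetical class.

```java
import java.net.URI;

final class RelativeLinkSketch {
    // Resolves a relative href against the URL of the page it was found on.
    // URI.create throws IllegalArgumentException for malformed input such as unescaped spaces.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }
}
```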
RelativeLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.RelativeLinkFilter
 
relativeRedirectsAllowed() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets whether relative redirects should be rejected
relativeRedirectsAllowed(boolean) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets whether relative redirects should be rejected
remove(Task) - Method in class com.yishuifengxiao.common.crawler.cache.InMemoryRequestCache
 
remove(Task) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
remove(Task) - Method in interface com.yishuifengxiao.common.crawler.cache.RequestCache
Removes the specified cache set
RemoveStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Character removal strategy
Removes the specified content from the input data according to the parameter
RemoveStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.RemoveStrategy
 
ReplaceStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
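Character replacement strategy
Replaces the original characters in the input data with the target characters according to the parameter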
ReplaceStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ReplaceStrategy
 
Request - Class in com.yishuifengxiao.common.crawler.domain.entity
The current request object
Request(String, String) - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.Request
 
Request(String, String, long) - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.Request
 
REQUEST_HOSTORY - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
The set of all requests
REQUEST_INTERVAL_TIME - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Average interval between requests, in milliseconds
RequestCache - Interface in com.yishuifengxiao.common.crawler.cache
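Request task cache
Mainly used to determine whether a request task has been seen before, to help with request deduplication
It does the following:
1. Checks whether a request task exists in the specified request task set
2. Stores a request task in the set with the specified name
3. Clears the specified request task cache set
4. Counts the request tasks in the specified request task cache set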
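
The four operations listed above can be sketched with named in-memory sets. This is not the `InMemoryRequestCache` source: the sketch uses plain strings in place of `Task` and `Request`, the method names `exists` and `count` are hypothetical, and the return value of `lookAndCache` (true when the request was already present) is an assumption.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class RequestCacheSketch {
    // One concurrent set of request keys per task name.
    private final Map<String, Set<String>> sets = new ConcurrentHashMap<>();

    private Set<String> setFor(String name) {
        return sets.computeIfAbsent(name, k -> ConcurrentHashMap.newKeySet());
    }

    // 1. Is the request already in the named set?
    boolean exists(String name, String url) {
        return setFor(name).contains(url);
    }

    // 2. Store the request in the named set.
    void save(String name, String url) {
        setFor(name).add(url);
    }

    // Check and store in one step; true means the request was already there.
    boolean lookAndCache(String name, String url) {
        return !setFor(name).add(url);
    }

    // 3. Drop the whole named set.
    void remove(String name) {
        sets.remove(name);
    }

    // 4. Number of requests recorded in the named set.
    long count(String name) {
        Set<String> s = sets.get(name);
        return s == null ? 0 : s.size();
    }
}
```

Doing the check and the store through a single `Set.add` call avoids the race in which two worker threads both see a URL as new.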
RequestCreater - Interface in com.yishuifengxiao.common.crawler.scheduler.request
Request creator
Builds a complete request task from the site rule and the request task
RETRY_COUNT - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Number of times a failed request is re-executed
retryCount() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the retry count for failed requests
retryCount(int) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the retry count for failed requests
Rule - Enum in com.yishuifengxiao.common.crawler.domain.eunm
Extraction type
RuleConstant - Class in com.yishuifengxiao.common.crawler.domain.constant
Rule constants class
RuleConstant() - Constructor for class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
 
run() - Method in class com.yishuifengxiao.common.crawler.Crawler
Starts a crawler instance synchronously
run() - Method in class com.yishuifengxiao.common.crawler.CrawlerProcessor
 
run() - Method in class com.yishuifengxiao.common.crawler.CrawlerWorker
 

S

save(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.InMemoryRequestCache
 
save(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
save(Task, Request) - Method in interface com.yishuifengxiao.common.crawler.cache.RequestCache
Stores the request task in the set with the specified name
SC_INTERNAL_SERVER_ERROR - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
Default response code for a failed request, 500
SC_OK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
Default response code for a successful request, 200
Scheduler - Interface in com.yishuifengxiao.common.crawler.scheduler
Resource scheduler
Responsible for scheduling and managing resources
It does the following:
1.
SchedulerDecorator - Class in com.yishuifengxiao.common.crawler.scheduler
Resource scheduler decorator
Responsible for scheduling and managing resources
It does the following:
1.
SchedulerDecorator(RequestCache, Scheduler, DuplicateRemover) - Constructor for class com.yishuifengxiao.common.crawler.scheduler.SchedulerDecorator
 
SCRIPT_TIME_OUT_MILLIS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
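Timeout for asynchronous scripts. WebDriver can execute scripts asynchronously; this sets the timeout for an asynchronously executed script to return its result. Unit: milliseconds.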
ScriptStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Script extractor
Extracts data from the input parameter using a JS script
Example script:
ScriptStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ScriptStrategy
 
SeleniumDownloader - Class in com.yishuifengxiao.common.crawler.downloader.impl
Firefox-based downloader
Implemented with selenium-java
SeleniumDownloader(String, long) - Constructor for class com.yishuifengxiao.common.crawler.downloader.impl.SeleniumDownloader
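Constructor
The path to the geckodriver browser driver file must be passed in
geckodriver can be downloaded from https://github.com/mozilla/geckodriver/releases
Configure this parameter according to the runtime environment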
SEPARATOR - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
Separator used to join the values when an extraction yields multiple entries
setCode(int) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
setConnectTimeout(int) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the timeout for establishing a connection
setContentExtract(ContentExtract) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the content parser
setCookies(Map<String, String>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the cookies to carry with the request
setCrawlerListener(CrawlerListener) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the event listener
setData(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Sets the output data
Replaces the existing output data
setDownloader(Downloader) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the page downloader
setDuplicateRemover(DuplicateRemover) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the request duplicate remover
setExtra(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the extra information carried by the crawler
This call clears the existing extra information
setExtra(String, Object) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets an item of extra information on the crawler
setExtractRules(List<ExtractRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the content extraction rules
Clears the existing content extraction rules
setExtractRules(String, List<ExtractFieldRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the field extraction rules of the content extraction rule identified by the given code
setExtras(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the extra parameters to carry with the request
setHeaders(List<HeaderRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Clears the existing values, then sets the request header parameters
setHeaders(Map<String, String>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the headers of the current request
setLinkExtract(LinkExtract) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the link parser
setLinkRules(Set<MatcherRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Clears the existing link extraction rules, then sets the link extraction rules
setLinks(List<String>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Sets the link addresses
Replaces the existing collection of link addresses
setMethod(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the request method
setName(String) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the name of the crawler instance
setName(String) - Method in class com.yishuifengxiao.common.crawler.pool.SimpleThreadFactory
 
setPipeline(Pipeline) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the output pipeline
setPriority(long) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the priority of the request
setRawTxt(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
setRedirectUrl(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
Sets the address reached after redirection when the downloader follows redirects
setRedisTemplate(RedisTemplate<String, Object>) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
 
setRedisTemplate(RedisTemplate<String, Object>) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.RedisScheduler
 
setReferrer(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the referrer address of the request
setRequest(Request) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
setRequestCache(RequestCache) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the resource cache
setScheduler(Scheduler) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the resource scheduler
setSkip(boolean) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
 
setStatuObserver(StatuObserver) - Method in class com.yishuifengxiao.common.crawler.Crawler
Sets the status observer
setThreadPool(ThreadPoolExecutor) - Method in class com.yishuifengxiao.common.crawler.Crawler
 
setUrl(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the target URL of the request
setUserAgent(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
Sets the User-Agent string
SHORT_ADDR_LINK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
Marker for short (simple) addresses
ShortLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
Short link filter
Handles links that begin with a double slash and completes them into network address form
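
Links that begin with a double slash are protocol-relative, so completing them amounts to prepending a scheme. The sketch below borrows the scheme of the page the link was found on, which is an assumption about the library's behavior; `ShortLinkSketch` is a hypothetical class.

```java
final class ShortLinkSketch {
    // Turns "//host/path" into "<scheme>://host/path", reusing the scheme of the current page.
    // Links that do not start with "//" are returned unchanged.
    static String expand(String pageUrl, String href) {
        if (!href.startsWith("//")) {
            return href;
        }
        int i = pageUrl.indexOf("://");
        String scheme = i > 0 ? pageUrl.substring(0, i) : "http";
        return scheme + ":" + href;
    }
}
```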
ShortLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.ShortLinkFilter
 
SimpleContentExtract - Class in com.yishuifengxiao.common.crawler.content.impl
Default simple content parser
Calls the content extractors to parse the data in the input content
SimpleContentExtract(List<ContentExtractor>) - Constructor for class com.yishuifengxiao.common.crawler.content.impl.SimpleContentExtract
 
SimpleContentExtractor - Class in com.yishuifengxiao.common.crawler.extractor.content
Simple content extractor
SimpleContentExtractor(ExtractRule) - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.SimpleContentExtractor
 
SimpleContentMatcher - Class in com.yishuifengxiao.common.crawler.content.matcher
Default content matcher
SimpleContentMatcher() - Constructor for class com.yishuifengxiao.common.crawler.content.matcher.SimpleContentMatcher
 
SimpleCrawlerListener - Class in com.yishuifengxiao.common.crawler.listener
Default crawler event listener
Outputs nothing
SimpleCrawlerListener() - Constructor for class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
 
SimpleDownloader - Class in com.yishuifengxiao.common.crawler.downloader.impl
Page downloader based on JSOUP
Features:
1.
SimpleDownloader() - Constructor for class com.yishuifengxiao.common.crawler.downloader.impl.SimpleDownloader
 
SimpleDuplicateRemover - Class in com.yishuifengxiao.common.crawler.scheduler.remover
Full-path duplicate remover
A simple implementation of the request duplicate remover
SimpleDuplicateRemover() - Constructor for class com.yishuifengxiao.common.crawler.scheduler.remover.SimpleDuplicateRemover
 
SimpleLinkExtractor - Class in com.yishuifengxiao.common.crawler.extractor.links.impl
Simple link extractor implementation
SimpleLinkExtractor() - Constructor for class com.yishuifengxiao.common.crawler.extractor.links.impl.SimpleLinkExtractor
 
SimplePathMatcher - Class in com.yishuifengxiao.common.crawler.macther.impl
Simple matcher
Performs no matching and lets everything through
SimplePathMatcher() - Constructor for class com.yishuifengxiao.common.crawler.macther.impl.SimplePathMatcher
 
SimplePipeline - Class in com.yishuifengxiao.common.crawler.pipeline
Default output pipeline
Writes the information to the log
SimplePipeline() - Constructor for class com.yishuifengxiao.common.crawler.pipeline.SimplePipeline
 
SimpleRequestCreater - Class in com.yishuifengxiao.common.crawler.scheduler.request
Simple request creator
SimpleRequestCreater() - Constructor for class com.yishuifengxiao.common.crawler.scheduler.request.SimpleRequestCreater
 
SimpleScheduler - Class in com.yishuifengxiao.common.crawler.scheduler.impl
Simple resource scheduler
SimpleScheduler() - Constructor for class com.yishuifengxiao.common.crawler.scheduler.impl.SimpleScheduler
 
SimpleSimulator - Class in com.yishuifengxiao.common.crawler.simulator
Simple simulation extractor
SimpleSimulator() - Constructor for class com.yishuifengxiao.common.crawler.simulator.SimpleSimulator
 
SimpleStatuObserver - Class in com.yishuifengxiao.common.crawler.monitor
Default crawler status observer
SimpleStatuObserver() - Constructor for class com.yishuifengxiao.common.crawler.monitor.SimpleStatuObserver
 
SimpleThreadFactory - Class in com.yishuifengxiao.common.crawler.pool
Thread factory
SimpleThreadFactory(String) - Constructor for class com.yishuifengxiao.common.crawler.pool.SimpleThreadFactory
 
Simulator - Interface in com.yishuifengxiao.common.crawler.simulator
Extraction tester
Used to test whether crawler rules are configured correctly; do not use it as a production bulk-crawling tool
SimulatorData - Class in com.yishuifengxiao.common.crawler.domain.entity
Simulation result data
SimulatorData() - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.SimulatorData
 
site() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the site configuration rule
site(SiteRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the site configuration rule
SiteConstant - Class in com.yishuifengxiao.common.crawler.domain.constant
Constants for site rules.
SiteConstant() - Constructor for class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
 
SiteRule - Class in com.yishuifengxiao.common.crawler.domain.model
Site rule.
SiteRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.SiteRule
 
start() - Method in class com.yishuifengxiao.common.crawler.Crawler
Asynchronously starts a crawler instance.
start() - Method in interface com.yishuifengxiao.common.crawler.Task
Asynchronously starts the crawler instance.
startUrl() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the start URLs.
Multiple start URLs are separated by half-width (ASCII) commas.
startUrl(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the start URLs.
Multiple start URLs are separated by half-width (ASCII) commas.
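The comma-separated format described for `startUrl` can be illustrated with a small splitter. This is only a sketch: the `StartUrls` class is hypothetical, and trimming whitespace and dropping empty segments are assumptions about how such a string might reasonably be parsed, not documented library behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the library.
final class StartUrls {
    private StartUrls() {}

    // Splits a comma-separated start URL string into individual URLs.
    static List<String> split(String startUrl) {
        List<String> urls = new ArrayList<>();
        if (startUrl == null) {
            return urls;
        }
        for (String part : startUrl.split(",")) {
            String url = part.trim();   // assumption: surrounding whitespace is ignored
            if (!url.isEmpty()) {       // assumption: empty segments are skipped
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(split("https://example.com/a, https://example.com/b"));
    }
}
```

Note that a URL containing a literal comma would be split in two by this approach, so such URLs would need to be percent-encoded first.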
statCheck() - Method in class com.yishuifengxiao.common.crawler.domain.model.SiteRule
Whether to perform an interception check.
Statu - Enum in com.yishuifengxiao.common.crawler.domain.eunm
Crawler status.
StatuObserver - Interface in com.yishuifengxiao.common.crawler.monitor
Crawler status observer.
Monitors changes in the crawler's status.
stop() - Method in class com.yishuifengxiao.common.crawler.Crawler
Stops the crawler.
stop() - Method in interface com.yishuifengxiao.common.crawler.Task
Stops the crawler instance.
Strategy - Interface in com.yishuifengxiao.common.crawler.extractor.content.strategy
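Extraction strategy.
Extracts the relevant information from the input data according to the strategy and returns it directly as the output data.
If processing fails or an argument is invalid, the output is an empty string.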
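The "empty string on failure or invalid input" contract can be sketched as a wrapper around an arbitrary extraction function. The `SafeStrategy` class and the `(input, parameter)` shape of the function are assumptions made for illustration; the actual signature of `Strategy` is not shown in this index.

```java
import java.util.function.BiFunction;

// Hypothetical helper for illustration; the real Strategy interface may look different.
final class SafeStrategy {
    private SafeStrategy() {}

    // Runs the extraction and maps null arguments, null results, and runtime failures to "".
    static String apply(BiFunction<String, String, String> strategy, String input, String param) {
        if (input == null || param == null) {
            return "";   // assumption: null counts as an invalid argument
        }
        try {
            String out = strategy.apply(input, param);
            return out == null ? "" : out;
        } catch (RuntimeException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        // A toy strategy: keep the first N characters, where N is parsed from the parameter.
        BiFunction<String, String, String> firstN =
                (in, p) -> in.substring(0, Integer.parseInt(p));
        System.out.println(apply(firstN, "hello world", "5"));
        System.out.println(apply(firstN, "hello", "abc").isEmpty());
    }
}
```

Returning an empty string instead of throwing lets strategies be chained without each step having to handle exceptions from the previous one.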
StrategyFactory - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy
Factory for extraction strategies.
StrategyFactory() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.StrategyFactory
 
SubstrStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
Substring strategy.
Extracts a substring of the specified length from the input data according to the parameter.
SubstrStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.SubstrStrategy
 
SystemStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
System placeholder replacement strategy.
Replaces the system placeholder [@<yishui>@] in the input data with the specified characters.
SystemStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.SystemStrategy
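The placeholder replacement described for `SystemStrategy` amounts to a literal string substitution. The sketch below uses a hypothetical `PlaceholderReplacer` class; only the placeholder text `[@<yishui>@]` comes from this page, and the null handling is an assumption.

```java
// Hypothetical helper for illustration; not the library's SystemStrategy.
final class PlaceholderReplacer {
    // Placeholder text as quoted in the SystemStrategy description.
    static final String PLACEHOLDER = "[@<yishui>@]";

    private PlaceholderReplacer() {}

    static String replace(String input, String replacement) {
        if (input == null || replacement == null) {
            return "";   // assumption: invalid arguments yield an empty string
        }
        // String.replace treats both arguments literally, so the brackets need no regex escaping.
        return input.replace(PLACEHOLDER, replacement);
    }

    public static void main(String[] args) {
        System.out.println(replace("page-[@<yishui>@].html", "3"));
    }
}
```

Using `String.replace` rather than `String.replaceAll` matters here, because `[` and `]` would otherwise be interpreted as a regular-expression character class.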
 

T

Task - Interface in com.yishuifengxiao.common.crawler
Crawler task.
taskCount - Variable in class com.yishuifengxiao.common.crawler.CrawlerProcessor
Count of tasks successfully parsed by this instance.
testContent(String, SiteRule, ExtractRule) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests a content extraction rule.
Uses the default downloader.
testContent(String, SiteRule, ExtractRule, Downloader) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests a content extraction rule.
Uses a custom downloader.
testDown(String, SiteRule) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests the page downloader.
Uses the default downloader.
testDown(String, SiteRule, Downloader) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests the page downloader.
Uses a custom downloader.
testLink(String, SiteRule, LinkRule) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests a link extraction rule.
Uses the default downloader.
testLink(String, SiteRule, LinkRule, Downloader) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests a link extraction rule.
Uses a custom downloader.
testMatcher(String, SiteRule, ContentRule) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests content matching.
testMatcher(String, SiteRule, ContentRule, Downloader) - Static method in class com.yishuifengxiao.common.crawler.Crawler
Tests content matching.
threadNum() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Number of threads the crawler uses for parsing.
Defaults to 1.
threadNum(int) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the number of threads the crawler uses for parsing. Defaults to 1.
threadNum - Variable in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerRule
Number of threads used for content parsing; defaults to the number of CPU cores on the host.
TITLE - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.NestConstant
The title in the page's SEO information.
TitleContentExtractor - Class in com.yishuifengxiao.common.crawler.extractor.content.impl
Title extractor.
Extracts the title from the head section of the page.
TitleContentExtractor() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.impl.TitleContentExtractor
 
toString() - Method in class com.yishuifengxiao.common.crawler.Crawler
 
toString() - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
 
Type - Enum in com.yishuifengxiao.common.crawler.domain.eunm
Content matching strategy.

U

update(Task, Statu) - Method in class com.yishuifengxiao.common.crawler.monitor.SimpleStatuObserver
 
update(Task, Statu) - Method in interface com.yishuifengxiao.common.crawler.monitor.StatuObserver
Called when the status of a task has changed.
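The `update(Task, Statu)` callback follows the classic observer pattern: the task notifies each registered observer when its status changes. The sketch below is self-contained and uses hypothetical names throughout (`ObserverDemo`, `State`, `StateObserver`, `DemoTask`); in particular, the `RUNNING` and `STOPPED` constants are invented, because the real members of `Statu` are not listed in this index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the observer pattern; not the library's classes.
final class ObserverDemo {
    // Hypothetical status values; the real Statu constants are not listed in this index.
    enum State { RUNNING, STOPPED }

    // Hypothetical counterpart of StatuObserver.update(Task, Statu).
    interface StateObserver {
        void update(String taskName, State state);
    }

    static final class DemoTask {
        private final String name;
        private final List<StateObserver> observers = new ArrayList<>();

        DemoTask(String name) {
            this.name = name;
        }

        void addObserver(StateObserver observer) {
            observers.add(observer);
        }

        // Notifies every registered observer of the new state, in registration order.
        void changeState(State next) {
            for (StateObserver observer : observers) {
                observer.update(name, next);
            }
        }
    }

    public static void main(String[] args) {
        DemoTask task = new DemoTask("demo");
        task.addObserver((name, state) -> System.out.println(name + " -> " + state));
        task.changeState(State.RUNNING);
        task.changeState(State.STOPPED);
    }
}
```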
URL - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.NestConstant
The URL of the page.
UrlStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
URL extraction strategy.
Extracts all URLs from the input data.
UrlStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.UrlStrategy
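Extracting every URL from a block of text is commonly done with a regular expression. The following is a minimal sketch under that assumption; the `UrlFinder` class and its deliberately simple pattern are hypothetical and may well differ from what `UrlStrategy` actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the library's UrlStrategy.
final class UrlFinder {
    // Simple pattern: "http://" or "https://" followed by characters that are
    // not whitespace, quotes, or angle brackets.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    private UrlFinder() {}

    // Returns every match in the order it appears in the input.
    static List<String> findAll(String input) {
        List<String> urls = new ArrayList<>();
        if (input == null) {
            return urls;
        }
        Matcher matcher = URL.matcher(input);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

A pattern this loose will also capture trailing punctuation such as a closing parenthesis or full stop, so real input usually needs a stricter expression or a post-processing step.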
 
USER_AGENT - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Name of the request header for the browser identifier.
USER_AGENT_360_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
360 Browser, Windows version.
USER_AGENT_AOYOU_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Maxthon (傲游) browser.
USER_AGENT_ARRAY - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
Collection of browser identifiers.
USER_AGENT_AVANT_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Avant Browser, Windows version.
USER_AGENT_EDAG_VERSION_11_476 - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
EDAG browser.
USER_AGENT_FIREFOX_VERSION_70_0 - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Firefox browser identifier, Firefox 70.0.1 on Windows.
USER_AGENT_FIREFOX_VERSION_MAC - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Firefox browser identifier, Firefox for Mac.
USER_AGENT_GOOGLE_VERSION_75_0 - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Google Chrome browser identifier, Chrome 75.0 by default.
USER_AGENT_GOOGLE_VERSION_78_0 - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Google Chrome browser identifier, Chrome 78.0 by default.
USER_AGENT_GREEN_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
GreenBrowser, Windows version.
USER_AGENT_IE_VERSION_11_476 - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Internet Explorer browser identifier, IE 11.476.
USER_AGENT_IE_VERSION_9_0 - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Internet Explorer browser identifier, IE 9.0.
USER_AGENT_LBBROWSER_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Cheetah Browser (猎豹), Windows version.
USER_AGENT_OPERA_VERSION_MAC - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Opera, Mac version.
USER_AGENT_OPERA_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Opera, Windows version.
USER_AGENT_QQ_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
QQ Browser, Windows version.
USER_AGENT_QQTT_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Tencent TT browser, Windows version.
USER_AGENT_SAFARI_VERSION_MAC - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Safari, Mac version.
USER_AGENT_SAFARI_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Safari, Windows version.
USER_AGENT_SOUGOU_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
Sogou browser, Windows version.
USER_AGENT_UC_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
UC Browser, Windows version.
USER_AGENT_WORLD_VERSION_WINDOWS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
TheWorld Browser (世界之窗), Windows version.
userAgent() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the browser identifier. When this value is empty, each request randomly picks one of the built-in browser identifiers.
userAgent(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the browser identifier.
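The fallback described for `userAgent()` (an empty value means a random built-in identifier is chosen per request) can be sketched as follows. The `UserAgentPicker` class is hypothetical, as is treating a whitespace-only value as empty; the built-in pool would be supplied by the caller.

```java
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of the library.
final class UserAgentPicker {
    private final List<String> builtIn;
    private final Random random;

    UserAgentPicker(List<String> builtIn, Random random) {
        if (builtIn.isEmpty()) {
            throw new IllegalArgumentException("need at least one built-in identifier");
        }
        this.builtIn = builtIn;
        this.random = random;
    }

    // Returns the configured value, or a random built-in one when nothing is configured.
    String pick(String configured) {
        if (configured != null && !configured.trim().isEmpty()) {
            return configured;
        }
        return builtIn.get(random.nextInt(builtIn.size()));
    }
}
```

Calling `pick` once per request, rather than once at startup, is what makes the identifier vary between requests.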
UserAgentConstant - Class in com.yishuifengxiao.common.crawler.domain.constant
Constants for browser identifiers.
UserAgentConstant() - Constructor for class com.yishuifengxiao.common.crawler.domain.constant.UserAgentConstant
 

V

valueOf(String) - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Pattern
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Rule
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Statu
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Type
Returns the enum constant of this type with the specified name.
values() - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Pattern
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Rule
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Statu
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum com.yishuifengxiao.common.crawler.domain.eunm.Type
Returns an array containing the constants of this enum type, in the order they are declared.
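`valueOf(String)` and `values()` are the standard methods the Java compiler generates for every enum. The sketch below shows their behavior with a hypothetical `Mode` enum, since the index does not list the actual constants of `Pattern`, `Rule`, `Statu`, or `Type`. `valueOf` is case-sensitive and throws `IllegalArgumentException` for an unknown name, so a guarded lookup is often convenient.

```java
// Hypothetical illustration; Mode and its constants are invented for this example.
final class EnumLookupDemo {
    enum Mode { FIRST, SECOND }

    private EnumLookupDemo() {}

    // Case-sensitive lookup that returns the fallback instead of throwing for unknown names.
    static Mode parse(String name, Mode fallback) {
        if (name == null) {
            return fallback;   // Enum.valueOf would throw NullPointerException here
        }
        try {
            return Mode.valueOf(name);
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // values() returns the constants in declaration order.
        for (Mode mode : Mode.values()) {
            System.out.println(mode.ordinal() + " " + mode);
        }
        System.out.println(parse("second", Mode.FIRST));
    }
}
```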

W

WAIT_DOWN - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
The set of URLs waiting to be downloaded.
WAIT_TIME_FOR_CLOSE - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
How long the crawler may go without new requests before the task is considered complete, in milliseconds. Defaults to 300000 ms (300 seconds).
waitTime() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Gets the idle timeout in milliseconds: if no new request task arrives within this interval, the task is considered finished.
Defaults to 300000 ms (300 seconds).
waitTime(long) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
Sets the idle timeout in milliseconds: if no new request task arrives within this interval, the task is considered finished.
Defaults to 300000 ms (300 seconds).
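The idle-timeout rule described for `waitTime` (the task is finished once no new request has arrived for the configured interval) can be sketched with a small tracker. The `IdleTracker` class is hypothetical, and timestamps are passed in explicitly so the logic can be checked without sleeping; how the library actually tracks this is not shown in the index.

```java
// Hypothetical helper for illustration; not part of the library.
final class IdleTracker {
    private final long waitTimeMillis;
    private long lastRequestAtMillis;

    IdleTracker(long waitTimeMillis, long startedAtMillis) {
        this.waitTimeMillis = waitTimeMillis;
        this.lastRequestAtMillis = startedAtMillis;
    }

    // Records that a new request arrived at the given time.
    void onRequest(long nowMillis) {
        lastRequestAtMillis = nowMillis;
    }

    // True once at least waitTimeMillis have passed since the last request.
    boolean isFinished(long nowMillis) {
        return nowMillis - lastRequestAtMillis >= waitTimeMillis;
    }
}
```

With the default of 300000 ms, a tracker started at time 0 reports finished at 300000 unless `onRequest` is called earlier, which restarts the interval.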
waitTime - Variable in class com.yishuifengxiao.common.crawler.domain.entity.CrawlerRule
Idle timeout in milliseconds, 300000 ms (300 seconds) by default: if no new request task arrives within this interval, the task is considered finished.

X

XPATH_ALL_LINK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
XPath expression that selects all links.
XpathStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
XPath extraction strategy.
Extracts all matching data from the input data using the XPath expression given as the parameter.
XpathStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.XpathStrategy
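XPath-based extraction can be demonstrated with the JDK's built-in `javax.xml.xpath` API. This is a sketch under stated assumptions: the `XpathDemo` class is hypothetical, the input must be well-formed XML (real web pages usually need an HTML-tolerant parser, which the library may use instead), and `//a/@href` is an illustrative expression, not necessarily the value of `XPATH_ALL_LINK`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not the library's XpathStrategy.
final class XpathDemo {
    private XpathDemo() {}

    // Returns the text content of every node matched by the expression,
    // or an empty list if the input or the expression cannot be processed.
    static List<String> select(String xml, String expression) {
        List<String> values = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
        } catch (Exception e) {
            // Malformed XML or an invalid expression: report no matches.
            return new ArrayList<>();
        }
        return values;
    }
}
```

For example, `XpathDemo.select(page, "//a/@href")` collects the `href` attribute of every `a` element in a well-formed document.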
 

Copyright © 2020 Pivotal Software, Inc. All rights reserved.