Package | Description |
---|---|
com.yishuifengxiao.common.crawler |
Modifier and Type | Method and Description |
---|---|
CrawlerBuilder |
CrawlerBuilder.addExtractRule(ExtractRule extractRule)
Adds a content extraction rule.
|
CrawlerBuilder |
CrawlerBuilder.addExtractRule(String key,
ExtractFieldRule fieldExtractRule)
Adds a field extraction rule to the content extraction rule identified by the given code (key).
|
CrawlerBuilder |
CrawlerBuilder.addExtractRules(List<ExtractRule> list)
Adds a list of content extraction rules.
|
CrawlerBuilder |
CrawlerBuilder.addFieldExtractRules(String key,
List<ExtractFieldRule> list)
Adds a list of field extraction rules to the content extraction rule identified by the given code (key).
|
CrawlerBuilder |
CrawlerBuilder.addHeader(HeaderRule headerRule)
Adds a request header parameter.
|
CrawlerBuilder |
CrawlerBuilder.addHeaders(List<HeaderRule> list)
Adds a list of request header parameters.
|
CrawlerBuilder |
CrawlerBuilder.addLinkRule(MatcherRule linkRule)
Adds a link extraction rule.
|
CrawlerBuilder |
CrawlerBuilder.addLinkRules(Set<MatcherRule> linkRules)
Adds a set of link extraction rules.
|
CrawlerBuilder |
CrawlerBuilder.cacheControl(String cacheControl)
Sets the web page cache policy.
Defaults to max-age=0 |
CrawlerBuilder |
CrawlerBuilder.circularRedirectsAllowed(boolean circularRedirectsAllowed)
Sets whether circular redirects should be allowed.
|
CrawlerBuilder |
CrawlerBuilder.connectTimeout(int connectTimeout)
Sets the timeout, in milliseconds, for establishing a connection.
|
CrawlerBuilder |
CrawlerBuilder.content(ContentRule content)
Sets the content parsing rule.
|
CrawlerBuilder |
CrawlerBuilder.contentCompressionEnabled(boolean contentCompressionEnabled)
Sets whether to ask the target server to compress the content.
|
CrawlerBuilder |
CrawlerBuilder.contentPageRule(MatcherRule contentPageRule)
Sets the content page URL rule.
Separate multiple rules with half-width (ASCII) commas. |
CrawlerBuilder |
CrawlerBuilder.cookieSpec(String cookieSpec)
Sets the name of the cookie specification used for HTTP state management.
|
CrawlerBuilder |
CrawlerBuilder.cookieValue(String cookieValue)
Sets the cookie value to send with each request.
|
static CrawlerBuilder |
CrawlerBuilder.create()
Creates a default rule builder for the 风铃虫 crawler.
|
static CrawlerBuilder |
CrawlerBuilder.create(CrawlerRule crawlerRule)
Creates a default rule builder for the 风铃虫 crawler from an existing rule.
|
CrawlerBuilder |
CrawlerBuilder.failureMark(String failureMark)
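Sets the failure marker.
If the downloaded content contains this value, the request is treated as blocked by the server. The value is a regular expression; if it is empty, this check is skipped. |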
CrawlerBuilder |
CrawlerBuilder.interceptCount(int interceptCount)
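Sets the interception count threshold.
The number of retries allowed when the failure marker is found in the downloaded content several times in a row; once this count is exceeded, the crawler instance is shut down. Defaults to 5. |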
CrawlerBuilder |
CrawlerBuilder.interval(long intervalInSeconds)
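Sets the interval between requests, in milliseconds (despite the parameter name). The actual interval is a random number between 0 and twice this value.
This keeps the server from blocking the crawler for requesting too frequently. Defaults to 10000 milliseconds (10 seconds). |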
CrawlerBuilder |
CrawlerBuilder.link(LinkRule link)
Sets the link parsing rule.
|
CrawlerBuilder |
CrawlerBuilder.matcherCaseSensitive(Boolean matcherCaseSensitive)
Sets whether matching is case-sensitive.
|
CrawlerBuilder |
CrawlerBuilder.matcherFuzzy(Boolean matcherFuzzy)
Sets whether matching is fuzzy.
|
CrawlerBuilder |
CrawlerBuilder.matcherMode(Boolean matcherMode)
Sets the matching mode.
|
CrawlerBuilder |
CrawlerBuilder.matcherPattern(String matcherPattern)
Sets the content matching parameter.
|
CrawlerBuilder |
CrawlerBuilder.matcherTarget(String matcherTarget)
Sets the expected match value.
|
CrawlerBuilder |
CrawlerBuilder.matcherType(Type matcherType)
Sets the content matching type.
|
CrawlerBuilder |
CrawlerBuilder.maxDepth(long maxDepth)
Sets the maximum request depth.
|
CrawlerBuilder |
CrawlerBuilder.maxRedirects(int maxRedirects)
Sets the maximum number of redirects to follow.
|
CrawlerBuilder |
CrawlerBuilder.normalizeUri(boolean normalizeUri)
Sets whether the client should normalize URIs in requests.
|
CrawlerBuilder |
CrawlerBuilder.pageRule(PageRule pageRule)
Sets the content matching rule.
|
CrawlerBuilder |
CrawlerBuilder.redirectsEnabled(boolean redirectsEnabled)
Sets whether redirects should be handled automatically.
|
CrawlerBuilder |
CrawlerBuilder.referrer(String referrer)
Sets the referrer page of the request.
|
CrawlerBuilder |
CrawlerBuilder.relativeRedirectsAllowed(boolean relativeRedirectsAllowed)
Sets whether relative redirects should be rejected.
|
CrawlerBuilder |
CrawlerBuilder.retryCount(int retryCount)
Sets the number of retries when a request fails.
|
CrawlerBuilder |
CrawlerBuilder.setExtractRules(List<ExtractRule> list)
Sets the content extraction rules.
The existing content extraction rules are cleared first. |
CrawlerBuilder |
CrawlerBuilder.setExtractRules(String key,
List<ExtractFieldRule> list)
Sets the field extraction rules of the content extraction rule identified by the given code (key).
|
CrawlerBuilder |
CrawlerBuilder.setHeaders(List<HeaderRule> list)
Clears the existing values, then sets the request header parameters.
|
CrawlerBuilder |
CrawlerBuilder.setLinkRules(Set<MatcherRule> linkRules)
Clears the existing link extraction rules, then sets the link extraction rules.
|
CrawlerBuilder |
CrawlerBuilder.site(SiteRule site)
Sets the site configuration rule data.
|
CrawlerBuilder |
CrawlerBuilder.startUrl(String startUrl)
Sets the start URL.
Separate multiple start URLs with half-width (ASCII) commas. |
CrawlerBuilder |
CrawlerBuilder.threadNum(int threadNum)
Sets the number of threads the 风铃虫 crawler uses for parsing. Defaults to 1.
|
CrawlerBuilder |
CrawlerBuilder.userAgent(String userAgent)
Sets the browser identifier (user agent).
|
CrawlerBuilder |
CrawlerBuilder.waitTime(long waitTimeInSeconds)
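Sets the idle wait timeout, in milliseconds (despite the parameter name). If no new request task arrives within this period, the task is considered finished.
Defaults to 300000 milliseconds (300 seconds). |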
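
Several descriptions in the table above state runtime behavior rather than just a setter: the randomized request interval, the regular-expression failure marker, the consecutive interception threshold, and comma-separated start URLs. The following is a minimal, self-contained sketch of how those documented semantics could be read. It does not use or reproduce the library's implementation; the class `BuilderSemanticsSketch` and all of its members are hypothetical names, and details such as trimming whitespace around URLs or using a substring-style regex search are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Pattern;

// Hypothetical helper class; NOT part of com.yishuifengxiao.common.crawler.
class BuilderSemanticsSketch {

    // interval(...): the pause is a random value from 0 up to twice the configured milliseconds.
    static long randomDelayMillis(long intervalMillis, Random rng) {
        if (intervalMillis <= 0) {
            return 0L;
        }
        return (long) (rng.nextDouble() * 2 * intervalMillis);
    }

    // failureMark(...): a regex looked for in the downloaded content; empty skips the check.
    // Assumption: "contains" is modeled with Matcher.find() rather than a full match.
    static boolean isIntercepted(String content, String failureMark) {
        if (failureMark == null || failureMark.isEmpty()) {
            return false;
        }
        return Pattern.compile(failureMark).matcher(content).find();
    }

    // startUrl(...): multiple start URLs separated by ASCII commas.
    // Assumption: surrounding whitespace and empty entries are ignored.
    static List<String> splitStartUrls(String startUrl) {
        List<String> urls = new ArrayList<>();
        for (String part : startUrl.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    // interceptCount(...): tracks consecutive intercepted responses.
    static final class InterceptCounter {
        private final int threshold;
        private int consecutive;

        InterceptCounter(int threshold) {
            this.threshold = threshold;
        }

        // Returns true once the consecutive count exceeds the threshold
        // (the point at which the table says the crawler instance is shut down).
        boolean record(boolean intercepted) {
            consecutive = intercepted ? consecutive + 1 : 0;
            return consecutive > threshold;
        }
    }

    public static void main(String[] args) {
        System.out.println(splitStartUrls("https://example.com/a, https://example.com/b"));
        System.out.println(isIntercepted("Access denied: captcha required", "captcha|denied"));
        System.out.println(randomDelayMillis(10000L, new Random()));
    }
}
```

With the documented defaults (an interval of 10000 milliseconds and an interception threshold of 5), this reading gives pauses below 20 seconds and a shutdown on the sixth consecutive intercepted response; whether the library counts in exactly this way is not stated in the table.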
Copyright © 2020 Pivotal Software, Inc. All rights reserved.