- save(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.InMemoryRequestCache
-
- save(Task, Request) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
-
- save(Task, Request) - Method in interface com.yishuifengxiao.common.crawler.cache.RequestCache
-
Stores the request task in the collection with the specified name.
- SC_INTERNAL_SERVER_ERROR - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
-
Default response code when a request fails: 500.
- SC_OK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
-
Default response code when a request succeeds: 200.
- Scheduler - Interface in com.yishuifengxiao.common.crawler.scheduler
-
Resource scheduler.
Responsible for scheduling and managing resources.
Its functions are as follows:
1.
- SchedulerDecorator - Class in com.yishuifengxiao.common.crawler.scheduler
-
Resource scheduler decorator.
Responsible for scheduling and managing resources.
Its functions are as follows:
1.
- SchedulerDecorator(RequestCache, Scheduler, DuplicateRemover) - Constructor for class com.yishuifengxiao.common.crawler.scheduler.SchedulerDecorator
-
- SCRIPT_TIME_OUT_MILLIS - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
-
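Timeout for asynchronous scripts. WebDriver can execute scripts asynchronously; this value sets how long to wait for an asynchronously executed script to return its result. Unit: milliseconds.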
- ScriptStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
-
Script extractor.
Extracts data from the input parameter using a JS script.
An example script follows:
- ScriptStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.ScriptStrategy
-
- SeleniumDownloader - Class in com.yishuifengxiao.common.crawler.downloader.impl
-
Firefox-based downloader.
Implemented with selenium-java.
- SeleniumDownloader(String, long) - Constructor for class com.yishuifengxiao.common.crawler.downloader.impl.SeleniumDownloader
-
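Constructor.
The path to the geckodriver browser driver file must be passed in.
geckodriver can be downloaded from https://github.com/mozilla/geckodriver/releases
Configure this parameter according to the runtime environment.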
- SEPARATOR - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.CrawlerConstant
-
Delimiter used to join the values when extraction yields multiple results.
- setCode(int) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
-
- setConnectTimeout(int) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the timeout for establishing a connection.
- setContentExtract(ContentExtract) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the content parser.
- setCookies(Map<String, String>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the cookies sent with the request.
- setCrawlerListener(CrawlerListener) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the event listener.
- setData(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
-
Sets the output data.
Replaces the existing output data.
- setDownloader(Downloader) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the web page downloader.
- setDuplicateRemover(DuplicateRemover) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the request duplicate remover.
- setExtra(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the extra information carried by the 风铃虫 crawler.
This clears any existing extra information.
- setExtra(String, Object) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets an extra information entry on the 风铃虫 crawler.
- setExtractRules(List<ExtractRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Sets the content extraction rules.
Clears the existing content extraction rules.
- setExtractRules(String, List<ExtractFieldRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Sets the field extraction rules of the content extraction rule identified by the given code.
- setExtras(Map<String, Object>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the extra parameters to carry with the request.
- setHeaders(List<HeaderRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Clears the existing values, then sets the request header parameters.
- setHeaders(Map<String, String>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the headers for the current request.
- setLinkExtract(LinkExtract) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the link parser.
- setLinkRules(Set<MatcherRule>) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Clears the existing link extraction rules, then sets the new link extraction rules.
- setLinks(List<String>) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
-
Sets the link addresses.
Replaces the existing collection of link addresses.
- setMethod(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the request method.
- setName(String) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the name of the 风铃虫 crawler instance.
- setName(String) - Method in class com.yishuifengxiao.common.crawler.pool.SimpleThreadFactory
-
- setPipeline(Pipeline) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the information outputter (pipeline).
- setPriority(long) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the priority of the request.
- setRawTxt(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
-
- setRedirectUrl(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
-
Sets the address reached after redirection, for downloaders that follow redirects.
- setRedisTemplate(RedisTemplate<String, Object>) - Method in class com.yishuifengxiao.common.crawler.cache.RedisRequestCache
-
- setRedisTemplate(RedisTemplate<String, Object>) - Method in class com.yishuifengxiao.common.crawler.scheduler.impl.RedisScheduler
-
- setReferrer(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the referrer address of the request.
- setRequest(Request) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
-
- setRequestCache(RequestCache) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the resource cache.
- setScheduler(Scheduler) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the resource scheduler.
- setSkip(boolean) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Page
-
- setStatuObserver(StatuObserver) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Sets the status observer.
- setThreadPool(ThreadPoolExecutor) - Method in class com.yishuifengxiao.common.crawler.Crawler
-
- setUrl(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the target address of the request.
- setUserAgent(String) - Method in class com.yishuifengxiao.common.crawler.domain.entity.Request
-
Sets the browser user agent.
- SHORT_ADDR_LINK - Static variable in class com.yishuifengxiao.common.crawler.domain.constant.RuleConstant
-
Marker for a simple (short) address.
- ShortLinkFilter - Class in com.yishuifengxiao.common.crawler.link.filter.impl
-
Short link filter.
Handles links that begin with a double slash by completing them into full network addresses.
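To illustrate what completing a double-slash link can look like, here is a minimal sketch. It does not use the library's `ShortLinkFilter` or `BaseLinkFilter`; the class `ShortLinkSketch`, its method `complete`, and the choice to reuse the scheme of the page the link was found on (falling back to `http`) are all assumptions made for illustration.

```java
import java.net.URI;

/** Hypothetical helper, not part of the library: completes links that start with "//". */
final class ShortLinkSketch {

    /**
     * Returns the link unchanged unless it starts with "//", in which case the
     * scheme of the page it was found on is prepended (assumed behaviour).
     */
    static String complete(String pageUrl, String link) {
        if (link == null || !link.startsWith("//")) {
            return link;
        }
        String scheme = URI.create(pageUrl).getScheme();
        if (scheme == null) {
            scheme = "http"; // assumed fallback when the page URL has no scheme
        }
        return scheme + ":" + link;
    }

    public static void main(String[] args) {
        System.out.println(complete("https://example.com/a/b.html", "//cdn.example.com/x.js"));
    }
}
```

Links that do not start with a double slash pass through untouched, so a helper like this could sit in front of other link handling.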
- ShortLinkFilter(BaseLinkFilter) - Constructor for class com.yishuifengxiao.common.crawler.link.filter.impl.ShortLinkFilter
-
- SimpleContentExtract - Class in com.yishuifengxiao.common.crawler.content.impl
-
Default simple content parser.
Invokes the content extractors to parse data from the input content.
- SimpleContentExtract(List<ContentExtractor>) - Constructor for class com.yishuifengxiao.common.crawler.content.impl.SimpleContentExtract
-
- SimpleContentExtractor - Class in com.yishuifengxiao.common.crawler.extractor.content
-
Simple content extractor.
- SimpleContentExtractor(ExtractRule) - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.SimpleContentExtractor
-
- SimpleContentMatcher - Class in com.yishuifengxiao.common.crawler.content.matcher
-
Default content matcher implementation.
- SimpleContentMatcher() - Constructor for class com.yishuifengxiao.common.crawler.content.matcher.SimpleContentMatcher
-
- SimpleCrawlerListener - Class in com.yishuifengxiao.common.crawler.listener
-
Default event listener for 风铃虫 crawler processing.
Outputs nothing.
- SimpleCrawlerListener() - Constructor for class com.yishuifengxiao.common.crawler.listener.SimpleCrawlerListener
-
- SimpleDownloader - Class in com.yishuifengxiao.common.crawler.downloader.impl
-
Web page downloader based on JSOUP.
Its features are as follows:
1.
- SimpleDownloader() - Constructor for class com.yishuifengxiao.common.crawler.downloader.impl.SimpleDownloader
-
- SimpleDuplicateRemover - Class in com.yishuifengxiao.common.crawler.scheduler.remover
-
Full-path duplicate remover.
A simple request duplicate remover implementation.
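As an illustration of full-path deduplication, the sketch below treats the complete URL string as the key, so two URLs that differ only in their query string count as different requests. The class `FullUrlDedupSketch` and the method `firstVisit` are hypothetical names; how `SimpleDuplicateRemover` actually stores its state is not shown in this index.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch, not the library's implementation: dedup keyed on the full URL. */
final class FullUrlDedupSketch {

    // Thread-safe set of every URL seen so far.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Returns true the first time a URL is offered and false for every repeat. */
    boolean firstVisit(String url) {
        return url != null && seen.add(url);
    }

    public static void main(String[] args) {
        FullUrlDedupSketch dedup = new FullUrlDedupSketch();
        System.out.println(dedup.firstVisit("https://example.com/a?x=1"));
        System.out.println(dedup.firstVisit("https://example.com/a?x=1"));
    }
}
```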
- SimpleDuplicateRemover() - Constructor for class com.yishuifengxiao.common.crawler.scheduler.remover.SimpleDuplicateRemover
-
- SimpleLinkExtractor - Class in com.yishuifengxiao.common.crawler.extractor.links.impl
-
Simple link extractor implementation.
- SimpleLinkExtractor() - Constructor for class com.yishuifengxiao.common.crawler.extractor.links.impl.SimpleLinkExtractor
-
- SimplePathMatcher - Class in com.yishuifengxiao.common.crawler.macther.impl
-
Simple matcher.
Performs no matching; everything passes.
- SimplePathMatcher() - Constructor for class com.yishuifengxiao.common.crawler.macther.impl.SimplePathMatcher
-
- SimplePipeline - Class in com.yishuifengxiao.common.crawler.pipeline
-
Default information outputter implementation.
Writes the information to the log.
- SimplePipeline() - Constructor for class com.yishuifengxiao.common.crawler.pipeline.SimplePipeline
-
- SimpleRequestCreater - Class in com.yishuifengxiao.common.crawler.scheduler.request
-
Simple request creator.
- SimpleRequestCreater() - Constructor for class com.yishuifengxiao.common.crawler.scheduler.request.SimpleRequestCreater
-
- SimpleScheduler - Class in com.yishuifengxiao.common.crawler.scheduler.impl
-
Simple resource scheduler.
- SimpleScheduler() - Constructor for class com.yishuifengxiao.common.crawler.scheduler.impl.SimpleScheduler
-
- SimpleSimulator - Class in com.yishuifengxiao.common.crawler.simulator
-
Simple extraction simulator.
- SimpleSimulator() - Constructor for class com.yishuifengxiao.common.crawler.simulator.SimpleSimulator
-
- SimpleStatuObserver - Class in com.yishuifengxiao.common.crawler.monitor
-
Default 风铃虫 crawler status observer implementation.
- SimpleStatuObserver() - Constructor for class com.yishuifengxiao.common.crawler.monitor.SimpleStatuObserver
-
- SimpleThreadFactory - Class in com.yishuifengxiao.common.crawler.pool
-
Thread factory.
- SimpleThreadFactory(String) - Constructor for class com.yishuifengxiao.common.crawler.pool.SimpleThreadFactory
-
- Simulator - Interface in com.yishuifengxiao.common.crawler.simulator
-
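Extraction tester.
Used to test whether the 风铃虫 crawler rules are configured correctly; do not use it as a production bulk-crawling tool.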
- SimulatorData - Class in com.yishuifengxiao.common.crawler.domain.entity
-
Simulation result data.
- SimulatorData() - Constructor for class com.yishuifengxiao.common.crawler.domain.entity.SimulatorData
-
- site() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Gets the site configuration rule data.
- site(SiteRule) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Sets the site configuration rule data.
- SiteConstant - Class in com.yishuifengxiao.common.crawler.domain.constant
-
Site rule constants class.
- SiteConstant() - Constructor for class com.yishuifengxiao.common.crawler.domain.constant.SiteConstant
-
- SiteRule - Class in com.yishuifengxiao.common.crawler.domain.model
-
Site rule.
- SiteRule() - Constructor for class com.yishuifengxiao.common.crawler.domain.model.SiteRule
-
- start() - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Asynchronously starts a 风铃虫 crawler instance.
- start() - Method in interface com.yishuifengxiao.common.crawler.Task
-
Asynchronously starts the 风铃虫 crawler instance.
- startUrl() - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Gets the start URLs.
Multiple start URLs are separated by half-width (ASCII) commas.
- startUrl(String) - Method in class com.yishuifengxiao.common.crawler.CrawlerBuilder
-
Sets the start URLs.
Multiple start URLs are separated by half-width (ASCII) commas.
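The comma-separated format of the start URL string can be illustrated with a small sketch. The class `StartUrlSketch` and the method `split` are hypothetical and not part of `CrawlerBuilder`; trimming whitespace and dropping empty entries are assumptions about sensible handling, not documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the library: splits a comma-separated start URL string. */
final class StartUrlSketch {

    /** Splits on ASCII commas, trims each part, and skips empty parts (assumed handling). */
    static List<String> split(String startUrl) {
        List<String> urls = new ArrayList<>();
        if (startUrl == null) {
            return urls;
        }
        for (String part : startUrl.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                urls.add(trimmed);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(split("https://a.example/, https://b.example/"));
    }
}
```

Note that a URL which itself contains a comma cannot be expressed in this format without some form of escaping.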
- statCheck() - Method in class com.yishuifengxiao.common.crawler.domain.model.SiteRule
-
Whether to perform the interception check.
- Statu - Enum in com.yishuifengxiao.common.crawler.domain.eunm
-
风铃虫 crawler status.
- StatuObserver - Interface in com.yishuifengxiao.common.crawler.monitor
-
风铃虫 crawler status observer.
Monitors changes in the crawler's status.
- stop() - Method in class com.yishuifengxiao.common.crawler.Crawler
-
Stops running.
- stop() - Method in interface com.yishuifengxiao.common.crawler.Task
-
Stops the 风铃虫 crawler instance.
- Strategy - Interface in com.yishuifengxiao.common.crawler.extractor.content.strategy
-
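Extraction strategy.
Extracts the corresponding information from the input data according to the strategy and returns it directly as the output data.
If processing fails or the input parameter is invalid, the output is an empty string.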
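The "empty string on failure" contract described for `Strategy` can be sketched as a wrapper around any extraction function. Everything here is hypothetical: the class `SafeStrategySketch`, the method `extract`, and the use of `BiFunction<String, String, String>` (input plus one parameter) stand in for the real interface, whose method signatures are not listed in this index.

```java
import java.util.function.BiFunction;

/** Hypothetical sketch of the contract: any failure or invalid input yields "". */
final class SafeStrategySketch {

    /**
     * Applies fn to (input, param). Returns "" when the input is null,
     * when fn throws a runtime exception, or when fn returns null.
     */
    static String extract(BiFunction<String, String, String> fn, String input, String param) {
        if (input == null) {
            return "";
        }
        try {
            String out = fn.apply(input, param);
            return out == null ? "" : out;
        } catch (RuntimeException e) {
            return ""; // processing failed: the contract asks for an empty string
        }
    }

    public static void main(String[] args) {
        System.out.println(extract((in, p) -> in.toUpperCase(), "abc", null));
    }
}
```

Returning an empty string rather than throwing lets a caller chain several strategies without a try/catch around each step, at the cost of hiding the reason for a failure.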
- StrategyFactory - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy
-
Extraction strategy factory.
- StrategyFactory() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.StrategyFactory
-
- SubstrStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
-
Substring strategy.
Extracts a substring of the specified length from the input data according to the parameter.
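A minimal sketch of length-based truncation follows. It assumes the parameter is a decimal length passed as a string and that the characters are taken from the start of the input; the index does not state either point, and `SubstrSketch.take` is a hypothetical name rather than the library's method.

```java
/** Hypothetical sketch, not the library's SubstrStrategy: keeps the first N characters. */
final class SubstrSketch {

    /**
     * Returns the first lengthParam characters of input (assumed semantics).
     * Invalid or negative lengths and null arguments yield "".
     */
    static String take(String input, String lengthParam) {
        if (input == null || lengthParam == null) {
            return "";
        }
        try {
            int length = Integer.parseInt(lengthParam.trim());
            if (length < 0) {
                return "";
            }
            // Clamp so a length longer than the input returns the whole input.
            return input.substring(0, Math.min(length, input.length()));
        } catch (NumberFormatException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(take("hello world", "5"));
    }
}
```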
- SubstrStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.SubstrStrategy
-
- SystemStrategy - Class in com.yishuifengxiao.common.crawler.extractor.content.strategy.impl
-
System placeholder replacement strategy.
Replaces the system placeholder [@<yishui>@] in the input data with the specified string.
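Placeholder replacement of this kind can be sketched with a literal string replace. The placeholder text `[@<yishui>@]` comes from the description above; the class `PlaceholderSketch`, the method `fill`, and the decision to return an empty string for null arguments are assumptions for illustration.

```java
/** Hypothetical sketch, not the library's SystemStrategy: literal placeholder replacement. */
final class PlaceholderSketch {

    // Placeholder text as given in the description above.
    static final String PLACEHOLDER = "[@<yishui>@]";

    /** Replaces every occurrence of the placeholder with the replacement string. */
    static String fill(String input, String replacement) {
        if (input == null || replacement == null) {
            return ""; // assumed handling of invalid arguments
        }
        // String.replace matches literally, so the brackets need no regex escaping.
        return input.replace(PLACEHOLDER, replacement);
    }

    public static void main(String[] args) {
        System.out.println(fill("https://example.com/list?page=[@<yishui>@]", "3"));
    }
}
```

Using `String.replace` instead of `replaceAll` avoids treating `[`, `]`, and `<` as regular-expression syntax.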
- SystemStrategy() - Constructor for class com.yishuifengxiao.common.crawler.extractor.content.strategy.impl.SystemStrategy
-