WebCollector-Python
WebCollector-Python is a Python crawler framework (kernel) that requires no configuration and is easy to build on. It provides a concise API with which a powerful crawler can be implemented in only a small amount of code.
WebCollector (Java edition)
The Java edition of WebCollector offers higher performance than WebCollector-Python:
https://github.com/CrawlScript/WebCollector
Installation
Install via pip:
pip install https://github.com/CrawlScript/WebCollector-Python/archive/master.zip
Examples
Basic
Quick Start
Automatically Detecting URLs
demo_auto_news_crawler.py:
# coding=utf-8
import webcollector as wc

class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True)
        self.num_threads = 10
        self.add_seed("https://github.blog/")
        self.add_regex("https://github.blog/[0-9]+.*")

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")

crawler = NewsCrawler()
crawler.start(10)
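In auto-detect mode, every URL the crawler discovers is filtered against the patterns registered with add_regex. Conceptually that filtering step looks like the stdlib-only sketch below (the helper name `matches_any` is illustrative and not part of the WebCollector-Python API; whole-URL matching is an assumption):

```python
import re

def matches_any(url, patterns):
    """Return True if the URL fully matches any registered regex."""
    return any(re.fullmatch(p, url) for p in patterns)

# The single pattern used by the demo crawler above.
patterns = ["https://github.blog/[0-9]+.*"]

candidates = [
    "https://github.blog/2019-01-01-example-post/",  # dated post: matches
    "https://github.blog/about/",                    # no digits: rejected
]
accepted = [u for u in candidates if matches_any(u, patterns)]
print(accepted)
```

Only URLs passing this filter are scheduled for fetching, which is why the demo's visit() sees dated blog-post pages.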
Manually Detecting URLs
demo_manual_news_crawler.py:
# coding=utf-8
import webcollector as wc

class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=False)
        self.num_threads = 10
        self.add_seed("https://github.blog/")

    def visit(self, page, detected):
        detected.extend(page.links("https://github.blog/[0-9]+.*"))

        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")

crawler = NewsCrawler()
crawler.start(10)
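In manual mode, visit() itself feeds the next crawl round by extending detected with the links it wants followed. The extraction that page.links(regex) performs can be approximated with the stdlib HTML parser; this is only a sketch (the `LinkCollector` class and `links` function are illustrative, and the real implementation may handle relative URLs differently):

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def links(html, pattern):
    """Return all anchor URLs on the page that fully match the regex."""
    collector = LinkCollector()
    collector.feed(html)
    return [u for u in collector.hrefs if re.fullmatch(pattern, u)]

html = ('<a href="https://github.blog/2019-05-post/">post</a>'
        '<a href="https://github.blog/about/">about</a>')
print(links(html, "https://github.blog/[0-9]+.*"))
```

Everything appended to detected is scheduled for the next depth level, so this one line replaces what auto_detect=True plus add_regex did in the previous example.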