WebCollector-Python
WebCollector-Python is a Python crawler framework (kernel) that requires no configuration and is easy to build on. It provides a concise API with which a powerful crawler can be implemented in only a small amount of code.
WebCollector (Java edition)
The Java edition of WebCollector offers higher performance than WebCollector-Python:
https://github.com/CrawlScript/WebCollector
Installation
Install via pip:
pip install https://github.com/CrawlScript/WebCollector-Python/archive/master.zip
Examples
Basic
Quick Start
Automatically Detecting URLs
demo_auto_news_crawler.py:
# coding=utf-8
import webcollector as wc

class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=True)
        self.num_threads = 10
        self.add_seed("https://github.blog/")
        self.add_regex("https://github.blog/[0-9]+.*")

    def visit(self, page, detected):
        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")

crawler = NewsCrawler()
crawler.start(10)
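In auto-detect mode, every URL the crawler discovers is filtered against the patterns registered with add_regex. Conceptually that filtering step looks like the stdlib-only sketch below (the helper name `matches_any` is illustrative and not part of the WebCollector-Python API; whole-URL matching is an assumption):

```python
import re

def matches_any(url, patterns):
    """Return True if the URL fully matches any registered regex."""
    return any(re.fullmatch(p, url) for p in patterns)

# The single pattern used by the demo crawler above.
patterns = ["https://github.blog/[0-9]+.*"]

candidates = [
    "https://github.blog/2019-01-01-example-post/",  # dated post: matches
    "https://github.blog/about/",                    # no digits: rejected
]
accepted = [u for u in candidates if matches_any(u, patterns)]
print(accepted)
```

Only URLs passing this filter are scheduled for fetching, which is why the demo's visit() sees dated blog-post pages.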
Manually Detecting URLs
demo_manual_news_crawler.py:
# coding=utf-8
import webcollector as wc

class NewsCrawler(wc.RamCrawler):
    def __init__(self):
        super().__init__(auto_detect=False)
        self.num_threads = 10
        self.add_seed("https://github.blog/")

    def visit(self, page, detected):
        detected.extend(page.links("https://github.blog/[0-9]+.*"))

        if page.match_url("https://github.blog/[0-9]+.*"):
            title = page.select("h1.lh-condensed")[0].text.strip()
            content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
            print("\nURL: ", page.url)
            print("TITLE: ", title)
            print("CONTENT: ", content[:50], "...")

crawler = NewsCrawler()
crawler.start(10)
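In manual mode, visit() itself feeds the next crawl round by extending detected with the links it wants followed. The extraction that page.links(regex) performs can be approximated with the stdlib HTML parser; this is only a sketch (the `LinkCollector` class and `links` function are illustrative, and the real implementation may handle relative URLs differently):

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def links(html, pattern):
    """Return all anchor URLs on the page that fully match the regex."""
    collector = LinkCollector()
    collector.feed(html)
    return [u for u in collector.hrefs if re.fullmatch(pattern, u)]

html = ('<a href="https://github.blog/2019-05-post/">post</a>'
        '<a href="https://github.blog/about/">about</a>')
print(links(html, "https://github.blog/[0-9]+.*"))
```

Everything appended to detected is scheduled for the next depth level, so this one line replaces what auto_detect=True plus add_regex did in the previous example.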