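I am trying to select the "next" button on a website; its link text contains a right-pointing arrow. When I view the source with "scrapy shell", the character is shown as its Unicode literal "\u2192". Based on that, I wrote the following Scrapy CrawlSpider: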
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.loader.processor import MapCompose
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import log, Request
from yelpscraper.items import YelpscraperItem
import re, urlparse
class YelpSpider(CrawlSpider):
    name = 'yelp'
    allowed_domains = ['yelp.com']
    start_urls = ['http://www.yelp.com/search?find_desc=attorney&find_loc=Austin%2C+TX&start=0']
    rules = (
        Rule(LinkExtractor(allow=r'biz', restrict_xpaths='//*[contains(@class, "natural-search-result")]//a[@class="biz-name"]'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'start', restrict_xpaths=u'//a[contains(@class, "prev-next")]/text()[contains(., "\u2192")]'), follow=True)
    )

    def parse_item(self, response):
        i = YelpscraperItem()
        i['phone'] = self.beautify(response.xpath('//*[@class="biz-phone"]/text()').extract())
        i['state'] = self.beautify(response.xpath('//span[@itemprop="addressRegion"]/text()').extract())
        i['company'] = self.beautify(response.xpath('//h1[contains(@class, "biz-page-title")]/text()').extract())
        website = i['website'] = self.beautify(response.xpath('//div[@class="biz-website"]/a/text()').extract())
        yield i
Note the second Rule in the rules tuple, which contains the problematic Unicode character:
Rule(LinkExtractor(allow=r'start', restrict_xpaths=u'//a[contains(@class, "prev-next")]/text()[contains(., "\u2192")]'), follow=True)
When I try to run this spider, I get the following traceback:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 607, in _tick
taskObj._oneWorkUnit()
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\utils\defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\utils\defer.py", line 96, in iter_errback
yield next(it)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\offsite.py", line 26, in process_spider_output
for x in result:
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spidermiddleware\depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spiders\crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\spiders\crawl.py", line 52, in _requests_to_follow
links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\linkextractors\lxmlhtml.py", line 107, in extract_links
links = self._extract_links(doc, response.url, response.encoding, base_url)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\linkextractor.py", line 94, in _extract_links
return self.link_extractor._extract_links(*args, **kwargs)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\linkextractors\lxmlhtml.py", line 50, in _extract_links
for el, attr, attr_val in self._iter_links(selector._root):
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\contrib\linkextractors\lxmlhtml.py", line 38, in _iter_links
for el in document.iter(etree.Element):
exceptions.AttributeError: 'unicode' object has no attribute 'iter'
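All I want to do is select this link, and I can't think of a way to select it without using this character (the link's position changes from page to page). Is there a way to select it using an ASCII code or Unicode? Is that what is causing the problem?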
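Answer
According to the documentation, restrict_xpaths should be a list or a str. You are passing a unicode string, which is why you get an error.
In addition, you don't need to check text(); checking for the prev-next class is enough: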
rules = (
    Rule(LinkExtractor(allow=r'biz', restrict_xpaths='//*[contains(@class, "natural-search-result")]//a[@class="biz-name"]'),
         callback='parse_item', follow=True),
    Rule(LinkExtractor(allow=r'start', restrict_xpaths='//a[contains(@class, "prev-next")]'),
         follow=True)
)
Tested (it crawls without errors and follows the pagination).
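A note on the traceback, offered as one possible reading rather than something the answer confirms: the last frame calls document.iter(...) on a unicode object. An XPath that ends in /text() yields text strings instead of elements, and a string has no iter method for a link extractor to walk, which would also explain why dropping /text() helps. The sketch below reproduces that difference with the standard library's xml.etree.ElementTree instead of Scrapy. It is written for Python 3, and the markup, the href value, and the helper hrefs_under are hypothetical stand-ins, not taken from the site or from Scrapy's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the pagination link (not copied from the site).
snippet = '<div><a class="page-option prev-next" href="/search?start=10">Next \u2192</a></div>'
root = ET.fromstring(snippet)

link = root.find("a")   # an element: this is what an XPath ending in //a[...] selects
text_node = link.text   # a plain string: this is what an XPath ending in /text() selects


def hrefs_under(node):
    """Hypothetical helper: walk a subtree with .iter() and collect href values."""
    return [el.get("href") for el in node.iter("a")]


# Elements expose .iter(); strings do not.
element_has_iter = hasattr(link, "iter")
text_has_iter = hasattr(text_node, "iter")

# Walking the element works and finds the link itself.
found = hrefs_under(link)

# Walking the text string fails in the same way as the last traceback frame.
try:
    hrefs_under(text_node)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(element_has_iter, text_has_iter, found, error_message)
```

If matching the arrow is still wanted, keeping it as a predicate on the a element, for example //a[contains(@class, "prev-next")][contains(., "\u2192")], should leave the result an element. That expression is an untested assumption here, not part of the answer.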