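I am new to all of the tools involved here. My goal is to extract all URLs from a large number of pages that are linked to each other through a "Weiter" ("next") button, and to do this for several start URLs. I decided to try this with Scrapy. The page is generated dynamically, so I learned that I needed an additional tool and installed Splash for it. The installation works, and I configured it according to the tutorials. I then managed to get the first results page by sending a "Return" keypress to the search input field, which gives me the same results I see in a browser. My problem is that I want to click the "next" button on the generated page and do not know exactly how. From what I have read on several pages, this is not always easy. I tried the suggested solutions without success. I think I am not far off and would appreciate some help. Thanks.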
My settings.py:
BOT_NAME = 'gr'
SPIDER_MODULES = ['gr.spiders']
NEWSPIDER_MODULE = 'gr.spiders'
ROBOTSTXT_OBEY = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
My spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest
import json
# import base64


class GrSpider(scrapy.Spider):
    name = 'gr_'
    allowed_domains = ['lawsearch.gr.ch']
    start_urls = ['http://www.lawsearch.gr.ch/le/']

    def start_requests(self):
        # Load the start page and press Return in the search field
        # to produce the first results page.
        script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()
            splash:wait(0.3)
            splash:send_keys("<Return>")
            splash:wait(0.3)
            return splash:html()
        end
        """
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='execute',
                                args={'lua_source': script})

    def parse(self, response):
        # Attempt to click the "next" button; this is the part that does not work yet.
        script3 = """
        function main(splash)
            splash:autoload{url="https://code.jquery.com/jquery-3.2.1.min.js"}
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()
            -- splash:wait(2.8)
            -- local element = splash:select('.result-pager-next-active .simplebutton')
            -- element:mouse_click()
            -- local bounds = element:bounds()
            -- assert(element:mouse_click{x=bounds.width, y=bounds.height})
            -- next attempt
            -- https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/
            -- https://stackoverflow.com/questions/35720323/scrapyjs-splash-click-controller-button
            -- assert(splash:runjs("$('#date-controller > a:first-child').click()"))
            -- https://github.com/scrapy-plugins/scrapy-splash/issues/27
            -- assert(splash:runjs("$('#result-pager-next-active .simplebutton').click()"))
            -- https://developer.mozilla.org/de/docs/Web/API/Document/querySelectorAll
            -- to look at
            -- https://stackoverflow.com/questions/38043672/splash-lua-script-to-do-multiple-clicks-and-visits
            -- elementList = baseElement.querySelectorAll(selectors)
            -- var domRect = element.getBoundingClientRect();
            -- var rect = obj.getBoundingClientRect();
            -- https://stackoverflow.com/questions/34001917/queryselectorall-with-multiple-conditions
            local get_dimensions = splash:jsfunc([[
                function () {
                    var doc1 = document.querySelectorAll("result-pager-next-active.simplebutton")[0];
                    var el = doc1.documentElement;
                    var rect = el.getClientRects()[0];
                    return {'x': rect.left, 'y': rect.top}
                }
            ]])
            -- splash:set_viewport_full()
            splash:wait(0.1)
            local dimensions = get_dimensions()
            splash:mouse_click(dimensions.x, dimensions.y)
            -- splash:runjs("document.querySelectorAll('result-pager-next-active ,simplebutton')[1].click()")
            -- assert(splash:runjs("$('.result-pager-next-active .simplebutton')[1].click()"))
            -- assert(splash:runjs("$('.simplebutton')[12].click()"))
            splash:wait(1.6)
            return splash:html()
        end
        """
        # Emit every link found on the current results page.
        for teil in response.xpath('//div/div/div/div/a'):
            yield {
                'link': teil.xpath('./@href').extract()
            }
        next_page = response.xpath('//div[@class="v-label v-widget simplebutton v-label-simplebutton v-label-undef-w"]').extract_first()
        # print response.body
        print('----------------------')
        # print response.xpath('//div[@class="v-slot v-slot-simplebutton"]/div[contains(text(), "Weiter")]').extract_first()
        # print response.xpath('//div[@class="v-slot v-slot-simplebutton"]/div[contains(text(), "Weiter")]').extract()
        # class="v-slot v-slot-simplebutton"
        # nextPage = HtmlXPathSelector(response).select("//div[@class='paginationControl']/a[contains(text(),'Link Text Next')]/@href")
        # neue_seite=response.url
        # print response.url
        if next_page is not None:
            # yield SplashRequest(url=neue_seite,
            yield SplashRequest(response.url,
                                callback=self.parse,
                                dont_filter=True,
                                endpoint='execute',
                                args={'lua_source': script3})
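Suggested solution:
You do not always have to use Splash. If the "next" button is a link, you can simply read its href attribute and send a new request for that URL back to the parse function.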
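
The idea in the answer can be sketched without Scrapy or Splash, using only the Python 3 standard library. This is a minimal illustration, assuming the "next" button is an ordinary `<a>` element with an `href` and the visible label "Weiter". The class `NextLinkParser`, the function `find_next_url`, and the sample markup are all hypothetical and are not taken from the question's page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remembers the href of the first <a> whose text contains a label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.next_href = None
        self._current_href = None  # href of the <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            if self.next_href is None and self.label in "".join(self._text):
                self.next_href = self._current_href
            self._current_href = None


def find_next_url(page_url, html, label="Weiter"):
    """Return the absolute URL of the 'next' link, or None if there is none."""
    parser = NextLinkParser(label)
    parser.feed(html)
    if parser.next_href is None:
        return None
    # Relative hrefs are resolved against the URL of the current page.
    return urljoin(page_url, parser.next_href)


# Hypothetical markup, invented for illustration only.
sample = ('<div><a href="/le/?page=1">Zurück</a> '
          '<a href="/le/?page=3">Weiter</a></div>')
print(find_next_url("http://www.lawsearch.gr.ch/le/?page=2", sample))
```

Inside a Scrapy callback the same step is usually written by selecting the `href` with XPath or CSS and then yielding `response.follow(href, callback=self.parse)` or `scrapy.Request(response.urljoin(href), callback=self.parse)`. Note that the XPath in the question targets a `div` rather than an `a`, so whether that page exposes an `href` at all is not confirmed here; if it does not, this shortcut does not apply and the click still has to happen inside Splash.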