Scraping a website that uses AJAX pagination with BeautifulSoup in Python - python

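I'm still very new to coding and Python, so I apologize if this is a stupid question. I'd like a script that goes through all 19,000 search result pages and scrapes every URL from each page. I have the scraping itself working, but I can't figure out how to deal with the fact that the page uses AJAX for pagination. Normally I would just loop over the page URLs to capture each page of search results, but that isn't possible here. This is the page: http://www.heritage.org/research/all-research.aspx?nomobile&categories=report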

Here is the script I have so far:

import io
import urllib.request

from bs4 import BeautifulSoup

with io.open('heritageURLs.txt', 'a', encoding='utf8') as logfile:
    page = urllib.request.urlopen("http://www.heritage.org/research/all-research.aspx?nomobile&categories=report")
    soup = BeautifulSoup(page, 'html.parser')
    # every result link is an <a> tag with the class "item-title"
    snippet = soup.find_all('a', class_='item-title')
    for a in snippet:
        logfile.write("http://www.heritage.org" + a.get('href') + "\n")

print("Done collecting urls")

Obviously, it only scrapes the first page of results and nothing more.

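I have looked at a few related questions, but none of them seem to use Python, or at least not in a way I can understand. Thank you in advance for your help.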

Answer

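For the sake of completeness: you could try to inspect the POST request the page sends and find a way to request the next page directly, as I suggested in my comment. If an alternative is acceptable, though, Selenium makes it quite easy to achieve what you want.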
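For reference, the POST-based route could start from something like the sketch below. It rests on an assumption that has not been verified against this site: that the pager is an ASP.NET WebForms postback, where the browser re-sends the page's hidden form fields (such as `__VIEWSTATE`) together with an `__EVENTTARGET` naming the control that was clicked. The helper names `HiddenFieldParser` and `build_postback_body` are made up for illustration, and the real event target has to be read from the request shown in the browser's network tab.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback_body(html, event_target, event_argument=""):
    """Return a URL-encoded POST body that replays the page's hidden fields.

    ASSUMPTION: the site paginates through a WebForms-style postback.
    event_target is hypothetical and must be copied from the real request.
    """
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)
```

The returned string would be sent as the body of a POST to the same URL, and the response parsed with BeautifulSoup exactly like the first page. If the site's request looks different from this, the Selenium approach is the safer choice.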

Here is a simple solution using Selenium:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep

# use this line for the Firefox web browser
driver = webdriver.Firefox()

# uncomment if using PhantomJS instead
#driver = webdriver.PhantomJS()

url = 'http://www.heritage.org/research/all-research.aspx?nomobile&categories=report'
driver.get(url)

# set initial page count
pages = 1
with open('heritageURLs.txt', 'w') as f:
    while True:
        # sleep here to allow time for the page (or the AJAX update) to load
        sleep(5)
        # find every item-title link on the current page and write its href to the file
        links = driver.find_elements(By.CLASS_NAME, 'item-title')
        print("Page: {} -- {} urls to write...".format(pages, len(links)))
        for link in links:
            f.write(link.get_attribute('href') + '\n')
        # grab the Next button; find_element raises if it does not exist
        try:
            btn_next = driver.find_element(By.CLASS_NAME, 'next')
        except NoSuchElementException:
            # no Next button means this was the last page
            print("crawling completed.")
            break
        # otherwise click the Next button and repeat for the next page
        pages += 1
        btn_next.send_keys(Keys.RETURN)

driver.quit()
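
The fixed five-second sleep in the loop above is the simplest way to wait for the AJAX update, but with roughly 19,000 pages it adds up to about 26 hours of waiting. A minimal sketch of a polling alternative follows. `wait_until` is a hypothetical helper written for this example, not part of Selenium; Selenium ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(poll)
```

In the crawl loop, the condition could be a function that returns the list of `item-title` links once the first link's `href` differs from the one seen on the previous page. Checking only that links are present is not enough, since the old links stay on screen until the AJAX response replaces them.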

Hope this helps.
