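Collecting data from interactive charts - python

I have a website that contains several interactive charts, and I would like to extract the data from them. I have previously written a few web scrapers in Python using selenium webdriver, but this seems to be a different problem. I have looked at several similar questions on Stack Overflow. From those, it seems the solution may be to download the data directly from a JSON file. I looked through the website's source code and identified a few JSON files, but on inspection they do not appear to contain the data.

Does anyone know how to download the data from these charts? I am particularly interested in the following bar chart: .//*[@id='network_download']

Thanks

Edit: I should add that when I inspect the website with Firebug, I can see that it is possible to get the data in the following format. This is obviously not helpful, though, because it does not contain any labels.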

<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;">
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;">
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;">

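Answer

SVG charts like this one are hard to scrape. The numbers you want are not displayed until you actually hover over the individual elements with the mouse.

To get the data, you need to:

1. Find the list of all the dots
2. For each dot in dots_list, click or hover over (action chains) the dot
3. Scrape the values from the tooltip that pops up

This worked for me: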

from __future__ import print_function

from pprint import pprint as pp

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def main():
    driver = webdriver.Chrome()

    try:
        driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")

        dots_css = "div#network_download g g.dots_container circle"
        # find_elements(By.CSS_SELECTOR, ...) replaces the older
        # find_elements_by_css_selector helper, which newer Selenium releases removed
        dots_list = driver.find_elements(By.CSS_SELECTOR, dots_css)

        print("Found {0} data points".format(len(dots_list)))

        download_speeds = list()
        for index, _ in enumerate(dots_list, 1):
            # Because this is an SVG chart, and because we need to hover it,
            # it is very likely that the elements will go stale as we do this. For
            # that reason we need to re-query each dot element right before we click it
            single_dot_css = dots_css + ":nth-child({0})".format(index)
            dot = driver.find_element(By.CSS_SELECTOR, single_dot_css)
            dot.click()

            # Scrape the text from the popup
            popup_css = "div#network_download div.tooltip"
            popup_text = driver.find_element(By.CSS_SELECTOR, popup_css).text
            pp(popup_text)
            rank, comp_and_country, speed = popup_text.split("\n")
            company, country = comp_and_country.split(" in ")
            speed_dict = {
                "rank": rank.split(" Globally")[0].strip("#"),
                "company": company,
                "country": country,
                "speed": speed.split("Download speed: ")[1]
            }
            download_speeds.append(speed_dict)

            # Hover away from the tool tip so it clears. A fresh ActionChains
            # is built each time so earlier queued actions are not replayed.
            hover_elem = driver.find_element(By.ID, "network_download")
            ActionChains(driver).move_to_element(hover_elem).perform()

        pp(download_speeds)

    finally:
        driver.quit()

if __name__ == "__main__":
    main()

Sample output:

(.venv35) ➜  stackoverflow python svg_charts.py
Found 182 data points
'#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps'
'#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps'
'#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps'
'#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps'
'#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps'
'#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps'
'#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps'
'#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps'
'#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps'
'#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps'
'#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps'
'#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps'
'#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps'
<...>
[{'company': 'SingTel',
  'country': 'Singapore',
  'rank': '1',
  'speed': '40 Mbps'},
 {'company': 'StarHub',
  'country': 'Singapore',
  'rank': '2',
  'speed': '39 Mbps'},
 {'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'}
...
]
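
The tooltip parsing is the part most likely to break if the site changes its wording, so it may help to pull it into a function that can be checked without a browser. The following is only a sketch: the name parse_tooltip, the speed_mbps key and the regular expression are my own, and the tooltip layout is assumed to be exactly what the sample output above shows.

```python
import re

# Assumed tooltip layout, taken from the sample output above:
#   "#<rank> Globally\n<company> in <country>\nDownload speed: <n> Mbps"
TOOLTIP_RE = re.compile(
    r"#(?P<rank>\d+) Globally\n"
    r"(?P<company>.+?) in (?P<country>.+)\n"
    r"Download speed: (?P<speed>\d+(?:\.\d+)?) Mbps"
)


def parse_tooltip(text):
    """Return a dict with typed fields, or None if the text does not match."""
    match = TOOLTIP_RE.fullmatch(text.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "company": match.group("company"),
        "country": match.group("country"),
        "speed_mbps": float(match.group("speed")),
    }


print(parse_tooltip("#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps"))
```

Returning None for text that does not match lets the scraping loop skip or log an unexpected tooltip instead of failing on a tuple-unpacking error halfway through the run.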

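Note that the values in the circle elements you quoted in your question are not particularly useful, because they only specify how to draw the dots in the SVG chart.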
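
To make that concrete, here is a small sketch that parses the first three circle tags quoted in the question. The helper name circle_attributes and the regular expressions are mine, not part of the site or of Selenium. The tags are matched with a regex rather than an XML parser because the quoted tags are not closed.

```python
import re

# Markup copied from the question (first three circles only).
svg_fragment = '''
<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
'''

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def circle_attributes(fragment):
    """Yield one attribute dict per <circle ...> tag found in the fragment."""
    for tag in re.findall(r"<circle\b[^>]*>", fragment):
        yield dict(ATTR_RE.findall(tag))


circles = list(circle_attributes(svg_fragment))
for c in circles:
    print(c["fill"], float(c["cx"]), float(c["cy"]))

# Difference in cy between consecutive circles
steps = [float(b["cy"]) - float(a["cy"]) for a, b in zip(circles, circles[1:])]
print(steps)
```

All this recovers is a fill colour and a pixel position per dot. The cy values increase by a constant step, which suggests one row per data point, but no operator name, country or Mbps figure appears in the markup, and that is why the tooltip has to be scraped.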
