How do I iterate over the HTML links in a table to extract data from each linked page? - python

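I'm trying to work through the table at https://bgp.he.net/report/world. I want to follow each HTML link to its country page, grab the data there, and then move on to the next entry in the list. I'm using Beautiful Soup and can already get the data I want from a single country page, but I can't figure out how to iterate over the column of links.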

from bs4 import BeautifulSoup
import requests
import json


headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}

url = "https://bgp.he.net/country/LC"
html = requests.get(url, headers=headers)

country_ID = (url[-2:])
print("\n")

soup = BeautifulSoup(html.text, 'html.parser')
#print(soup)
data = []
for row in soup.find_all("tr")[1:]: # start from second row
    cells = row.find_all('td')
    data.append({
        'ASN': cells[0].text,
        'Country': country_ID,
        "Name": cells[1].text,
        "Routes V4": cells[3].text,
        "Routes V6": cells[5].text
    })



i = 0

with open ('table_attempt.txt', 'w') as r:
    for item in data:
        r.write(str(data[i]))
        i += 1
        r.write("\n")


print(data)

I'd like to be able to collect the data for every country into a single output file.

Answer

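I only tested this with the first 3 links. I ran into a UnicodeEncodeError along the way, but I fixed it and left a comment in the code at the spot where it occurred.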

from bs4 import BeautifulSoup
import requests
import json

#First get the list of countries urls

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}

url = "https://bgp.he.net/report/world"
html = requests.get(url, headers=headers)

soup = BeautifulSoup(html.text, 'html.parser')

table = soup.find('table', {'id':'table_countries'})
rows = table.find_all('tr')

country_urls = []

# Go through each row and grab the first link. Rows without a link
# (for example the header row) are skipped.
for row in rows:
    anchor = row.find('a', href=True)
    if anchor is not None:
        country_urls.append(anchor['href'])


# Now iterate through that list
for link in country_urls:

    url = "https://bgp.he.net" + link
    html = requests.get(url, headers=headers)

    country_ID = (url[-2:])
    print("\n")

    soup = BeautifulSoup(html.text, 'html.parser')
    #print(soup)
    data = []
    for row in soup.find_all("tr")[1:]: # start from second row
        cells = row.find_all('td')
        data.append({
            'ASN': cells[0].text,
            'Country': country_ID,
            "Name": cells[1].text,
            "Routes V4": cells[3].text,
            "Routes V6": cells[5].text
        })



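    print('Writing from %s' % url)

    # I added encoding="utf-8" because of a UnicodeEncodeError.
    # Mode 'a' appends, so each country is added to the file instead of
    # overwriting the previous one (delete the file before a fresh run).
    with open('table_attempt.txt', 'a', encoding="utf-8") as r:
        for item in data:
            r.write(str(item))
            r.write("\n")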
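
The script above builds each URL by string concatenation, takes the country code from the last two characters of the URL, and writes each row with str(), which is not a format that can be parsed back easily. The following is a minimal sketch of three small helpers that could replace those steps using only the standard library. The helper names (absolute_url, country_code, append_jsonl, read_jsonl) and the .jsonl output file are assumptions for illustration, not part of the original answer.

```python
import json
from urllib.parse import urljoin, urlparse

# Page the relative links were scraped from (taken from the answer above).
BASE_URL = "https://bgp.he.net/report/world"


def absolute_url(href, base=BASE_URL):
    """Resolve a link such as '/country/LC' against the page it was found on."""
    return urljoin(base, href)


def country_code(url):
    """Return the last path segment, e.g. 'LC' from '.../country/LC'."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]


def append_jsonl(path, records):
    """Append one JSON object per line so every country lands in one file."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII names readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load the records back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Inside the country loop this would be used as url = absolute_url(link), country_ID = country_code(url), and append_jsonl('table_attempt.jsonl', data). Because the file is opened in append mode, it should be removed before a fresh run.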
