确定网站是否为网上商店 - python

我正在尝试从网站列表中确定一个网站是否是一个网上商店。

似乎大多数网上商店都具有：

在其a中带有单词“ cart”的href标记
在类名称中分配了带有单词“ cart”的类的li标记

我相信我将不得不利用正则表达式，然后告诉BeautifulSoup find方法在a或li标记中搜索HTML数据以查找此正则表达式。我应该怎么做？

到目前为止，下面的代码在HTML数据中搜索具有a完全相同购物车的href标记。

码

import re
from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

driver = webdriver.Chrome('chromedriver')
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument('log-level=3')

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        cart = re.compile('.*cart.*', re.IGNORECASE)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        if soup.find('a', href=cart):
            shops.append(url)

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)

输出：

SHOPS FOUND:
https://www.nike.com/
https://www.amazon.com/

参考方案

您可以将contains *运算符与CSS属性选择器一起使用，以指定类属性或href属性包含子字符串cart。将两个（类和href）与Or语法结合使用。待办事项：您可能考虑添加等待条件，以确保首先显示所有li和a标记元素。

from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

driver = webdriver.Chrome('chromedriver')
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument('log-level=3')

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        items = soup.select('a[href*=cart], li[class*=cart]')
        if len(items) > 0:
                shops.append(url)
print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)

R'relaimpo'软件包的Python端口 - python

我需要计算Lindeman-Merenda-Gold（LMG）分数，以进行回归分析。我发现R语言的relaimpo包下有该文件。不幸的是，我对R没有任何经验。我检查了互联网，但找不到。这个程序包有python端口吗？如果不存在，是否可以通过python使用该包？ python参考方案最近，我遇到了pingouin库。

Python:传递记录器是个好主意吗？ - python

我的Web服务器的API日志如下：started started succeeded failed 那是同时收到的两个请求。很难说哪一个成功或失败。为了彼此分离请求，我为每个请求创建了一个随机数，并将其用作记录器的名称logger = logging.getLogger(random_number) 日志变成[111] started [222] start…

Python-Excel导出 - python

我有以下代码：import pandas as pd import requests from bs4 import BeautifulSoup res = requests.get("https://www.bankier.pl/gielda/notowania/akcje") soup = BeautifulSoup(res.cont…

Matplotlib'粗体'字体 - python

跟随this example：import numpy as np import matplotlib.pyplot as plt fig = plt.figure() for i, label in enumerate(('A', 'B', 'C', 'D')): ax = f…

Python:如何根据另一列元素明智地查找一列中的空单元格计数？ - python

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice','Jane', 'Alice','Bob', 'Alice'], 'income…

腾讯的同事天天给我安利让我看《三体》，说马化腾和雷军也在…

腾讯的同事天天给我安利让我看《三体》，说马化腾和雷军也在看。自己强行看了两个月，全部给看完了。感觉这文笔也就我读初中的水平……而且写着国内的一些情况，外国人能理解吗？这书为什么会这么火？这水平我也可以去写呀[笑哭][笑哭][笑哭] 招商银行员工：可以写赶紧写一个啊，能拿科幻文学雨果奖。包清白：哦楼主：pei ！tui ！你也配姓龙楼主：@赵龙王呵呵 […]