Using regex together with BeautifulSoup to parse strings in Python - python

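I have a series of strings like "Saturday, December 27th 2014". I want to throw away "Saturday" and save a file named "141227", i.e. year + month + day. Everything else works so far, but I can't get the regular expressions for daypos or yearpos to work. Both give the same error: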

Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object

What is a character buffer object? Does that mean something is wrong with my expression? Here is my script:

for i in xrange(3, 1, -1):
       page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
       soup = BeautifulSoup(page.read())
       snippet = soup.find_all('div', attrs={'class': 'blog-box'})
       for div in snippet:
           byline =  div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
           text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

           monthpos = byline.find(",")
           daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
           yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
           endpos = monthpos + len(byline)

           month = byline[monthpos+1:daypos]
           day = byline[daypos+0:yearpos]
           year = byline[yearpos+2:endpos]

           output_files_pathname = 'Data/'  # path where output will go
           new_filename = year + month + day + ".txt"
           outfile = open(output_files_pathname + new_filename,'w')
           outfile.write(date)
           outfile.write("\n")
           outfile.write(text)
           outfile.close()
       print "finished another url from page {}".format(i)

I haven't figured out how to turn December into 12 yet, but that is for another time. Please help me find the right positions.

Solution

Instead of parsing the date string with regular expressions, parse it with dateutil:

from dateutil.parser import parse

for div in soup.select('div.blog-box'):
    byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
    text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

    dt = parse(byline)
    # %y%m%d gives the zero-padded yymmdd name the question asks for, e.g. 141227.txt
    new_filename = "{:%y%m%d}.txt".format(dt)
    ...
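
On the target name 141227 and the open question of turning December into 12: the strftime() directives %y, %m and %d produce zero-padded two-digit fields, whereas plain attribute access such as dt.month yields an unpadded integer. A minimal sketch, using only the standard library; the helper name dated_filename and the hand-built datetime values are illustrative assumptions, standing in for whatever the parser returns:

```python
from datetime import datetime


def dated_filename(dt, extension=".txt"):
    """Build a yymmdd-style name such as '141227.txt' from a datetime."""
    # %y = two-digit year, %m = zero-padded month number, %d = zero-padded day
    return dt.strftime("%y%m%d") + extension


# Hand-built values standing in for the result of parsing a byline
print(dated_filename(datetime(2014, 12, 27)))  # 141227.txt
print(dated_filename(datetime(2014, 1, 5)))    # 140105.txt
```

Zero padding also keeps the names unambiguous and sortable: without it, January 15 and November 5 of the same year would both collapse to the same digits.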

Alternatively, you can parse the string with datetime.strptime(), but you need to take care of the ordinal suffixes first:

byline = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)
dt = datetime.strptime(byline, '%A, %B %d %Y')

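Here re.sub() finds an st, nd, rd or th that comes directly after a digit and replaces that suffix with an empty string. After that, the date string matches the '%A, %B %d %Y' format; see: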

strftime() and strptime() Behavior
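
The strptime() route can be sketched as a self-contained function. This assumes every byline has the shape shown in the question ("Saturday, December 27th 2014") and that the process runs under an English or C locale, since %A and %B match locale-specific names. The function name parse_byline and the second sample string are illustrative assumptions:

```python
import re
from datetime import datetime

# Ordinal suffix directly after a digit: 1st, 2nd, 3rd, 27th, ...
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)")


def parse_byline(byline):
    """Parse a byline such as 'Saturday, December 27th 2014' into a datetime."""
    cleaned = ORDINAL_SUFFIX.sub("", byline.strip())
    # %A and %B match English day/month names only under an English (or C) locale
    return datetime.strptime(cleaned, "%A, %B %d %Y")


print(parse_byline("Saturday, December 27th 2014"))    # 2014-12-27 00:00:00
print(parse_byline("Friday, August 1st 2014").date())  # 2014-08-01
```

The lookbehind (?<=\d) matters: without it, the "st" inside "August" would be stripped as well.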

Some additional notes:

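You can pass the result of urlopen() directly to the BeautifulSoup constructor
Instead of find_all() by class name, use the CSS selector div.blog-box
To join filesystem paths, use os.path.join()
Use the with context manager when working with files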

Fixed version:

import os
import urllib2

from bs4 import BeautifulSoup
from dateutil.parser import parse


for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page)

    for div in soup.select('div.blog-box'):
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')

        dt = parse(byline)

        # %y%m%d gives the zero-padded yymmdd name the question asks for, e.g. 141227.txt
        new_filename = "{:%y%m%d}.txt".format(dt)
        with open(os.path.join('Data', new_filename), 'w') as outfile:
            outfile.write(byline)
            outfile.write("\n")
            outfile.write(text)

    print "finished another url from page {}".format(i)
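
As for the TypeError in the question: str.find() only accepts a plain substring, so handing it a compiled pattern fails before the expression is even evaluated (the exact wording of the message differs between Python versions). To get a position from a pattern, search with the regex and read the position off the match object. A minimal sketch, where the day pattern and the variable names are illustrative assumptions rather than part of the original script:

```python
import re

byline = "Saturday, December 27th 2014"  # example string from the question

# str.find() takes a plain substring and returns its index (or -1)
comma_pos = byline.find(",")

# Passing a compiled pattern to str.find() raises TypeError
try:
    byline.find(re.compile(r"[A-Z][a-z]*\s"))
    find_accepts_pattern = True
except TypeError:
    find_accepts_pattern = False

# To locate text by pattern, search with the regex and read the match position
match = re.search(r"\d{1,2}(?=st|nd|rd|th)", byline)
day_pos = match.start() if match else -1
day_text = match.group() if match else ""

print(comma_pos, find_accepts_pattern, day_pos, day_text)
```

So the expression itself was not the problem; it was passed to a method that does not understand regular expressions.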
