对大型XML文件使用Python Iterparse - python

我需要用Python编写一个解析器，该解析器可以在没有太多内存(只有2 GB)的计算机上处理一些非常大的文件(> 2 GB)。我想在lxml中使用iterparse做到这一点。

我的文件格式为:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

到目前为止，我的解决方案是:

from lxml import etree

context = etree.iterparse( MYFILE, tag='item' )

for event, elem in context :
      print elem.xpath( 'description/text( )' )

del context

但是，不幸的是，此解决方案仍在消耗大量内存。我认为问题在于，在与每个“ITEM”打交道之后，我需要做一些清理空孩子的事情。在处理完数据以进行适当清理之后，谁能提出一些建议以解决我的问题？

参考方案

尝试Liza Daly's fast_iter。在处理了elem元素之后，它调用elem.clear()除去后代，也除去前面的 sibling 。

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    print elem.xpath( 'description/text( )' )

context = etree.iterparse( MYFILE, tag='item' )
fast_iter(context,process_element)

Daly的文章非常不错，特别是在处理大型XML文件时。

编辑:上面发布的fast_iter是Daly的fast_iter的修改版本。处理完一个元素后，它会更主动地删除不再需要的其他元素。

下面的脚本显示了行为上的差异。特别要注意的是，orig_fast_iter不会删除A1元素，而mod_fast_iter确实会删除它，从而节省了更多内存。

import lxml.etree as ET
import textwrap
import io

def setup_ABC():
    content = textwrap.dedent('''\
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
        ''')
    return content


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2

study_fast_iter()

如何使用BeautifulSoup在<tr>中捕获特定的<td> - python

尝试从nyc Wiki页面中的高中列表中获取所有高中名称。我已经写了足够多的脚本，可以让我获取包含在高中，学业和入学条件列表的表的<tr>标记中的所有信息-但是我如何才能缩小到我认为的范围内在td[0]内休息（会弹出KeyError）-只是学校的名称？到目前为止我写的代码：from bs4 import BeautifulSoup from ur…

Python numpy数据指针地址无需更改即可更改 - python

编辑经过一些摆弄之后，到目前为止，我已经隔离了以下状态：一维数组在直接输入变量时提供两个不同的地址，而在使用print()时仅提供一个地址2D数组（或矩阵）在直接输入变量时提供三个不同的地址，在使用print()时提供两个地址3D数组在直接输入变量时提供两个不同的地址，而在使用print()时仅给出一个（显然与一维数组相同）像这样：>>> …

Python Pandas导出数据 - python

我正在使用python pandas处理一些数据。我已使用以下代码将数据导出到excel文件。writer = pd.ExcelWriter('Data.xlsx'); wrong_data.to_excel(writer,"Names which are wrong", index = False); writer.…

Python pytz时区函数返回的时区为9分钟 - python

由于某些原因，我无法从以下代码中找出原因：>>> from pytz import timezone >>> timezone('America/Chicago') 我得到：<DstTzInfo 'America/Chicago' LMT-1 day, 18:09:00 STD…

Python 3运算符>>打印到文件 - python

我有以下Python代码编写项目的依赖文件。它可以在Python 2.x上正常工作，但是在使用Python 3进行测试时会报告错误。depend = None if not nmake: depend = open(".depend", "a") dependmak = open(".depend.mak&#…

对大型XML文件使用Python Iterparse - python

腾讯的同事天天给我安利让我看《三体》，说马化腾和雷军也在…