两个分隔符之间的Python line.split - python

我有一个包含以下数据的文本文件:

Schema:
  Column Name                   Localized Name                Type    MaxLength
  ----------------------------  ----------------------------  ------  ---------
  Raw                Binary            Binary  16384

Row 1:
  Binary:
-----BEGIN-----
fdsfdsfdasadsad
fsdfafsdafsadfa
fsdafadsfadsfdsa
-----END-----


Row 2:
  Binary:
-----BEGIN-----
fsdfdssd
fdsfadsfasd
fsdafdsa 
-----END-----


Row 3:
  Binary:
-----BEGIN-----
fsdafadsds
fsdafasdsda
fdsafadssad
-----END-----

我需要将“ ----- BEGIN -----”和“ ------ END -----”定界符之间的数据提取到数组中。

这是我尝试过的:

data = open("test_data.txt", 'r')
result = [line.split('-----BEGIN-----') for line in data.readlines()]
print data

但是,这显然会在“ ----- BEGIN -----”定界符之后获取所有数据。

如何添加末端分度符?

注意文件很大,大约1GB。

python大神给出的解决方案

对于之间的多行,您希望将数据分成几部分,只需捕获以----- BEGIN- ..开头的每个块,并不断添加行,直到达到END:

with open("file.txt") as f:
    out = []
    for line in f:
        if line.rstrip() == "-----BEGIN-----":
            tmp = []
            for line in f:
                if line.rstrip() == "-----END-----":
                    out.append(tmp)
                    break
                tmp.append(line)

这些部分将被分成子列表:

 [['fdsfdsfdasadsad\n', 'fsdfafsdafsadfa\n', 'fsdafadsfadsfdsa\n'],   ['fsdfdssd\n', 'fdsfadsfasd\n', 'fsdafdsa \n'], ['fsdafadsds\n', 'fsdafasdsda\n', 'fdsafadssad\n']]

使用with打开文件,除非需要列表,否则不要调用readlines,可以按上述方式遍历文件对象,而无需将所有内容存储在内存中。

或使用itertools.takewhile来获取这些部分:

from itertools import takewhile, imap
with open("file.txt") as f:
    f = imap(str.rstrip,f) # use map for python3
    out = [list(takewhile(lambda x: x != "-----END-----",f)) for line in f if line == "-----BEGIN-----"]
    print(out)

[['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa'], 
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa'], 
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']]

如果只需要列出所有单词,则可以链接:

from itertools import takewhile,chain, imap
with open("file.txt") as f:
    f = imap(str.rstrip,f)
    out = chain.from_iterable(takewhile(lambda x: x != "-----END-----",f) for line in f if line == "-----BEGIN-----")
    print(list(out))

['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa',
 'fsdfdssd', 'fdsfadsfasd', 'fsdafdsa', 'fsdafadsds', 'fsdafasdsda', 'fdsafadssad']

文件对象返回其自己的迭代器,因此每次迭代或调用takewhile消耗行时,takewhile都会继续获取行,直到我们点击-----END----为止,然后继续迭代直到遇到另一条-----BEGIN-----行,如果这些行始终以<开头cc>并且没有其他行可以执行,那么您只需检查该条件即-if line[0] == "-"而不是检查整个行。

如果要处理每个部分,则可以使用生成器表达式并处理每个部分的行:

with open("file.txt") as f:
    f = imap(str.rstrip,f)
    out = ((takewhile(lambda x: x != "-----END-----",f)) for line in f if line == "-----BEGIN-----")
    for sec in out:
        print(list(sec))

['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa']
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa']
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']

如果要单个字符串调用联接:

with open("file.txt") as f:
    f = imap(str.rstrip,f)
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(chain.from_iterable(takewhile(lambda x: x != end,f)
                                      for line in f if line == st))
    print(out)

输出:

fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsafsdfdssdfdsfadsfasdfsdafdsafsdafadsdsfsdafasdsdafdsafadssad

要获取保留x[0] != "-"-----BEGIN-----的单个字符串

with open("out.txt") as f:
    f = imap(str.rstrip,f)
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(["{}{}{}".format(st, "".join(takewhile(lambda x: x != end, f)), end)
                                    for line in f if line == st])

输出:

-----BEGIN-----fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsa-----END----------BEGIN-----fsdfdssdfdsfadsfasdfsdafdsa-----END----------BEGIN-----fsdafadsdsfsdafasdsdafdsafadssad-----END-----