我有一个包含以下数据的文本文件:
Schema:
Column Name Localized Name Type MaxLength
---------------------------- ---------------------------- ------ ---------
Raw Binary Binary 16384
Row 1:
Binary:
-----BEGIN-----
fdsfdsfdasadsad
fsdfafsdafsadfa
fsdafadsfadsfdsa
-----END-----
Row 2:
Binary:
-----BEGIN-----
fsdfdssd
fdsfadsfasd
fsdafdsa
-----END-----
Row 3:
Binary:
-----BEGIN-----
fsdafadsds
fsdafasdsda
fdsafadssad
-----END-----
我需要将“ ----- BEGIN -----”和“ ------ END -----”定界符之间的数据提取到数组中。
这是我尝试过的:
data = open("test_data.txt", 'r')
result = [line.split('-----BEGIN-----') for line in data.readlines()]
print data
但是,这显然会在“ ----- BEGIN -----”定界符之后获取所有数据。
如何添加末端分度符?
注意文件很大,大约1GB。
python大神给出的解决方案
对于之间的多行,您希望将数据分成几部分,只需捕获以----- BEGIN- ..开头的每个块,并不断添加行,直到达到END
:
with open("file.txt") as f:
out = []
for line in f:
if line.rstrip() == "-----BEGIN-----":
tmp = []
for line in f:
if line.rstrip() == "-----END-----":
out.append(tmp)
break
tmp.append(line)
这些部分将被分成子列表:
[['fdsfdsfdasadsad\n', 'fsdfafsdafsadfa\n', 'fsdafadsfadsfdsa\n'], ['fsdfdssd\n', 'fdsfadsfasd\n', 'fsdafdsa \n'], ['fsdafadsds\n', 'fsdafasdsda\n', 'fdsafadssad\n']]
使用with
打开文件,除非需要列表,否则不要调用readlines,可以按上述方式遍历文件对象,而无需将所有内容存储在内存中。
或使用itertools.takewhile
来获取这些部分:
from itertools import takewhile, imap
with open("file.txt") as f:
f = imap(str.rstrip,f) # use map for python3
out = [list(takewhile(lambda x: x != "-----END-----",f)) for line in f if line == "-----BEGIN-----"]
print(out)
[['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa'],
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa'],
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']]
如果只需要列出所有单词,则可以链接:
from itertools import takewhile,chain, imap
with open("file.txt") as f:
f = imap(str.rstrip,f)
out = chain.from_iterable(takewhile(lambda x: x != "-----END-----",f) for line in f if line == "-----BEGIN-----")
print(list(out))
['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa',
'fsdfdssd', 'fdsfadsfasd', 'fsdafdsa', 'fsdafadsds', 'fsdafasdsda', 'fdsafadssad']
文件对象返回其自己的迭代器,因此每次迭代或调用takewhile消耗行时,takewhile都会继续获取行,直到我们点击-----END----
为止,然后继续迭代直到遇到另一条-----BEGIN-----
行,如果这些行始终以<开头cc>并且没有其他行可以执行,那么您只需检查该条件即-
和if line[0] == "-"
而不是检查整个行。
如果要处理每个部分,则可以使用生成器表达式并处理每个部分的行:
with open("file.txt") as f:
f = imap(str.rstrip,f)
out = ((takewhile(lambda x: x != "-----END-----",f)) for line in f if line == "-----BEGIN-----")
for sec in out:
print(list(sec))
['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa']
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa']
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']
如果要单个字符串调用联接:
with open("file.txt") as f:
f = imap(str.rstrip,f)
st, end = "-----BEGIN-----", "-----END-----"
out = "".join(chain.from_iterable(takewhile(lambda x: x != end,f)
for line in f if line == st))
print(out)
输出:
fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsafsdfdssdfdsfadsfasdfsdafdsafsdafadsdsfsdafasdsdafdsafadssad
要获取保留x[0] != "-"
和-----BEGIN-----
的单个字符串
with open("out.txt") as f:
f = imap(str.rstrip,f)
st, end = "-----BEGIN-----", "-----END-----"
out = "".join(["{}{}{}".format(st, "".join(takewhile(lambda x: x != end, f)), end)
for line in f if line == st])
输出:
-----BEGIN-----fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsa-----END----------BEGIN-----fsdfdssdfdsfadsfasdfsdafdsa-----END----------BEGIN-----fsdafadsdsfsdafasdsdafdsafadssad-----END-----