Summing rows in a scipy.sparse.csr_matrix - python

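I have a large csr_matrix, and I want to add rows together to obtain a new csr_matrix with the same number of columns but fewer rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to the codes associated with them.)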

As a minimal example, here is my matrix:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())

[[1 0 0 0 0]
 [0 0 3 0 0]
 [0 5 0 0 0]
 [4 0 0 0 0]
 [0 0 2 0 0]]

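Now suppose I want a new matrix B in which rows (1, 4) and rows (2, 3, 5), numbered from 1, are each merged by summing them. It would look like this: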

[[5 0 0 0 0]
 [0 5 5 0 0]]

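The result should again be in sparse format (because the real data I am working with is large). I tried summing slices of the matrix and then stacking them: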

idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))

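However, this only gives me the summed values for the non-zero columns in each slice, so I cannot combine the result with the other slices, because the summed slices have different numbers of columns.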

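I feel there must be a simple way to do this, but I could not find any discussion of it online or in the documentation. What am I missing?

Thanks for your help.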

Solution

Note that you can do this by multiplying by a carefully constructed second matrix. Here is how it works for dense matrices:

>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
       [0, 5, 5, 0, 0]])
>>>

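The sparse version is only slightly more complicated. The information about which rows should be summed together is encoded in row: row[i] is the output row that input row i is added to.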

col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
# for csr_matrix, * is the matrix product (equivalent to S @ A)
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
print(result.toarray())

Output:

<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
 [0 5 5 0 0]]

By including higher values in row and extending the shape of S accordingly, you can handle more rows in the output.
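To illustrate that generalization, here is a minimal sketch that builds the selector matrix from a list of group labels and applies it. The function name sum_rows_by_group and the labels argument are hypothetical names introduced for this example, not part of SciPy, and the sketch assumes the labels are integers 0, 1, ..., k-1 with one label per input row.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sum_rows_by_group(A, labels):
    """Sum the rows of sparse matrix A that share a group label.

    labels[i] is the output row that input row i is added to.
    (Hypothetical helper for illustration, not part of SciPy.)
    """
    labels = np.asarray(labels)
    n_groups = int(labels.max()) + 1
    n_rows = A.shape[0]
    # Selector matrix S: S[g, i] = 1 when input row i belongs to group g.
    S = csr_matrix(
        (np.ones(n_rows, dtype=A.dtype), (labels, np.arange(n_rows))),
        shape=(n_groups, n_rows),
    )
    # Sparse matrix product; the result stays sparse.
    return S.dot(A)


row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Three groups this time: rows 0 and 3, rows 1 and 4, and row 2 on its own.
B = sum_rows_by_group(A, [0, 1, 2, 0, 1])
print(B.toarray())
```

If the document codes are arbitrary values such as strings rather than consecutive integers, np.unique(codes, return_inverse=True) is one way to turn them into integer labels of this form.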
