用于CUDA的Numba.vectorize:返回数组的正确签名是什么？ - python

我具有以下结构的功能，

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((len(X),N))
    out[:,0]  = X 
    for i in range(1,N):
        out[:,i] = X**i+out[:,i-1] 
    return out

我现在正尝试在我的GPU上运行。
到目前为止，我尝试以非矢量形式编写函数（即分别对待X的每个条目），并将return数组作为输入传递：

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

但是，我不知道该功能使用什么装饰器。如果我用

@numba.vectorize([(float64,int64,float64[:])],target = 'cuda')我得到TypeError: Buffer dtype cannot be buffer
@numba.guvectorize([(float64,int64,float64[:])],'(),()->(n)',target = 'cuda')我得到NameError: undefined output symbols: n

用于我的目的的正确装饰器是什么？

我希望能够以与foo_cuda结束时大致相同的方式调用foo，即传递一个1D数组X，一个整数N和一个2D数组out，其中充满了结果。

更新

我函数的numpy.vectorize版本是

def foo_np(x,N):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out = np.zeros(N)
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]
    return out
foo_ve = np.vectorize(foo_np,signature='(),()->(n)')

但是，我无法在numba中创建输出数组（out = np.zeros(N)）（cuda.local.array(N,dtype=float64)也会失败），这使我无法使用@numba.vectorize('void(float64,int64)',target='cuda')。我试图通过将输出数组传递给函数并将其添加到签名中来解决此问题（请参见上面的尝试1），但是出现了错误。

更新2

实际功能如下：

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2] 
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out

参考方案

我对此进行了深入研究，所以我将分享我所拥有的。我不确定这是否是完整的答案，但它可能会解决您的一些问题。

为此，最好的基于numba的方法是使用numba CUDA（jit）编写自己的“自定义” CUDA内核。这方面的一个示例是here表示归约或here表示矩阵乘法。要正确地执行此操作，需要学习有关CUDA编程的知识。但是，这似乎并不是您要遵循的方向。

另外，您的问题是着重于使用numba向量化来生成GPU代码。 numba vectorize装饰器用于对标量输入和输出进行操作的函数，并且向量化将其应用于矩阵输入/矩阵输出。

对于不适合此功能的功能，例如在矢量或标量上运行但产生矢量的对象，或在一个或两个矢量上操作并产生矢量或标量输出的对象，numba提供了广义的guvectorize。

从最简单的示例开始，我们可以通过guvectorize来实现：

到目前为止，我尝试以非矢量形式编写函数（即分别对待X的每个条目），并将return数组作为输入传递：

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

这种获取标量（每个函数调用）并返回向量（每个函数调用）的意图在guvectorize的能力范围内（有一些限制/注意事项-请参阅底部的注释）。

这是一个从示例代码here派生的示例：

# cat t2.py
from __future__ import print_function

import sys

import numpy as np

from numba import guvectorize, cuda

if sys.version_info[0] == 2:
    range = xrange


#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
        tmp1 = 0.
        tmp = inp[0]
        for i in range(out.shape[0]):
            tmp1 += tmp
            out[i] = tmp1
            tmp *= inp[0]

# set up input data
rows = 1280000 # shape[0]
cols = 4   # shape[1]
inp = np.zeros(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,0] = (i%4)+1
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('result'.center(80, '-'))
print(inp)

# nvprof --print-gpu-trace python t2.py
==4773== NVPROF is profiling process 4773, command: python t2.py
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4773== Profiling application: python t2.py
==4773== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
994.08ms  5.5731ms                    -               -         -         -         -  19.531MB  3.4224GB/s    Pageable      Device  Tesla P100-PCIE         1         7  [CUDA memcpy HtoD]
1.00083s  159.20us          (20000 1 1)        (64 1 1)        22        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>) [48]
1.00100s  4.8017ms                    -               -         -         -         -  19.531MB  3.9722GB/s      Device    Pageable  Tesla P100-PCIE         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

# nvprof --metrics gst_efficiency python t2.py
==4787== NVPROF is profiling process 4787, command: python t2.py
==4787== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)" (done)
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4787== Profiling application: python t2.py
==4787== Profiling result:
==4787== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)
          1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%
#

要在guvectorize上下文中回答您的特定问题：

用于我的目的的正确装饰器是什么？

函数类型规范如下所示：

['<return-type>(<parameter 0 type>, <parameter 1 type>, ...)']

对于guvectorize，返回类型应始终为void。参数的类型应与要使用该函数的类型相匹配。对于guvectorize，我们将类型视为我们将传递的实际输入和输出数据类型的“切片”，“切片”是单个函数调用将要操作的内容。然后，矢量化会将单独的函数调用应用于输入/输出数据的每个“切片”，以覆盖输入/输出数据集的整个大小。以我的示例为例，我建议传递一个输入向量（float32[:]）和一个输出向量（float32[:]）。

函数签名显示输入的尺寸，然后显示输出的尺寸：

(x)...->(x)

每个参数都可以是多维的（尽管对于矢量化，它仍应仅代表输入/输出的“切片”），并且标量可以由()表示。因为我们希望输出“ slice”是一定长度的结果向量，例如n，所以这里会出现皱纹。 numba guvectorize似乎不允许指定不属于已指定输入尺寸的输出尺寸（n）。因此，尽管此函数仅真正需要标量输入，但我选择通过传递矢量“切片”用于输入和输出来解决此问题。实际上，按照我编写该函数的方式，该函数可以将相同的数据用于输入和输出，因此对于这种特殊情况，此解决方法并没有真正的“开销”。

有关实现/性能的一些说明：

这在第一个数组（shape[0]）维度上“并行化”。所有计算都在GPU上执行。用于执行输出的每个矢量切片的计算由GPU上的单个线程执行。但是对于大数据集（第一维），这将为GPU暴露大量并行工作（线程）。第二个维度（即循环）的工作是在每个线程的上下文中进行的。使用vectorize或guvectorize几乎肯定不可能尝试在第二维上进行并行化（例如创建前缀总和），但是使用numba cuda（jit）则可以尝试进行并行化。我们可以通过研究以上工作示例中的（第一个）nvprof --print-gpu-trace输出，并注意到它构成了20000个块，每个块包含64个线程，总共1280000个线程，与我们的第一个数组维匹配，从而确定了并行化维度。
如前所述，这种实现有些棘手，因为即使我只需要并使用标量，我也要传递向量输入。这是为了解决numba中似乎存在的局限性，据我所知，您无法指定像()->(n)这样的尺寸签名。（注意：经过进一步的研究，我认为正确的做法是定义仅输入的ufunc，将输入向量/矩阵作为单个参数而不是两次传递，并仅使用(n)作为维数签名，而不是(n)->(n)。请参见here）。
从存储器访问模式的角度来看，此实现不是最佳的。从（第二个）nvprof --metrics gst_efficiency输出中可以明显看出这一点。我认为这样做的原因也可能是numba设计的限制（当前）。在本示例中，当通过numba呈现数组以通过guvectorize进行分配/并行化时，它将数组切成几行，并将每一行传递给函数调用以进行处理。对于此特定示例，一种更有效的实现方式是转置我们的数组存储系统，并将数组切成多个列，然后让每个函数调用在一个列上工作。这样做的原因与GPU的行为和设计有关，在此不做评论。但是，该指标表明，使用这种每线程行访问模式只能实现25％的效率。使用转置数据和每线程列方法很容易实现100％的效率，但是我不知道如何使用numba guvectorize来实现这一点。我知道解决无效访问的唯一选择是恢复为编写numba CUDA内核（即CUDA jit）。（我尝试研究的另一种选择是看看我们是否可以指示ufunc在其签名中采用列向量，例如void(float32[[:]],float32[[:]])之类的东西，希望numba可以按列对事物进行切片，但是对此没有运气。）

使用前面的示例作为模型，我们可以使用guvectorize创建类似的内容来解决Update 2中的“实际功能”。我将这里的数组大小减小了5（= len(X)），减小了4（= N）：

# cat t3.py
from __future__ import print_function

import sys
import numpy as np
import numba

from numba import guvectorize, cuda
import math

if sys.version_info[0] == 2:
    range = xrange

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2]
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out

#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
        for i in range(2,out.shape[0]):
            out[i] = out[1]*out[i-1] - (i-1)*out[i-2]
        c = 1.
        for i in range(2,out.shape[0]):
            c *= i
            out[i] /= math.sqrt(c)

# set up input data
rows = 5   # shape[0]
cols = 4   # shape[1]
inp = np.ones(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,1] = i
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('gpu result'.center(80, '-'))
print(inp)

rrows = rows
rcols = cols
rin = np.zeros(rrows, dtype=np.float32)
for i in range(rin.shape[0]):
    rin[i] = i
rout = foo(rin, rcols)
print('cpu result'.center(80, '-'))
print(rout)


# python t3.py
-----------------------------------gpu result-----------------------------------
[[ 1.          0.         -0.70710677 -0.        ]
 [ 1.          1.          0.         -0.8164966 ]
 [ 1.          2.          2.1213202   0.8164966 ]
 [ 1.          3.          5.656854    7.3484693 ]
 [ 1.          4.         10.606602   21.22891   ]]
-----------------------------------cpu result-----------------------------------
[[ 1.          0.         -0.70710678 -0.        ]
 [ 1.          1.          0.         -0.81649658]
 [ 1.          2.          2.12132034  0.81649658]
 [ 1.          3.          5.65685425  7.34846923]
 [ 1.          4.         10.60660172 21.2289111 ]]
#

This answer对与[cc]和vectorize有关的“切片”的概念进行了附加讨论，并列举了许多示例。

在返回'Response'(Python)中传递多个参数 - python

我在Angular工作，正在使用Http请求和响应。是否可以在“响应”中发送多个参数。角度文件：this.http.get("api/agent/applicationaware").subscribe((data:any)... python文件：def get(request): ... return Response(seriali…

Python exchangelib在子文件夹中读取邮件 - python

我想从Outlook邮箱的子文件夹中读取邮件。Inbox ├──myfolder 我可以使用account.inbox.all()阅读收件箱，但我想阅读myfolder中的邮件我尝试了此页面folder部分中的内容，但无法正确完成https://pypi.python.org/pypi/exchangelib/ 参考方案您需要首先掌握Folder的myfo…

如何修复AttributeError:模块'numpy'没有属性'square' - python

Improve this question 我已经将numpy更新为1.14.0。我使用Windows10。我尝试运行我的代码，但出现此错误： AttributeError：模块“ numpy”没有属性“ square”这是我的进口商品：%matplotlib inline import matplotlib.pyplot as plt import ten…

R'relaimpo'软件包的Python端口 - python

我需要计算Lindeman-Merenda-Gold（LMG）分数，以进行回归分析。我发现R语言的relaimpo包下有该文件。不幸的是，我对R没有任何经验。我检查了互联网，但找不到。这个程序包有python端口吗？如果不存在，是否可以通过python使用该包？ python参考方案最近，我遇到了pingouin库。

AttributeError:'AnonymousUserMixin'对象没有属性'can' - python

烧瓶学习问题为了定制对匿名用户的要求，我在模型中设置了一个类： class MyAnonymousUser(AnonymousUserMixin): def can(self, permissions): return False def is_administrator(self): return False login_manager.anonymous…

用于CUDA的Numba.vectorize:返回数组的正确签名是什么？ - python

腾讯的同事天天给我安利让我看《三体》，说马化腾和雷军也在…