I am trying to run the following PyTorch pedestrian detection example:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
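I am using Ubuntu 18.04. Here is a summary of the steps I have taken:
1) Installed Ubuntu 18.04 on a Lenovo ThinkPad X1 Extreme Gen 2 with a GTX 1650 GPU.
2) Performed a standard CUDA 10.0 / cuDNN 7.4 installation. I'd rather not restate every step since this post is already long enough. It is a standard procedure, and pretty much any link found by googling describes what I followed.
3) Installed torch and torchvision.
4) From this page on the PyTorch site: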
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
I saved the source available from the link at the bottom:
https://pytorch.org/tutorials/_static/tv-training-code.py
into a directory I created, PennFudanExample.
5) I did the following (taken from the top of the notebook linked above):
Install the COCO API into Python:
cd ~
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
Open the Makefile in gedit, change the two instances of "python" to "python3", then:
python3 setup.py build_ext --inplace
sudo python3 setup.py install
Get the files that the script linked above needs in order to run:
cd ~
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.5.0
From ~/vision/references/detection, copy coco_eval.py, coco_utils.py, engine.py, transforms.py and utils.py into the PennFudanExample directory.
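The copy step can also be scripted. A minimal sketch, assuming the torchvision checkout lives at ~/vision and the target directory is PennFudanExample under the current directory (adjust both paths to your setup; `copy_reference_files` is a hypothetical helper name, not part of torchvision):

```python
import shutil
from pathlib import Path

# Helper files the tutorial script imports from references/detection.
REFERENCE_FILES = ["coco_eval.py", "coco_utils.py", "engine.py",
                   "transforms.py", "utils.py"]


def copy_reference_files(src_dir, dst_dir, names=REFERENCE_FILES):
    """Copy the named files from src_dir into dst_dir; return the copied paths."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = src_dir / name
        if not src.is_file():
            raise FileNotFoundError(f"expected {src} in the torchvision checkout")
        copied.append(Path(shutil.copy2(src, dst_dir / name)))
    return copied


if __name__ == "__main__":
    # Assumed locations; change them to match your machine.
    source = Path.home() / "vision" / "references" / "detection"
    if source.is_dir():
        for path in copy_reference_files(source, "PennFudanExample"):
            print("copied", path)
    else:
        print("no checkout found at", source)
```

Raising on a missing file makes it obvious when the checked-out tag does not contain one of the expected helpers.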
6) Downloaded the Penn-Fudan pedestrian dataset from the link on the page above:
https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
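then unzipped it and placed it in the PennFudanExample directory.
7) The only change I made to tv-training-code.py was to reduce the training batch size from 2 to 1 to prevent a GPU out-of-memory crash; see my other post about that here:
PyTorch Object Detection with GPU on Ubuntu 18.04 - RuntimeError: CUDA out of memory. Tried to allocate xx.xx MiB
Here is tv-training-code.py as I am running it, with the batch-size edit I mentioned: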
# Sample code from the TorchVision 0.3 Object Detection Finetuning Tutorial
# http://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
import os
import numpy as np
import torch
from PIL import Image
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from engine import train_one_epoch, evaluate
import utils
import transforms as T


class PennFudanDataset(object):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]
        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.imgs)


def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)
    return model


def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)


def main():
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    # our dataset has two classes only - background and person
    num_classes = 2
    # use our dataset and defined transformations
    dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
    dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))
    # split the dataset in train and test set
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])
    # define training and validation data loaders
    # !!!! CHANGE HERE !!!! For this function call, I changed the batch_size param value from 2 to 1, otherwise this file is exactly as provided from the PyTorch website !!!!
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=1, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=1, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)
    # get the model using our helper function
    model = get_model_instance_segmentation(num_classes)
    # move model to the right device
    model.to(device)
    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    # and a learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)
    # let's train it for 10 epochs
    num_epochs = 10
    for epoch in range(num_epochs):
        # train for one epoch, printing every 10 iterations
        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
        # update the learning rate
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_test, device=device)
    print("That's it!")


if __name__ == "__main__":
    main()
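To make the mask handling in `__getitem__` easier to follow, here is a small standalone sketch of the same NumPy logic applied to a made-up 5x6 mask. The array values are invented for illustration, and only NumPy is needed:

```python
import numpy as np

# Toy instance mask: 0 is background, 1 and 2 are two separate instances.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
])

obj_ids = np.unique(mask)[1:]            # drop the background id 0
masks = mask == obj_ids[:, None, None]   # shape (num_objs, H, W), dtype bool

boxes = []
for m in masks:
    rows, cols = np.where(m)
    # Same [xmin, ymin, xmax, ymax] convention as the dataset class.
    boxes.append([int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())])

print(masks.shape)  # (2, 5, 6)
print(boxes)        # [[1, 1, 2, 2], [4, 2, 5, 4]]
```

The broadcast comparison produces one boolean mask per instance id, and each box is simply the extent of the `True` pixels in that mask.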
Here is the full output, including the error I am currently getting:
Epoch: [0] [ 0/120] eta: 0:01:41 lr: 0.000047 loss: 7.3028 (7.3028) loss_classifier: 1.0316 (1.0316) loss_box_reg: 0.0827 (0.0827) loss_mask: 6.1742 (6.1742) loss_objectness: 0.0097 (0.0097) loss_rpn_box_reg: 0.0046 (0.0046) time: 0.8468 data: 0.0803 max mem: 1067
Epoch: [0] [ 10/120] eta: 0:01:02 lr: 0.000467 loss: 2.0995 (3.5058) loss_classifier: 0.6684 (0.6453) loss_box_reg: 0.0999 (0.1244) loss_mask: 1.2471 (2.7069) loss_objectness: 0.0187 (0.0235) loss_rpn_box_reg: 0.0060 (0.0057) time: 0.5645 data: 0.0089 max mem: 1499
Epoch: [0] [ 20/120] eta: 0:00:56 lr: 0.000886 loss: 1.0166 (2.1789) loss_classifier: 0.2844 (0.4347) loss_box_reg: 0.1631 (0.1540) loss_mask: 0.4710 (1.5562) loss_objectness: 0.0187 (0.0242) loss_rpn_box_reg: 0.0082 (0.0099) time: 0.5524 data: 0.0020 max mem: 1704
Epoch: [0] [ 30/120] eta: 0:00:50 lr: 0.001306 loss: 0.5554 (1.6488) loss_classifier: 0.1258 (0.3350) loss_box_reg: 0.1356 (0.1488) loss_mask: 0.2355 (1.1285) loss_objectness: 0.0142 (0.0224) loss_rpn_box_reg: 0.0127 (0.0142) time: 0.5653 data: 0.0023 max mem: 1756
Epoch: [0] [ 40/120] eta: 0:00:45 lr: 0.001726 loss: 0.4520 (1.3614) loss_classifier: 0.1055 (0.2773) loss_box_reg: 0.1101 (0.1530) loss_mask: 0.1984 (0.8981) loss_objectness: 0.0063 (0.0189) loss_rpn_box_reg: 0.0139 (0.0140) time: 0.5621 data: 0.0023 max mem: 1776
Epoch: [0] [ 50/120] eta: 0:00:39 lr: 0.002146 loss: 0.3448 (1.1635) loss_classifier: 0.0622 (0.2346) loss_box_reg: 0.1004 (0.1438) loss_mask: 0.1650 (0.7547) loss_objectness: 0.0033 (0.0172) loss_rpn_box_reg: 0.0069 (0.0131) time: 0.5535 data: 0.0022 max mem: 1776
Epoch: [0] [ 60/120] eta: 0:00:33 lr: 0.002565 loss: 0.3292 (1.0543) loss_classifier: 0.0549 (0.2101) loss_box_reg: 0.1113 (0.1486) loss_mask: 0.1596 (0.6668) loss_objectness: 0.0017 (0.0148) loss_rpn_box_reg: 0.0082 (0.0140) time: 0.5590 data: 0.0022 max mem: 1776
Epoch: [0] [ 70/120] eta: 0:00:28 lr: 0.002985 loss: 0.4105 (0.9581) loss_classifier: 0.0534 (0.1877) loss_box_reg: 0.1049 (0.1438) loss_mask: 0.1709 (0.5995) loss_objectness: 0.0015 (0.0132) loss_rpn_box_reg: 0.0133 (0.0138) time: 0.5884 data: 0.0023 max mem: 1783
Epoch: [0] [ 80/120] eta: 0:00:22 lr: 0.003405 loss: 0.3080 (0.8817) loss_classifier: 0.0441 (0.1706) loss_box_reg: 0.0875 (0.1343) loss_mask: 0.1960 (0.5510) loss_objectness: 0.0015 (0.0122) loss_rpn_box_reg: 0.0071 (0.0137) time: 0.5812 data: 0.0023 max mem: 1783
Epoch: [0] [ 90/120] eta: 0:00:17 lr: 0.003825 loss: 0.2817 (0.8171) loss_classifier: 0.0397 (0.1570) loss_box_reg: 0.0499 (0.1257) loss_mask: 0.1777 (0.5098) loss_objectness: 0.0008 (0.0111) loss_rpn_box_reg: 0.0068 (0.0136) time: 0.5644 data: 0.0022 max mem: 1794
Epoch: [0] [100/120] eta: 0:00:11 lr: 0.004244 loss: 0.2139 (0.7569) loss_classifier: 0.0310 (0.1446) loss_box_reg: 0.0327 (0.1163) loss_mask: 0.1573 (0.4731) loss_objectness: 0.0003 (0.0101) loss_rpn_box_reg: 0.0050 (0.0128) time: 0.5685 data: 0.0022 max mem: 1794
Epoch: [0] [110/120] eta: 0:00:05 lr: 0.004664 loss: 0.2139 (0.7160) loss_classifier: 0.0325 (0.1358) loss_box_reg: 0.0327 (0.1105) loss_mask: 0.1572 (0.4477) loss_objectness: 0.0003 (0.0093) loss_rpn_box_reg: 0.0047 (0.0128) time: 0.5775 data: 0.0022 max mem: 1794
Epoch: [0] [119/120] eta: 0:00:00 lr: 0.005000 loss: 0.2486 (0.6830) loss_classifier: 0.0330 (0.1282) loss_box_reg: 0.0360 (0.1051) loss_mask: 0.1686 (0.4284) loss_objectness: 0.0003 (0.0086) loss_rpn_box_reg: 0.0074 (0.0125) time: 0.5655 data: 0.0022 max mem: 1794
Epoch: [0] Total time: 0:01:08 (0.5676 s / it)
creating index...
index created!
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cdahms/workspace-apps/PennFudanExample/tv-training-code.py", line 166, in <module>
main()
File "/home/cdahms/workspace-apps/PennFudanExample/tv-training-code.py", line 161, in main
evaluate(model, data_loader_test, device=device)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
return func(*args, **kwargs)
File "/home/cdahms/workspace-apps/PennFudanExample/engine.py", line 80, in evaluate
coco_evaluator = CocoEvaluator(coco, iou_types)
File "/home/cdahms/workspace-apps/PennFudanExample/coco_eval.py", line 28, in __init__
self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 75, in __init__
self.params = Params(iouType=iouType) # parameters
File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 527, in __init__
self.setDetParams()
File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 506, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Process finished with exit code 1
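The last pycocotools frame in the traceback can be reproduced on its own. A minimal sketch, assuming only that NumPy is installed: it builds the same float count and reports whether `np.linspace` accepts it. With the NumPy from the traceback this call raises `TypeError`; older releases are expected to accept the value, possibly with a deprecation warning. `linspace_accepts_float_count` is a hypothetical helper name.

```python
import warnings
import numpy as np

# The count exactly as pycocotools computes it in the traceback: a numpy.float64.
num = np.round((0.95 - .5) / .05) + 1
print(type(num).__name__, num)


def linspace_accepts_float_count(count):
    """Return True if np.linspace accepts `count` as num, False if it raises TypeError."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # older NumPy may only warn here
            np.linspace(.5, 0.95, count, endpoint=True)
        return True
    except TypeError:
        return False


print("numpy", np.__version__,
      "accepts float count:", linspace_accepts_float_count(num))
```

Running this on two machines is a quick way to tell whether they differ in NumPy version rather than in torch or torchvision.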
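What is really strange is that after I resolved the GPU error mentioned above, this worked for about half a day, and now I am getting this error, and I swear I did not change anything.
I have tried uninstalling and reinstalling torch, torchvision and pycocotools. For the copied files coco_eval.py, coco_utils.py, engine.py, transforms.py and utils.py, I tried checking out torchvision v0.5.0, v0.4.2, and the latest commit; all of them produce the same error.
Also, I was working from home yesterday (Christmas) and this error did not occur on my home computer, which also runs Ubuntu 18.04 with an NVIDIA GPU.
When googling this error, one relatively common suggestion is to roll numpy back to 1.11.0, but that version is really old by now, so it would likely cause problems with other packages.
Also from googling, the usual fix seems to be to add a cast to int somewhere or to change a division with / to //, but I am really hesitant to make changes inside pycocotools or, worse still, inside numpy. Since the error did not occur before and does not occur on another computer, I don't think that is a good idea anyway.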
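For reference, the int cast mentioned above would look roughly like this. It is a sketch of the idea on a standalone copy of the expression from the traceback, not a patch to pycocotools itself, and `iou_thresholds` is a hypothetical helper name:

```python
import numpy as np


def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds from start to stop inclusive, using an integer sample count."""
    # int(...) turns the numpy.float64 returned by np.round into a plain Python
    # int, which np.linspace accepts as its `num` argument.
    count = int(np.round((stop - start) / step)) + 1
    return np.linspace(start, stop, count, endpoint=True)


thrs = iou_thresholds()
print(len(thrs), thrs[0], thrs[-1])
```

The values produced are the same as before; only the type of the count changes.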
Fortunately I can comment out the line
evaluate(model, data_loader_test, device=device)
for now, and training will complete, although I don't get the evaluation data (mean average precision, etc.).
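An alternative to commenting the call out is to wrap it so that a failing evaluation is reported without stopping training. A sketch, where `run_safely` is a hypothetical helper and `broken_evaluate` is only a stand-in for the real `evaluate` call:

```python
import traceback


def run_safely(fn, *args, **kwargs):
    """Call fn(*args, **kwargs); on any exception print the traceback and return None."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        traceback.print_exc()
        return None


def broken_evaluate():
    # Stand-in for evaluate(model, data_loader_test, device=device).
    raise TypeError("object of type <class 'numpy.float64'> cannot be safely "
                    "interpreted as an integer.")


result = run_safely(broken_evaluate)
print("evaluation result:", result)  # the training loop would continue from here
```

In the training loop this would read `run_safely(evaluate, model, data_loader_test, device=device)`. It hides nothing, since the traceback is still printed each epoch.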
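The only thing I can think of at this point is to format the HD and reinstall Ubuntu 18.04 and everything else, but that would take at least a day, and if this happens again I would really like to know what might be causing it.
Ideas? Suggestions? Anything else I should check?
-- Edit --
After re-testing on the same computer that has the problem, I found that the same error also occurs in the evaluation step when using the TensorFlow Object Detection API.
Answer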
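I finally figured this out after about 15 hours on it. It turns out that numpy 1.18.0, released 5 days ago as of this writing, breaks the evaluation step for both TensorFlow and PyTorch object detection. Long story short, the fix is: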
sudo -H pip3 install numpy==1.17.4
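To confirm which NumPy the interpreter actually picks up after the downgrade, a quick check like the following can help. It is a sketch that assumes `X.Y.Z`-style version strings, and `version_tuple` is a hypothetical helper:

```python
import numpy as np


def version_tuple(version):
    """Turn '1.17.4' (or '1.18.0rc1') into a comparable tuple of leading integers."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


installed = version_tuple(np.__version__)
print("numpy", np.__version__, "loaded from", np.__file__)
if installed >= (1, 18, 0):
    print("numpy is 1.18.0 or newer; an unpatched pycocotools "
          "is expected to fail during evaluation")
else:
    print("numpy predates 1.18.0")
```

Printing `np.__file__` also shows whether the import comes from /usr/local/lib or from a per-user install, which matters when several copies of numpy are present.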
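A few other things I can mention:
- numpy 1.17.4 was released on November 10, 2019, so it should still be good for quite a while
- there is now a pip package for pycocotools, so instead of the clone-and-build process above you can simply do: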
sudo -H pip3 install pycocotools
-- Update --
This has now been fixed in pycocotools with this commit:
https://github.com/cocodataset/cocoapi/pull/354
See also this (closed) issue for more background:
https://github.com/numpy/numpy/issues/15192
I am not sure when the updated version of pycocotools will make it into the pycocotools pip3 package.