GridSearch for the best model: saving and loading parameters - python

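I would like to run the following workflow:

Choose a model for text vectorization
Define a list of parameters
Apply a pipeline with GridSearchCV over the parameters, using LogisticRegression() as a baseline, to find the best model parameters
Save the best model (parameters)
Load the best model parameters so that a range of other classifiers can be applied on top of this defined model.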

Here is code you can copy:

GridSearch:

%%time
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(0)

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=0)

# Find best Tfidf model using LR
pipeline = Pipeline([
  ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
  ('clf', LogisticRegression())
  ])

parameters = {
              'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
              'tfidf__smooth_idf': (True, False),
              'tfidf__norm': ('l1', 'l2', None),
              }

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)

print(grid.best_params_)

# Save model
#joblib.dump(grid.best_estimator_, 'best_tfidf.pkl', compress = 1) # this unfortunately includes the LogReg
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters

Fitting 2 folds for each of 24 candidates, totalling 48 fits
{'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}

Loading the model with the best parameters:

from sklearn.model_selection import GridSearchCV

# Load best parameters
tfidf_params = joblib.load('best_tfidf.pkl')

pipeline = Pipeline([
  ('vec', TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**tfidf_params)), # here is the issue?
  ('clf', LogisticRegression())
  ])

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % (np.mean(cval)))

ValueError: Invalid parameter tfidf for estimator TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=,
        smooth_idf=True, stop_words=None, strip_accents=None,
        sublinear_tf=False, token_pattern='(?u)\b\w\w+\b',
        tokenizer=None, use_idf=True, vocabulary=None). Check the list of available parameters with `estimator.get_params().keys()`.

Question:

How do I load the best parameters for the Tfidf model?

Answer

This line:

joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters

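saves the parameters of the pipeline, not those of the TfidfVectorizer. The keys carry the step prefix (for example tfidf__max_df), which TfidfVectorizer.set_params() does not recognize. Set them on the pipeline instead: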

pipeline = Pipeline([
  # Change the name to be same as before
  ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
  ('clf', LogisticRegression())
  ])

pipeline.set_params(**tfidf_params)
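
If the vectorizer has to live under a different step name (such as 'vec' in the question), another option is to strip the step prefix from the saved keys and set the parameters on the TfidfVectorizer directly. The following is only a minimal sketch under some assumptions: strip_step_prefix is a hypothetical helper, not part of scikit-learn; the dict literal stands in for joblib.load('best_tfidf.pkl') and copies the values printed above; MultinomialNB is just one example of another classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def strip_step_prefix(params, step):
    """Hypothetical helper: keep one step's parameters and drop the 'step__' prefix."""
    prefix = step + "__"
    return {key[len(prefix):]: value
            for key, value in params.items()
            if key.startswith(prefix)}


# Stand-in for joblib.load('best_tfidf.pkl'); values copied from the grid search output
tfidf_params = {'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}

# {'smooth_idf': ..., 'norm': ..., 'max_df': ...} without the pipeline step name
vec_params = strip_step_prefix(tfidf_params, 'tfidf')

# These names are valid for the vectorizer itself, so set_params() accepts them
vectorizer = TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**vec_params)

# The configured vectorizer can now sit under any step name, with any classifier
pipeline = Pipeline([
    ('vec', vectorizer),
    ('clf', MultinomialNB()),
])
```

The pipeline built this way can be passed to cross_val_score exactly as in the question. Saving only the vectorizer's own parameter names would also avoid the prefix handling on load.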
