pythonoob: a small question about backslashes in Python

① Could a Python expert explain this code? I have never used Python and cannot follow it.

This is a constructor.
(self,
n_estimators=10,
criterion="gini",
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.,
max_features="auto",
max_leaf_nodes=None,
bootstrap=True,
oob_score=False,
n_jobs=1,
random_state=None,
verbose=0,
warm_start=False,
class_weight=None):
These are the constructor's parameters, each with a default value.

super(RandomForestClassifier, self).__init__(
base_estimator=DecisionTreeClassifier(),
n_estimators=n_estimators,
estimator_params=("criterion", "max_depth", "min_samples_split",
"min_samples_leaf", "min_weight_fraction_leaf",
"max_features", "max_leaf_nodes",
"random_state"),
bootstrap=bootstrap,
oob_score=oob_score,
n_jobs=n_jobs,
random_state=random_state,
verbose=verbose,
warm_start=warm_start,
class_weight=class_weight)

super calls the base-class constructor; you can treat this list as the arguments passed to the base-class constructor.

self.criterion = criterion
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
self.max_features = max_features
self.max_leaf_nodes = max_leaf_nodes

These lines are attribute assignments.
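The same three-part pattern (parameters with defaults, a call to the base-class constructor through super, then attribute assignment) can be sketched with two toy classes. BaseEnsemble and ToyForest are hypothetical stand-ins made up for illustration; they are not scikit-learn's real classes.

```python
class BaseEnsemble:
    """Hypothetical base class standing in for the real parent class."""

    def __init__(self, n_estimators=10, bootstrap=True):
        self.n_estimators = n_estimators
        self.bootstrap = bootstrap


class ToyForest(BaseEnsemble):
    """Hypothetical subclass that mirrors the structure of the snippet."""

    def __init__(self, n_estimators=10, criterion="gini", max_depth=None,
                 bootstrap=True):
        # Forward the shared arguments to the base-class constructor.
        super().__init__(n_estimators=n_estimators, bootstrap=bootstrap)
        # Store the subclass-specific arguments as attributes.
        self.criterion = criterion
        self.max_depth = max_depth


default_forest = ToyForest()  # every parameter takes its default value
custom_forest = ToyForest(n_estimators=50, max_depth=3)
print(default_forest.n_estimators, default_forest.criterion)
print(custom_forest.n_estimators, custom_forest.max_depth)
```

Calling ToyForest() with no arguments works because every parameter has a default; passing a keyword argument overrides only that one value.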

② How do I use the selenium package when writing a crawler in Python?

from selenium import webdriver # drives the browser
from selenium.webdriver import ActionChains # used to drag images when solving slider captchas
from selenium.webdriver.common.by import By # how to locate elements: By.ID, By.CSS_SELECTOR
from selenium.webdriver.common.keys import Keys # keyboard key actions
from selenium.webdriver.support import expected_conditions as EC # used together with WebDriverWait below
from selenium.webdriver.support.wait import WebDriverWait # waits for certain page elements to load

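③ Feature selection (random forest)

A random forest can measure the importance of each feature, and we can use that importance score to select the most important features. sklearn already implements feature-importance evaluation with random forests: after training a random forest model, read its feature_importances_ attribute to get the importance of each feature.

In general a data set has hundreds or thousands of features, so it is necessary to pick out the ones with the largest effect on the result for further modeling. Related methods include principal component analysis and lasso; here we cover selection with a random forest.

The idea behind random-forest feature importance is fairly simple: see how much each feature contributes in each tree of the forest, take the average, and then compare the contributions of the different features.

Contribution is usually measured with either the Gini index (gini) or the out-of-bag (OOB) error rate.

Related topic: the weighted random forest (used to raise the recognition rate of the minority class and thereby improve overall classification accuracy)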

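Random forests and CART trees generally use the Gini value as the criterion for splitting nodes. In a weighted random forest (WRF), the essence of the weights is to give the minority class a larger weight and the majority class a smaller one, that is, to penalize errors on the minority class more heavily. The weights serve two purposes. The first is in split selection, where the Gini value is computed with the weights; the expression is as follows:

Here N denotes the unsplit node, N_L and N_R denote the left and right nodes after the split, W_i is the class weight of class i, n_i is the number of samples of each class in the node, and Δi is the decrease in impurity; the larger this value, the better the split.

The second is at the terminal nodes, where the class weights are used to determine the class label; the expression is as follows:

Reference: 随机森林针对小样本数据类权重设置 (Class-weight settings in random forests for small-sample data) https://wenku..com/view/.html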

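Here we evaluate with the Gini value. Denote the variable importance measure by VIM and the Gini value by GI, and suppose there are c features X_1, X_2, ..., X_c. We want the Gini-based score VIM_j of each feature X_j, that is, the average change in node-splitting impurity due to the j-th feature over all decision trees in the forest. The Gini index is computed as:

$$GI_m = 1 - \sum_{k=1}^{K} p_{mk}^2$$

where K is the number of classes and p_{mk} is the proportion of class k in node m (the change in Gini value is computed node by node).

The importance of feature X_j at node m, i.e. the change in the Gini index before and after the split at node m, is:

$$VIM_{jm} = GI_m - GI_l - GI_r$$

where GI_l and GI_r are the Gini indices of the two new nodes after the split.

If the nodes at which feature X_j appears in decision tree i form the set M, then the importance of X_j in the i-th tree is:

$$VIM_{ij} = \sum_{m \in M} VIM_{jm}$$

Suppose the random forest has n trees; then:

$$VIM_j = \sum_{i=1}^{n} VIM_{ij}$$

Finally, normalize all the importance scores obtained to get the final importance score:

$$VIM_j \leftarrow \frac{VIM_j}{\sum_{i=1}^{c} VIM_i}$$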
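As a rough illustration of the Gini-based score described above, the sketch below computes the Gini index of a node, the change at one split (in the unweighted form GI_m minus GI_l minus GI_r used in the text), and the final normalization. The function names and the per-node numbers in raw_scores are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_change(parent, left, right):
    """Change in Gini index at one node, in the unweighted form used in the text."""
    return gini(parent) - gini(left) - gini(right)


# A perfectly separating split of a balanced two-class node.
print(gini([0, 0, 1, 1]))                           # 0.5
print(gini_change([0, 0, 1, 1], [0, 0], [1, 1]))    # 0.5

# Hypothetical per-node changes collected for two features across all trees.
raw_scores = {"X1": [0.5, 0.25], "X2": [0.25]}
totals = {name: sum(values) for name, values in raw_scores.items()}
grand_total = sum(totals.values())
normalized = {name: value / grand_total for name, value in totals.items()}
print(normalized)                                   # {'X1': 0.75, 'X2': 0.25}
```

Library implementations usually weight each child node by its share of the samples; the unweighted difference is kept here only to match the description in the text.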

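Getting feature importances from the random forest in sklearn:

Fortunately, sklearn has already packaged all of this for us; we only need to call its functions.
We use the wine data set from UCI as the example. First, load the data set.

Then let's take a rough look at what kind of data set this is.

The output is

So there are 3 classes. Next, look at the data set's info:

The output is

So, excluding the class label, there are 13 features, and the data set has 178 samples.
Following the usual practice, split the data set into a training set and a test set.

With that, the random forest is trained, and the feature importances have been computed along the way. Let's pull them out and take a look.

The output is

Yes, it is that convenient.
To pick out the variables with higher importance, do the following

The output is

See, it has picked out the 3 features with importance greater than 0.15 for us.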
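The original code for these steps was not preserved, so here is a minimal sketch of what such a walkthrough could look like. It assumes scikit-learn is installed and uses load_wine, the copy of the UCI wine data bundled with scikit-learn, rather than downloading the file. The split ratio, n_estimators, and random_state values are assumptions, so the number of features passing the 0.15 threshold may differ from the 3 reported above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load_wine ships the wine data with scikit-learn: 178 samples, 13 features, 3 classes.
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
print(X.shape, sorted(set(y)))

# Hold out 30% of the samples as a test set (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# feature_importances_ holds one normalized score per feature.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:30s} {score:.4f}")

# Keep only the features whose importance exceeds the threshold.
threshold = 0.15
selected = [name for name, score in zip(feature_names, importances) if score > threshold]
print("selected:", selected)
```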


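④ How to add a piece of code in WordPress

Step 1: Download the WordPress plugin WP-Syntax.
Step 2: When editing a post, switch to the HTML editor and insert <pre lang="LANGUAGE" line="0"></pre>, placing the code you want to show between the two tags. Replace LANGUAGE with the language name, for example php or java. With line="0" line numbers are hidden; with line="1" they are shown.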
Appendix: the supported languages are: abap, actionscript, actionscript3, ada, apache, applescript, apt_sources, asm, asp, autoit, avisynth, bash, bf, bibtex, blitzbasic, bnf, boo, c, c_mac, caddcl, cadlisp, cil, cfdg, cfm, cmake, cobol, cpp-qt, cpp, csharp, css, d, dcs, delphi, diff, div, dos, dot, eiffel, email, erlang, fo, fortran, freebasic, genero, gettext, glsl, gml, gnuplot, groovy, haskell, hq9plus, html4strict, idl, ini, inno, intercal, io, java, java5, javascript, kixtart, klonec, klonecpp, latex, lisp, locobasic, lolcode, lotusformulas, lotusscript, lscript, lsl2, lua, m68k, make, matlab, mirc, mola3, mpasm, mxml, mysql, nsis, oberon2, objc, ocaml-brief, ocaml, oobas, oracle11, oracle8, pascal, per, pic16, pixelbender, perl, php-brief, php, plsql, povray, powershell, progress, prolog, properties, providex, python, qbasic, rails, rebol, reg, robots, ruby, sas, scala, scheme, scilab, sdlbasic, smalltalk, smarty, sql, tcl, teraterm, text, thinbasic, tsql, typoscript, vb, vbnet, verilog, vhdl, vim, visualfoxpro, visualprolog, whitespace, whois, winbatch, xml, xorg_conf, xpp, z80

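⑤ What does this code mean? Is it a Python script? What does call mean here? Experts, please help

It looks like one script from a fairly large project. First, to be clear, this is a Windows batch script, not a Python script.
Skimming its contents, it seems to belong to a piece of data-processing software (BI, i.e. business intelligence). Software like this may depend on existing data-processing libraries (most likely written in Python), and the job of this script is to call a series of tools that convert the Python code into a Windows exe.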

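⑥ A small question about backslashes in Python

The reason is simple. runoob prints just runoob because the o in run[o]ob is not an escape character.

Next, in ru\noob the [\n] in ru[\n]oob means a line break, so the result is:

1-[ru
2-oob]

If you do not want the line break, you can escape the backslash itself with another backslash (\\).

Similar to \n (newline), there are other escape sequences, one for tab-style spacing within a string and another that I have not looked into yet (you can experiment with it).

Try them all and see the effect.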
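A short sketch of the behaviour discussed above: how Python treats \n inside a normal string literal, and two ways to keep the backslash literal. The variable names are just for illustration.

```python
# "\n" inside a normal string literal is a single newline character.
with_newline = "ru\noob"
print(with_newline)            # printed on two lines: "ru" then "oob"

# Doubling the backslash keeps a literal backslash followed by "n".
escaped = "ru\\noob"
print(escaped)                 # ru\noob

# A raw string (r prefix) turns off escape processing altogether.
raw = r"ru\noob"
print(raw)                     # ru\noob

print(len(with_newline), len(escaped), escaped == raw)   # 6 7 True
```

The raw-string form is usually the more readable choice for Windows paths and regular expressions, where backslashes are frequent.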

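⑦ Switching fields for grad school: software engineering or CS? Software engineering is easier to get into and needs less catch-up, but CS lets you do AI. How do I choose?

Actually, you can do AI from software engineering too; these days AI is not reserved for people with a CS background. Many software engineering courses are close to the CS ones, and in some respects go deeper. Honestly, it comes down to how you study over the next three years: whether you read software engineering or CS, you will have to put in the work on AI yourself, since AI programs at universities in China are only just getting started.
As for the worry that programmers get replaced in their thirties: do not expect one job to feed you for life. With experience you can move into management or start your own business, and if you truly have the skills, you are irreplaceable.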

⑧ Bagging (the bagging algorithm)

In the previous passage, I talked about the concept of the decision tree and its use. Although decision trees are very powerful models that can handle both regression and classification tasks, they usually suffer from high variance. This means that if we split the dataset into two halves at random and fit a decision tree to each half, we will get quite different results. Thus we need an approach that reduces variance at the expense of some bias.

Bagging, which is designed for this context, is a procedure for reducing the variance of weak models. In bagging, a random sample of the training set is drawn with replacement, which means individual data points can be chosen more than once, and then we fit a weak learner, such as a decision tree, to each sample. Finally, we aggregate the predictions of the base learners to get a more accurate estimate.

We build B distinct sample datasets from the training set by bootstrapping, compute a prediction from each of the B training sets, and average them to obtain a low-variance statistical model:

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$$

While bagging can reduce the variance of many models, it is particularly useful for decision trees. To apply bagging, we simply construct B separate bootstrapped training sets and train B individual decision trees on them. Each tree can grow very deep without pruning, so each has high variance but low bias. Hence, averaging these trees reduces variance.

Bagging has three steps: bootstrapping, parallel training, and aggregating.
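The three steps can be sketched in plain Python. This is only an illustration under stated assumptions: the weak learner is a one-split regression stump (a stand-in for the deep trees mentioned above), the data are a made-up noisy step function, and fit_stump and bagged_predictor are hypothetical helper names.

```python
import random
import statistics


def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs; return a predict function."""
    points = sorted(points)
    best = None
    for i in range(1, len(points)):
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            split = (points[i - 1][0] + points[i][0]) / 2
            best = (error, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def bagged_predictor(train, n_models, rng):
    models = []
    for _ in range(n_models):
        # Step 1, bootstrapping: draw len(train) points with replacement.
        sample = [rng.choice(train) for _ in range(len(train))]
        # Step 2, training: each model is fit independently (this could run in parallel).
        models.append(fit_stump(sample))
    # Step 3, aggregating: average the individual predictions.
    return lambda x: statistics.fmean([model(x) for model in models])


rng = random.Random(0)
# Made-up data: a step from 0 to 1 at x = 0.5, plus Gaussian noise.
train = [(x / 50, (1.0 if x >= 25 else 0.0) + rng.gauss(0, 0.3)) for x in range(50)]
predict = bagged_predictor(train, n_models=25, rng=rng)
print(round(predict(0.1), 3), round(predict(0.9), 3))
```

For classification the aggregation step would typically be a majority vote instead of an average.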

There are a number of key benefits for bagging, including:

The key disadvantages of bagging are:

Now we practice using bagging to improve model performance. The scikit-learn Python machine learning library provides easy access to the bagging method.

First, we use the make_classification function to construct a classification dataset for practicing bagging.

Here, we make a binary classification dataset with 3000 observations and 30 input features, then split it into training and test sets with the following shapes:

(2250, 30) (750, 30) (2250,) (750,)

To demonstrate the benefits of the bagging model, we first build a single decision tree and compare it to the bagging model.

Now we construct an ensemble model using the bagging technique.

Based on the result, we can see that the ensemble model reduces both bias (higher accuracy) and variance (lower std). The bagging model's accuracy is 0.066 higher than that of a single decision tree.

Make Prediction

BaggingClassifier can make predictions for new cases using the predict function.

Next we build a bagging model for regression. Similarly, we use the make_regression function to make a dataset for the regression problem.

As before, we use repeated k-fold cross-validation to evaluate the model. One thing differs from the classification case: the cross-validation routine expects a utility function rather than a cost function. In other words, it treats greater as better rather than smaller.

scikit-learn therefore makes metrics such as neg_mean_squared_error negative, so that they are maximized instead of minimized. This means that a negative MSE closer to zero is better. We can flip the sign of the score to report it as a positive MSE.

The mean squared error for decision tree is and variance is .

On the other hand, a bagging regressor performs much better than a single decision tree. The mean squared error is and variance is . Bagging reduces both bias and variance.

In this section, we explore how to tune the hyperparameters for the bagging model.

We demonstrate this by performing a classification task.

Recall that the bagging is implemented by building a number of bootstrapped samples, and then building a weak learner for each sample data. The number of models we build corresponds to the parameter n_estimators .

Generally, the number of estimators can increase constantly until the performance of the ensemble model converges. And it is worth noting that using a very large number of n_estimators will not lead to overfitting.

Now let's try a different number of trees and examine the change in performance of the ensemble model.

Number of Trees 10: 0.862 0.038
Number of Trees 50: 0.887 0.025
Number of Trees 100: 0.888 0.027
Number of Trees 200: 0.89 0.027
Number of Trees 300: 0.888 0.027
Number of Trees 500: 0.888 0.028
Number of Trees 1000: 0.892 0.027
Number of Trees 2000: 0.889 0.029

Let's look at the distribution of scores

In this case, we can see that the performance of the bagging model converges to 0.888 when we grow 100 trees. The accuracy becomes flat after 100.

Now let's explore the number of samples in bootstrapped dataset. The default is to create the same number of samples as the original train set.

Number of Trees 0.1: 0.801 0.04
Number of Trees 0.2: 0.83 0.039
Number of Trees 0.30000000000000004: 0.849 0.029
Number of Trees 0.4: 0.842 0.031
Number of Trees 0.5: 0.856 0.039
Number of Trees 0.6: 0.866 0.037
Number of Trees 0.7000000000000001: 0.856 0.033
Number of Trees 0.8: 0.868 0.036
Number of Trees 0.9: 0.866 0.025
Number of Trees 1.0: 0.865 0.035

Similarly, look at the distribution of scores

The rule of thumb is to set max_samples to 1.0, but this does not mean every training observation will be selected. Since bootstrapping draws data from the training set at random with replacement, only about 63% of the training instances are sampled on average for each predictor, while the remaining 37% are not sampled and are thus called out-of-bag instances.
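The 63% figure can be checked with a quick simulation. The sketch below draws one bootstrap sample of indices from a hypothetical training set of 10,000 rows and counts how many distinct rows were picked; the set size and seed are arbitrary choices.

```python
import random

rng = random.Random(42)
n = 10_000

# Draw n indices with replacement, as one bootstrap sample of an n-row training set.
drawn = {rng.randrange(n) for _ in range(n)}

in_bag_fraction = len(drawn) / n
print(f"in-bag: {in_bag_fraction:.3f}, out-of-bag: {1 - in_bag_fraction:.3f}")
```

The chance that a given row is never drawn is (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows, so the in-bag share tends to about 0.632.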

Since each predictor never sees its oob samples during training, the ensemble can be evaluated on these instances without additional cross-validation after training. We can use out-of-bag evaluation in scikit-learn by setting oob_score=True .

Let's try to use the out-of-bag score to evaluate a bagging model.

According to this oob evaluation, this BaggingClassifier is likely to achieve about 87.6% accuracy on the test set. Let’s verify this:
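The code for this check was not preserved, so the following is a minimal sketch of how the comparison could be run, assuming scikit-learn is installed. The dataset settings, n_estimators, and seeds are assumptions chosen to match the shapes shown earlier, so the numbers it prints will not necessarily match the 87.6% quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data: 3000 rows, 30 features (assumed settings).
X, y = make_classification(n_samples=3000, n_features=30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# The default base learner is a decision tree; oob_score=True requests the OOB estimate.
bagging = BaggingClassifier(n_estimators=50, oob_score=True, random_state=1)
bagging.fit(X_train, y_train)

oob_accuracy = bagging.oob_score_
test_accuracy = bagging.score(X_test, y_test)
print("OOB accuracy: ", round(oob_accuracy, 3))
print("Test accuracy:", round(test_accuracy, 3))
```

The two numbers are usually close, which is what makes the out-of-bag score a convenient substitute for a separate validation set.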

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

The random sampling of features is particularly useful for high-dimensional inputs, such as images. Randomly sampling both features and instances is called Random Patches. On the other hand, keeping all instances (bootstrap=False, max_samples=1.0) and sampling features (bootstrap_features=True, max_features smaller than 1.0) is called Random Subspaces.

Random subspaces ensemble is an extension to bagging ensemble model. It is created by a subset of features in the training set. Very similar to Random Forest , random subspace ensemble is different from it in only two aspects:

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.


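⑨ Data mining in practice: using the random forest algorithm

Reading guide:

Recently some readers asked whether there is a data mining case to practice on, mainly to learn how the algorithm is used through a worked case.

Below we take Hong Kong IPO subscription (港股打新), a finance project, as the example and make a prediction. First, what is IPO subscription? It means using funds to subscribe to newly issued shares; if you win an allotment, you have bought shares that are about to list.

The goal of this analysis is to dig into the IPO data, find the best algorithm, uncover the key factors that affect IPO subscription, and identify new shares that are likely to fall below their issue price, thereby reducing that risk and improving returns.

In essence, IPO subscription means selling the shares after listing and pocketing the difference. The shares are usually sold on the first day. Of course, on the first day some new shares rise and some fall, so to reduce risk we make a forecast from historical data. Here a first-day gain below 10% is labeled 0 and a gain of 10% or more is labeled 1, so this is clearly a binary classification problem.

For this project, the final evaluation criterion is to maximize recall subject to precision reaching 97%. The aim is to raise recall as far as possible; I am personally risk-averse and would rather wrongly reject a share than subscribe to one that falls below its issue price on listing.

For evaluation, PR curves and ROC curves are commonly used. A notable advantage of the ROC curve is that it is not affected by class imbalance.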

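1. Overview of the data

The Hong Kong stock data come mainly from two sources, Livermore Securities (利弗莫尔证券) data and the last two years of AAStocks (阿思达克) sponsor data. After processing they look like this:

The data have 17 columns in total: the target variable is_profit plus 16 features.

The indicators above fall into two groups: stock-related indicators and sponsor indicators.

2. Data processing (can be skipped)
Feature engineering generally covers the following: derived features, outlier handling, missing-value handling, discretization of continuous features, one-hot encoding of categorical variables, standardization, and so on. This article focuses on using the random forest algorithm, so feature engineering is not shown in detail.

Judging by the results of a model built with the random forest default parameters, the AUC is 0.76, which is decent.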

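To better understand the above, a few points need explaining:

predict_proba returns an array with n rows and k columns; the value in row i, column j is the model's predicted probability that sample i has label j, so each row sums to 1. In this article, predict_proba(x_test)[:,1] returns the probability of label 1.

(a) Confusion matrix

In the confusion matrix shown below, "0" and "1" denote negative and positive samples. FP is the number of samples whose actual label is "0" but whose predicted label is "1"; the other entries are defined analogously.

(b) False positive rate and true positive rate

The false positive rate (FPR) is the proportion of samples with actual label "0" that are predicted incorrectly. The true positive rate (TPR) is the proportion of samples with actual label "1" that are predicted correctly. The formulas are:

$$FPR = \frac{FP}{FP + TN}, \qquad TPR = \frac{TP}{TP + FN}$$

(c) ROC curve

The black line in the figure below is the ROC curve: the line connecting the (FPR, TPR) points obtained under a series of thresholds. The thresholds here are the predicted probabilities of the samples in the test set, taken in descending order. Each threshold gives a different FPR and TPR, and the more test samples there are, the smoother the curve:

AUC (Area Under the ROC Curve) is, as the name says, the area under the ROC curve; in this example AUC = 0.62. The larger the AUC, the better the classifier.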
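To make the construction concrete, the sketch below builds the (FPR, TPR) points by using each predicted score as a threshold from high to low, then integrates the curve with the trapezoidal rule. The labels and scores are a made-up toy example, and roc_points and auc are illustrative helper names rather than library functions.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, using each predicted score as a threshold, high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy data: three positives and three negatives with made-up scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print("AUC:", round(auc(points), 4))   # 0.8889, i.e. 8 of the 9 positive/negative pairs are ranked correctly
```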

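Next, the important Bagging-framework parameters of RF. The main ones are:

(1) n_estimators:

The maximum number of weak learners. Generally, if n_estimators is too small the model tends to underfit; if it is too large the computation becomes expensive, and beyond a certain number the gain from increasing n_estimators further is small, so a moderate value is usually chosen. The default is 100.

(2) oob_score:

Whether to use out-of-bag samples to evaluate the model. The default is False. I recommend setting it to True, because the out-of-bag score reflects the fitted model's ability to generalize.

(3) criterion:

The criterion used to evaluate features when a CART tree splits. Classification and regression models use different loss functions. For classification RF, the CART classification tree defaults to the Gini index (gini); the other option is information gain (entropy). For regression RF, the CART regression tree defaults to mean squared error (mse); the other option is mean absolute error (mae). In general the default is good enough.

As shown above, RF has few important framework parameters; the main one to watch is n_estimators, the maximum number of decision trees in the RF.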

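Now the decision-tree parameters of RF. The ones to tune are:

(1) Maximum number of features considered at a split, max_features:

(2) Maximum tree depth, max_depth:

Optional. If omitted, the depth of subtrees is not limited while the tree is built. Generally, with little data or few features this value can be left alone. With many samples and many features, limiting the maximum depth is recommended; the specific value depends on the distribution of the data. Values between 10 and 100 are common.

(3) Minimum number of samples required to split an internal node, min_samples_split:

This value limits when a subtree may keep splitting: if a node has fewer samples than min_samples_split, no further attempt is made to choose the best feature and split it. The default is 2. If the sample size is small, this value can be left alone. If the sample size is very large, increasing it is recommended.

(4) Minimum number of samples in a leaf node, min_samples_leaf:

This value limits the minimum number of samples in a leaf: if a leaf would have fewer samples than this value, it is pruned together with its sibling. The default is 1. It accepts either an integer minimum count or the minimum as a fraction of the total number of samples. If the sample size is small, this value can be left alone. If the sample size is very large, increasing it is recommended.

(5) Minimum weighted fraction of samples in a leaf node, min_weight_fraction_leaf:

This value limits the minimum sum of sample weights in a leaf; if the sum falls below it, the leaf is pruned together with its sibling. The default is 0, meaning weights are not considered. Generally, if many samples have missing values, or the class distribution of the classification samples is heavily skewed, sample weights are introduced, and then this value needs attention.

(6) Maximum number of leaf nodes, max_leaf_nodes:

Limiting the maximum number of leaf nodes helps prevent overfitting. The default is None, meaning no limit. If a limit is set, the algorithm builds the best tree within that number of leaves. With few features this value can be ignored, but with a very large number of features it can be limited; the specific value can be found by cross-validation.

(7) Minimum impurity for a split, min_impurity_split:
This value limits the growth of the tree: if a node's impurity (based on the Gini index or mean squared error) is below this threshold, the node produces no further children and becomes a leaf. Changing the default of 1e-7 is generally not recommended.

Of the decision-tree parameters above, the most important are the maximum number of features max_features, the maximum depth max_depth, the minimum samples for an internal split min_samples_split, and the minimum samples in a leaf min_samples_leaf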

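The name GridSearchCV splits into two parts, GridSearch and CV: grid search and cross-validation. Both are easy to understand. Grid search searches over parameters: within the specified ranges it steps through the parameter values in turn, trains the learner with each setting, and picks the setting with the highest accuracy on the validation set. It is essentially a train-and-compare process.

GridSearchCV is guaranteed to find the most accurate parameters within the specified ranges, but that is also its weakness: it must try every possible combination of parameters, which is very time-consuming with large data sets and many parameters.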
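The exhaustive nature of grid search can be sketched in a few lines of plain Python. This is not GridSearchCV itself: toy_validation_score is a hypothetical stand-in for "train the model with these parameters and score it by cross-validation", and the grid values are made up.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_tried = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


def toy_validation_score(n_estimators, max_depth):
    # Hypothetical stand-in for "train the model, then score it by cross-validation".
    return 1.0 - abs(n_estimators - 100) / 1000 - abs(max_depth - 7) / 100


param_grid = {"n_estimators": [10, 50, 100, 200], "max_depth": [3, 5, 7, 9]}
best_params, best_score, n_tried = grid_search(param_grid, toy_validation_score)
print(best_params, best_score, n_tried)
```

With 4 values for each of 2 parameters the search runs 16 evaluations; every extra parameter multiplies that count, which is the cost described above.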

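From the RF framework and decision-tree parameters, the parameters to focus on when tuning are:

Among the framework parameters, n_estimators, the maximum number of decision trees in the RF.

Among the decision-tree parameters, the maximum number of features max_features, the maximum depth max_depth, the minimum samples for an internal split min_samples_split, and the minimum samples in a leaf min_samples_leaf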

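The output is:

6.3 With the best number of weak learners found, next we grid search the maximum tree depth max_depth and the minimum samples for an internal split min_samples_split

Output

6.4 Tuning the maximum number of features max_features

Output:

6.5 Testing with the best model parameters

Output: 0.7805947388486466. Compared with before tuning, the model has improved. For easier inspection, here is the ROC curve

6.6 Looking at the model's important features

6.7 Maximum recall

Finally we get the result we wanted: at a precision of 0.97, recall is 0.046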
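One way to find the highest recall subject to a precision floor is to sweep the decision threshold over the predicted scores. The sketch below does this on a made-up toy example; best_recall_at_precision is an illustrative helper, not the code used for the result above.

```python
def best_recall_at_precision(y_true, scores, min_precision):
    """Sweep thresholds and return (recall, threshold) with the highest recall
    among thresholds whose precision is at least min_precision."""
    positives = sum(y_true)
    best = (0.0, None)
    for threshold in sorted(set(scores), reverse=True):
        # Labels of the samples predicted positive at this threshold.
        predicted_pos = [t for t, s in zip(y_true, scores) if s >= threshold]
        tp = sum(predicted_pos)
        precision = tp / len(predicted_pos)
        recall = tp / positives
        if precision >= min_precision and recall > best[0]:
            best = (recall, threshold)
    return best


# Toy data: four positives and four negatives with made-up scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.30, 0.10]

result = best_recall_at_precision(y_true, scores, min_precision=0.97)
print(result)   # (0.5, 0.9)
```

A strict precision floor forces a high threshold, which is why the recall that survives can be very low, as in the 0.046 reported above.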

