Decision Tree

2019-11-14 08:56:11

Contributed by: a site user

Decision Tree Classifier

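from sklearn.tree import DecisionTreeClassifier as DTC

# df is assumed to expose the labels as `target` and the feature columns as `features`
y = df.target
X = df.features
# Split on information entropy; a node needs at least 20 samples before it may be split
dtc = DTC(criterion='entropy', min_samples_split=20, random_state=90)
dtc.fit(X, y)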

official example

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
# 10-fold cross-validation; returns one score per fold
cross_val_score(clf, iris.data, iris.target, cv=10)

Visualizing the tree

article

Graphviz must be installed beforehand, because the code calls its `dot` command.

import subprocess
import sys

from PIL import Image
from sklearn.tree import export_graphviz


def visualize_tree(tree, feature_names):
    """Create tree png using graphviz.

    Args
    ----
    tree -- scikit-learn DecisionTree.
    feature_names -- list of feature names.

    Usage
    -----
    features = X.columns
    visualize_tree(dtc, features)
    """
    with open("dt.dot", 'w') as f:
        export_graphviz(tree, out_file=f, feature_names=feature_names)

    # generate png
    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    # or pdf
    # command = ["dot", "-Tpdf", "dt.dot", "-o", "dt.pdf"]
    try:
        subprocess.check_call(command)
    except (subprocess.CalledProcessError, OSError):
        sys.exit("Could not run dot, ie graphviz, to produce visualization")

    # open the image that dot just wrote
    im = Image.open("dt.png")
    im.show()
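When Graphviz is not available, a fitted tree can also be inspected as plain text. The sketch below assumes scikit-learn 0.21 or later, where `sklearn.tree.export_text` exists, and uses the iris data set and a depth limit of 2 purely for illustration; neither choice comes from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data and depth limit (assumptions, not from the original post)
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the split rules as an indented text report, one line per node
report = export_text(clf, feature_names=list(iris.feature_names))
print(report)
```

Each line of the report is one node: internal nodes show the threshold test, leaves show the predicted class.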

Decision Tree Regression

example

DecisionTreeRegressor

Decision Tree Regression with AdaBoost

from sklearn.tree import DecisionTreeRegressor

regr = DecisionTreeRegressor(max_depth=2)
regr.fit(X, y)
y_predict = regr.predict(X_test)
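The snippet above fits a single tree only. As a rough sketch of the AdaBoost variant named in the link list, the code below wraps the same kind of shallow tree in `sklearn.ensemble.AdaBoostRegressor`. The noisy sine data, the number of estimators and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(1)
X = np.linspace(0, 6, 100)[:, np.newaxis]
y = np.sin(X).ravel() + rng.normal(0, 0.1, X.shape[0])

# A single shallow tree, and up to 50 boosted trees of the same depth
single_tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
boosted = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=2), n_estimators=50, random_state=rng
).fit(X, y)

X_test = np.linspace(0, 6, 20)[:, np.newaxis]
print(single_tree.predict(X_test)[:3])
print(boosted.predict(X_test)[:3])
```

The base tree is passed positionally because the keyword for it differs between scikit-learn versions.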

ID3 (Iterative Dichotomiser 3)

Attribute set $A=\{a_1,a_2,\dots,a_d\}$, for example {height, weight, nearsighted}

Sample set $D=\{(x_1;y_1),(x_2;y_2),\dots,(x_m;y_m)\}$, for example {(height 175, weight 63, nearsighted 1; does not meet the hiring requirements 0), …}

Splitting $D$ on some attribute $a$ yields the subsets $D^1,D^2,\dots$

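information entropy

$Ent(D)=-\sum\limits_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k$

Here $|\mathcal{Y}|$ is the number of classes and $p_k$ is the proportion of class-$k$ samples in the current sample set $D$.

The smaller $Ent(D)$ is, the higher the purity of $D$.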

At the root node of the decision tree, $D$ contains all samples. If $y$ takes only the values 0 and 1, with 3 positive and 2 negative samples, then $Ent(D)=-(\frac{2}{5}\log_2\frac{2}{5}+\frac{3}{5}\log_2\frac{3}{5})\approx 0.971$
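The entropy formula can be checked numerically. The following is a minimal sketch using only the standard library; the function name `entropy` and the label encoding (1 for positive, 0 for negative) are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Return the information entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Ent(D) = sum over classes of -p_k * log2(p_k), with p_k the share of class k
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Root node from the text: 3 positive samples and 2 negative samples
root_labels = [1, 1, 1, 0, 0]
print(round(entropy(root_labels), 3))  # 0.971

# A node that holds a single class is pure, so its entropy is 0
pure_labels = [1, 1, 1, 1]
print(entropy(pure_labels))
```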

information gain

Splitting $D$ on some attribute $a$ yields the subsets $D^v\ (v=1,2,\dots,V)$, and the information gain is

$Gain(D,a)=Ent(D)-\sum\limits_{v=1}^V\frac{|D^v|}{|D|}Ent(D^v)$

The larger the gain, the greater the increase in purity obtained by the split.
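The gain formula translates almost directly into code. The sketch below assumes each sample is a dict of attribute values; the helper names and the four-row toy data set are hypothetical and only serve to show one attribute that separates the classes perfectly and one that does not help at all.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Gain(D, a): entropy of D minus the size-weighted entropy of each subset D^v."""
    # Group the labels by the value each row takes on the chosen attribute
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attribute]].append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: "nearsighted" decides the label, "tall" is uninformative
rows = [
    {"nearsighted": "yes", "tall": "yes"},
    {"nearsighted": "yes", "tall": "no"},
    {"nearsighted": "no", "tall": "yes"},
    {"nearsighted": "no", "tall": "no"},
]
labels = [0, 0, 1, 1]

print(information_gain(rows, labels, "nearsighted"))  # 1.0
print(information_gain(rows, labels, "tall"))  # 0.0
```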

example

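Suppose there are three attributes, A = {behavioral habit, dietary preference, physical exercise}, and the task is to decide whether a person will get a certain disease.

In total, 6 people have the disease and 9 do not.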

Behavioral habit	Diseased	Not diseased	Share of diseased within this habit	Share of all people with this habit
Smoking	1	5	1/6	6/15
Drinking	2	3	2/5	5/15
Drug use	3	1	3/4	4/15

$Ent(D)=-(\frac{6}{15}\log_2\frac{6}{15}+\frac{9}{15}\log_2\frac{9}{15})$

Splitting on behavioral habit gives three subsets, smoking, drinking and drug use, written $D^1,D^2,D^3$:

$Ent(D^1)=-(\frac{1}{6}\log_2\frac{1}{6}+\frac{5}{6}\log_2\frac{5}{6})$

$Ent(D^2)$ and $Ent(D^3)$ are computed in the same way.

$Gain(D,\text{behavioral habit})=Ent(D)-(\frac{6}{15}Ent(D^1)+\frac{5}{15}Ent(D^2)+\frac{4}{15}Ent(D^3))$
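To put numbers on this example, the sketch below evaluates the entropies and the gain directly from the counts in the table. The function and variable names are assumptions made for illustration.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a node, given the number of samples in each class."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


# (diseased, not diseased) counts per behavioral habit, taken from the table
habit_counts = {"smoking": (1, 5), "drinking": (2, 3), "drug use": (3, 1)}

total_diseased = sum(c[0] for c in habit_counts.values())  # 6
total_healthy = sum(c[1] for c in habit_counts.values())  # 9
n = total_diseased + total_healthy  # 15

ent_d = entropy_from_counts((total_diseased, total_healthy))
# Weight each subset's entropy by its share of all samples
weighted = sum(sum(c) / n * entropy_from_counts(c) for c in habit_counts.values())
gain = ent_d - weighted

print(f"Ent(D) = {ent_d:.3f}")
for habit, c in habit_counts.items():
    print(f"Ent({habit}) = {entropy_from_counts(c):.3f}")
print(f"Gain(D, behavioral habit) = {gain:.3f}")
```

With these counts, $Ent(D)$ comes out at about 0.971 and the gain at about 0.171.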

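Next, compute $Gain(D,\text{dietary preference})$ and $Gain(D,\text{physical exercise})$ in the same way.

Suppose $Gain(D,\text{behavioral habit})>Gain(D,\text{dietary preference})>Gain(D,\text{physical exercise})$, so behavioral habit is chosen as the split attribute.

Then take each of $D^1,D^2,D^3$ as the new $D$, with the remaining attributes A = {dietary preference, physical exercise}, and iterate: compute $Gain(D,\text{dietary preference})$ and $Gain(D,\text{physical exercise})$ on each subset.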
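The iteration described above can be sketched as a short recursive function. This is a simplified illustration and not a full ID3 implementation: it assumes categorical attributes stored in dicts, stops when a node is pure or no attributes remain (falling back to a majority vote), and does no pruning. The six-row data set is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, labels, attr):
    result = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        result -= len(subset) / len(labels) * entropy(subset)
    return result


def id3(rows, labels, attributes):
    """Return a nested dict {attribute: {value: subtree}}, or a class label at a leaf."""
    if len(set(labels)) == 1:  # pure node: stop splitting
        return labels[0]
    if not attributes:  # no attributes left: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the largest information gain
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        # Each subset D^v becomes the new D, with the chosen attribute removed
        tree[best][value] = id3([rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return tree


# Hypothetical toy data (not the table from the text)
rows = [
    {"habit": "smoking", "exercise": "yes"},
    {"habit": "smoking", "exercise": "no"},
    {"habit": "drinking", "exercise": "yes"},
    {"habit": "drinking", "exercise": "no"},
    {"habit": "drug use", "exercise": "yes"},
    {"habit": "drug use", "exercise": "no"},
]
labels = [0, 0, 0, 1, 1, 1]

tree = id3(rows, labels, ["habit", "exercise"])
print(tree)
```

On this toy data the root split is on `habit`, and only the mixed `drinking` subset needs a second split on `exercise`.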