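Preprocessing Categorical Data

In practice, collected data is often incomplete and inconsistent, and may contain many errors. Data preprocessing is a data mining technique that cleans and transforms raw data so that it can be analyzed further.

This article covers how to handle categorical data.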

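Nominal and Ordinal Features

When handling categorical data, distinguish between nominal features and ordinal features.

Nominal features: categories that have no meaningful order among them, such as name, gender, or type of fruit.
Ordinal features: categories that can be compared and ranked, such as very satisfied / somewhat satisfied / dissatisfied, or small / medium / large. Unlike numeric features, the difference between two values is generally not meaningful.

The df variable below holds some features of T-shirts: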

>>> import pandas as pd
>>> df = pd.DataFrame([
... ['green', 'M', 10.1, 'class1'],
... ['red', 'L', 13.5, 'class2'],
... ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1

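It contains the nominal feature color, the ordinal feature size, and the numeric feature price. The last column, classlabel, holds the class label.

Mapping Ordinal Features

To make sure a learning algorithm interprets an ordinal feature correctly, the category strings have to be mapped to integers by hand, because the order cannot be derived from the strings themselves.

For the T-shirt sizes above, assuming the known order $XL > L > M$, the conversion looks like this: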

>>> size_mapping = {
... 'XL': 3,
... 'L': 2,
... 'M': 1}
>>> df['size'] = df['size'].map(size_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

For the reverse conversion, build an inverse dictionary and apply map again:

>>> inv_size_mapping = {v: k for k, v in size_mapping.items()}
>>> df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object
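One detail worth checking when mapping by dictionary: Series.map returns NaN for any value that is missing from the dictionary. The following is a minimal sketch of such a check; the extra size 'S' is hypothetical data added only to trigger the case.

```python
import pandas as pd

# Hypothetical data: 'S' is deliberately missing from the mapping.
sizes = pd.Series(['M', 'L', 'XL', 'S'])
size_mapping = {'XL': 3, 'L': 2, 'M': 1}

encoded = sizes.map(size_mapping)

# Values without a dictionary entry come back as NaN, so list them
# before passing the column to a learning algorithm.
unmapped = sizes[encoded.isna()].unique().tolist()
print(unmapped)  # ['S']
```

If the list is not empty, either extend the mapping or decide explicitly how those rows should be treated.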

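Encoding Class Labels

Many machine learning libraries require class labels to be encoded as integer values. Although scikit-learn handles this conversion internally by default, it is good practice to convert the labels explicitly.

The numeric value assigned to a class label carries no meaning, so the labels can simply be enumerated: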

>>> import numpy as np
>>> class_mapping = {label:idx for idx,label in
... enumerate(np.unique(df['classlabel']))}
>>> class_mapping
{'class1': 0, 'class2': 1}

Encode the class labels as integers:

>>> df['classlabel'] = df['classlabel'].map(class_mapping)
>>> df
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0

Reverse conversion:

>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

sklearn.preprocessing.LabelEncoder offers a more convenient way to encode class labels as integers:

>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])

Reverse conversion:

>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
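To see which integer a fitted LabelEncoder assigned to each label, its classes_ attribute can be inspected. A minimal sketch, rebuilding the label array by hand rather than reading it from df:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of each label
# in that array is the integer it is encoded as.
mapping = {str(label): idx for idx, label in enumerate(class_le.classes_)}
print(mapping)  # {'class1': 0, 'class2': 1}
```

This reproduces the class_mapping dictionary built by hand earlier.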

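One-Hot Encoding of Nominal Features

How One-Hot Encoding Works

Before introducing one-hot encoding, it is worth explaining why the encoding from the previous section should not be used for nominal features.

Suppose the color feature is encoded the same way: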

>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

The resulting encoding is:

blue = 0
green = 1
red = 2

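Feeding this data to a classifier leads to one of the most common mistakes in handling categorical data: we know that 0, 1, and 2 do not express magnitude, but the algorithm does not. During learning it treats the values as ordered, that is, it assumes red > green > blue. The algorithm can still produce usable results, but its performance will suffer.

The idea behind one-hot encoding is to create a new feature for every distinct value. In the example above, the color feature becomes three new features, blue, green, and red, each marked with a binary value. A blue sample is encoded as blue=1, green=0, red=0.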
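The principle can be sketched with plain NumPy before turning to a library. This is only an illustration of the idea, not how scikit-learn or pandas implement it; the colors array is rebuilt by hand.

```python
import numpy as np

colors = np.array(['green', 'red', 'blue'])

# np.unique returns the sorted categories and, for each sample,
# the index of its category in that sorted array.
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of category i.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories.tolist())  # ['blue', 'green', 'red']
print(one_hot.tolist())     # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

With the integer codes, red and blue are twice as far apart as green and blue. With the one-hot vectors, every pair of colors is equally distant, so no order is implied.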

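Implementing One-Hot Encoding

Use sklearn.preprocessing.OneHotEncoder to encode the color feature; it returns a sparse matrix. Note that the categorical_features parameter used here exists only in scikit-learn releases before 0.22: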

>>> from sklearn.preprocessing import OneHotEncoder
>>> ohe = OneHotEncoder(categorical_features=[0])
>>> ohe.fit_transform(X).toarray()
array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
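Because categorical_features was removed in scikit-learn 0.22, the call above fails on newer releases. A comparable result can be obtained with ColumnTransformer. The following is a minimal sketch, assuming scikit-learn 0.20 or later; X is rebuilt by hand and the transformer names 'onehot' and 'rest' are arbitrary.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([['green', 1, 10.1],
              ['red', 2, 13.5],
              ['blue', 3, 15.3]], dtype=object)

# One-hot encode column 0 and pass columns 1 and 2 through unchanged.
ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),
    ('rest', 'passthrough', [1, 2])])

X_encoded = ct.fit_transform(X).astype(float)
print(X_encoded)
```

The columns come out in the same order as before: blue, green, red, size, price.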

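A more convenient way to one-hot encode is the pandas get_dummies function, which converts only the string columns of a DataFrame and leaves the other columns unchanged. (pandas 2.0 and later show the new columns as True/False unless dtype=int is passed.)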

>>> pd.get_dummies(df[['price', 'color', 'size']])
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0

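Correlation Among One-Hot Features

When working with a one-hot encoded dataset, keep in mind that it introduces multicollinearity: one variable can be linearly predicted from the others. In the matrix above, for example, knowing any two of blue, green, and red determines the third. This causes problems for certain operations, such as matrix inversion.

To reduce the correlation among the variables, simply drop one feature column from the one-hot encoded array. No information is lost, because the dropped category is implied when all remaining columns are 0.

In the scikit-learn version used here, OneHotEncoder offers no option for dropping a column (later releases add a drop parameter), so the result is converted to a NumPy array and sliced: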
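The effect can be checked numerically. The sketch below uses five hypothetical samples and adds an intercept column, as a linear model would; the rank of the resulting matrix is lower than its number of columns until one one-hot column is removed.

```python
import numpy as np

# Hypothetical color codes for five samples (0 = blue, 1 = green, 2 = red).
codes = np.array([1, 2, 0, 1, 0])
one_hot = np.eye(3)[codes]

# Every row of one_hot sums to 1, so the three columns add up to the
# intercept column of a linear model.
design = np.hstack([np.ones((len(codes), 1)), one_hot])

rank_full = np.linalg.matrix_rank(design)             # 4 columns
rank_reduced = np.linalg.matrix_rank(design[:, :-1])  # red column dropped
print(rank_full, rank_reduced)  # 3 3
```

Four columns with rank 3 means the matrix design.T @ design is singular and cannot be inverted; after dropping one column the remaining three are linearly independent.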

>>> ohe = OneHotEncoder(categorical_features=[0])
>>> ohe.fit_transform(X).toarray()[:, 1:]
array([[  1. ,   0. ,   1. ,  10.1],
       [  0. ,   1. ,   2. ,  13.5],
       [  0. ,   0. ,   3. ,  15.3]])
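Recent scikit-learn releases let OneHotEncoder drop a column itself through its drop parameter, which avoids the slicing step. A minimal sketch, assuming such a release is installed and encoding only the color column, rebuilt by hand:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['green'], ['red'], ['blue']])

# drop='first' omits the column of the first category in sorted order,
# which is 'blue' here, leaving the green and red columns.
ohe = OneHotEncoder(drop='first')
encoded = ohe.fit_transform(colors).toarray()
print(encoded.tolist())  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

The blue sample is now the row of all zeros.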

The pandas get_dummies function provides a drop_first parameter that drops the first feature column directly:

>>> pd.get_dummies(df[['price', 'color', 'size']],
... drop_first=True)
   price  size  color_green  color_red
0   10.1     1            1          0
1   13.5     2            0          1
2   15.3     3            0          0
