Practical Guide | Mastering Data Preprocessing in One Article

2020-04-14 09:00:50 Author: Anonymous

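Data analysis always involves data preprocessing, and the quality of the preprocessing largely determines how well the downstream model performs. So which preprocessing methods are worth knowing?

These are notes on the methods I have used for data preprocessing in real projects.

They are organized as follows:

Common methods
NumPy
Pandas
Scikit-learn
Handling text data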

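I. Common methods

1. Generating a random index sequence

import random

# Draw 5 * trainSize distinct indices from the rows after the first trainSize rows
randIndex = random.sample(range(trainSize, len(trainData_copy)), 5 * trainSize)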
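As a minimal sketch of the call above, the following draws distinct random row indices with the standard library only. The sizes train_size and total_rows are made-up values for illustration and are not taken from the original data.

```python
import random

random.seed(0)  # fixed seed so the draw is reproducible

train_size = 4    # hypothetical number of leading rows to skip
total_rows = 30   # hypothetical dataset length
k = 5 * train_size

# random.sample draws without replacement, so every index is distinct
rand_index = random.sample(range(train_size, total_rows), k)
print(len(rand_index), min(rand_index), max(rand_index))
```

Because random.sample draws without replacement, it raises ValueError when asked for more items than the range contains, so 5 * train_size must not exceed the number of remaining rows.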

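2. Counting how many times a value occurs

# titleData is a plain Python list here, so list.count applies
titleSet = set(titleData)
for i in titleSet:
    count = titleData.count(i)

Replace every non-null entry with the number of times its text occurs (a bag-of-words style word count).

titleData = allData['title'].copy()
titleSet = set(titleData)
title_counts = titleData.value_counts()  # NaN is excluded by default
for i in titleSet:
    if isNaN(i):  # isNaN is defined in item 3
        continue
    count = title_counts[i]
    titleData.replace(i, count, inplace=True)  # Series.replace takes no axis argument
allData['title'] = titleData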
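The loop above can probably be expressed more compactly by mapping each title to its frequency. This is only a sketch on a made-up all_data frame; the title_count column name is an assumption, and the result is written to a new column instead of overwriting title.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing title, one title repeated three times
all_data = pd.DataFrame(
    {"title": ["engineer", "teacher", "engineer", np.nan, "nurse", "engineer"]}
)

# value_counts skips NaN by default, so missing titles stay missing after map
counts = all_data["title"].value_counts()
all_data["title_count"] = all_data["title"].map(counts)
print(all_data)
```

Mapping avoids one replace call per distinct title, which matters when the column has many distinct values.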

3. Checking whether a value is NaN

def isNaN(num):
    # NaN is the only value that is not equal to itself
    return num != num
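A short check of the self-inequality trick, using only the standard library. The helper name is_nan is illustrative; math.isnan is shown for comparison.

```python
import math

def is_nan(value):
    # NaN compares unequal to everything, including itself
    return value != value

nan = float("nan")
checks = [is_nan(nan), is_nan(1.5), is_nan("text"), is_nan(None)]
float_check = math.isnan(nan)  # works for floats, raises TypeError for strings
print(checks, float_check)
```

The comparison-based version accepts strings and None without raising, which is convenient when a column mixes text with missing values.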

4. Displaying Matplotlib figures inline in Jupyter

%matplotlib inline

5. Handling dates

birth = trainData['birth_date']
birthDate = pd.to_datetime(birth)
end = pd.Timestamp(2020, 3, 5)  # pd.datetime is no longer available in recent pandas
# Compute the elapsed time (a timedelta64 Series)
birthDay = end - birthDate
# Convert timedelta64 to an integer number of days
trainData['birth_date'] = birthDay.dt.days
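A small worked example of the date arithmetic above, on made-up birth dates. The frame name train and the column age_days are assumptions for illustration.

```python
import pandas as pd

# Hypothetical birth dates stored as strings
train = pd.DataFrame({"birth_date": ["2020-03-01", "2019-03-05", "2000-01-01"]})

end = pd.Timestamp(2020, 3, 5)
# Subtracting datetimes yields timedeltas; .dt.days extracts whole days as integers
train["age_days"] = (end - pd.to_datetime(train["birth_date"])).dt.days
print(train)
```

The second row spans 366 days because 29 February 2020 falls inside the interval.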

6. Computing the mean (or other statistics) across several columns

trainData['operate_able'] = trainData.iloc[:, 20:53].mean(axis=1)
trainData['local_able'] = trainData.iloc[:, 53:64].mean(axis=1)

7. Splitting a column into indicator columns (one-hot encoding)

train_test = pd.get_dummies(train_test, columns=["Embarked"])
train_test = pd.get_dummies(train_test, columns=['SibSp', 'Parch', 'SibSp_Parch'])

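8. Extracting content with regular expressions

df['Name'].str.extract is the extraction function; it is used together with a regular expression and returns the text captured by the group.

# Keep the text after the comma, then the text before the first period (the title, e.g. "Mr")
train_test['Name1'] = train_test['Name'].str.extract(r'.+,(.+)', expand=False).str.extract(r'^(.+?)\.', expand=False).str.strip()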
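To illustrate the extraction, here is a sketch that pulls the title out of Titanic-style names in a single pass. The sample names and the single combined pattern are assumptions for illustration; the original chains two extract calls instead.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
])

# Capture everything between the comma and the first period.
# expand=False returns a Series, so .str methods can be chained.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```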

9. Flagging rows according to whether a value is missing

train_test.loc[train_test["Age"].isnull(), "age_nan"] = 1
train_test.loc[train_test["Age"].notnull(), "age_nan"] = 0

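10. Binning by interval (discretization)

pd.cut returns the bin that each value of x falls into. The bins are half-open intervals, closed on the right by default, so with the bins below an age of exactly 0 falls outside the first bin and becomes NaN unless include_lowest=True is passed.

# Split age into five stages: up to 10, 10-18, 18-30, 30-50, and over 50
train_test['Age'] = pd.cut(train_test['Age'], bins=[0, 10, 18, 30, 50, 100], labels=[1, 2, 3, 4, 5])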
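The edge behavior of the bins is easy to get wrong, so here is a small sketch on made-up ages showing which stage each boundary value lands in, with and without include_lowest.

```python
import pandas as pd

ages = pd.Series([0, 5, 10, 18, 25, 64])
bins = [0, 10, 18, 30, 50, 100]
labels = [1, 2, 3, 4, 5]

# Default: intervals are (0, 10], (10, 18], ... so 0 is not covered
stage = pd.cut(ages, bins=bins, labels=labels)
# include_lowest=True closes the first interval on the left as well
stage_incl = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(stage.tolist())
print(stage_incl.tolist())
```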

II. NumPy

1. Getting a list of indices with where

# Positions of the rows where acc_now_delinq equals 1
delLocal = np.array(np.where(np.array(trainData['acc_now_delinq']) == 1))
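With a single condition, np.where returns a tuple holding one index array per dimension, which is why the original wraps it in np.array. A sketch on a made-up flags array:

```python
import numpy as np

flags = np.array([0, 1, 0, 1, 1])

# The tuple has one entry for a 1-D input; [0] unpacks the index array
positions = np.where(flags == 1)[0]
# Wrapping the whole tuple instead gives a 2-D array of shape (1, n_matches)
wrapped = np.array(np.where(flags == 1))
print(positions, wrapped.shape)
```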

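2. permutation(x): randomly permute a sequence, or return a permuted range

If x is an integer, the result is a shuffled np.arange(x). If x is a multi-dimensional array, it is shuffled only along its first axis.

import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]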
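The point of indexing both arrays with the same permutation is that features and labels stay aligned. A small sketch with made-up arrays; it uses NumPy's newer Generator API (np.random.default_rng) in place of np.random.permutation, which is a substitution on my part.

```python
import numpy as np

rng = np.random.default_rng(42)

X = np.arange(10).reshape(5, 2)  # five hypothetical samples, two features each
y = X[:, 0] // 2                 # label derived from the row, so pairing can be checked

idx = rng.permutation(len(X))    # shuffled copy of [0, 1, 2, 3, 4]
X_shuf, y_shuf = X[idx], y[idx]
print(idx, y_shuf)
```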

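3. numpy.argmax: return the indices of the maximum values along an axis

np.argmax(some_digit_scores)

Parameters
a : array_like; the input array.
axis : int, optional; by default the index is into the flattened array, otherwise along the specified axis.
out : array, optional; if provided, the result is inserted into this array. It should have the appropriate shape and dtype.

4. numpy.dot(a, b, out=None): compute the dot product of two arrays

>>> np.dot(3, 4)
12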

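5. numpy.random.randn: draw samples from the standard normal distribution

>>> np.random.seed(42)  # optionally fix the random seed
>>> theta = np.random.randn(2, 1)
>>> theta
array([[ 0.49671415],
       [-0.1382643 ]])

Parameters
d0, d1, ..., dn : int, optional; the dimensions of the returned array, which must be non-negative. If no argument is given, a single Python float is returned.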

6. numpy.linspace: return evenly spaced samples over the interval [start, stop]

# poly_features, lin_reg, X, y and save_fig come from earlier steps that are not shown here
X_new = np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
save_fig("quadratic_predictions_plot")
plt.show()

Parameters
start : scalar; the starting value of the sequence.
stop : scalar; the end value of the sequence.
num : int, optional; the number of samples to generate, 50 by default.
endpoint : bool, optional; if True, stop is included; otherwise it is excluded, giving the interval [start, stop). Defaults to True.
dtype : dtype, optional; the type of the output array, inferred from the input if not given.
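A minimal sketch of how the endpoint parameter changes the result, plus the reshape used above to obtain a column of inputs. The variable names are illustrative.

```python
import numpy as np

# Closed interval [-3, 3]: both ends are included
closed = np.linspace(-3, 3, 7)
# endpoint=False gives the half-open interval [-3, 3)
half_open = np.linspace(-3, 3, 6, endpoint=False)
# Estimators generally expect 2-D input, hence the (100, 1) column
column = np.linspace(-3, 3, 100).reshape(100, 1)
print(closed, half_open, column.shape)
```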

III. Pandas

1. Setting the maximum number of rows and columns displayed in Jupyter Notebook

pd.set_option('display.max_columns', 64)
pd.set_option('display.max_rows', 1000000)

2. Reading data

import os
import pandas as pd

homePath = 'game'
trainPath = os.path.join(homePath, 'train.csv')
testPath = os.path.join(homePath, 'test.csv')
trainData = pd.read_csv(trainPath)
testData = pd.read_csv(testPath)

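3. Quick preview of the data

~head(): returns the first five rows for a quick look.
~info(): shows the total number of rows, the type of each attribute, and the number of non-null values.
~value_counts(): counts how many times each value occurs.
~hist(): displays numerical data as histograms.
~describe(): summarizes the numerical attributes, for example count, mean, standard deviation, minimum, maximum, and the 25%/50%/75% percentiles.

4. Copying data

mthsMajorTest = fullData.copy()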

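5. Correlation

Computing the correlation matrix

corrMatrix = trainData.corr()
corrMatrix['acc_now_delinq'].sort_values(ascending=False)  # sort in descending order

Plotting the correlation matrix

import numpy
import matplotlib.pyplot as plt

correlations = data.corr()  # correlation matrix between the variables
# plot correlation matrix
fig = plt.figure()  # create a figure object
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)  # draw the heat map on a scale from -1 to 1
fig.colorbar(cax)  # add a color bar for the heat map
ticks = numpy.arange(0, 9, 1)  # 0 through 8 in steps of 1 (assumes nine variables)
ax.set_xticks(ticks)  # set the tick positions
ax.set_yticks(ticks)
ax.set_xticklabels(names)  # label the x axis; names is the list of column names
ax.set_yticklabels(names)
plt.show()

The color of each cell encodes the correlation coefficient: the closer it is to either end of the color bar, the stronger the correlation between the two variables.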

6. Dropping a column

trainData.drop('acc_now_delinq', axis=1, inplace=True)
# Note: this does not release the memory held by the column
del fullData['member_id']

7. Converting the element type of a list

termData = list(map(int, termData))

8. Replacing values

gradeData.replace(['A', 'B', 'C', 'D', 'E', 'F', 'G'], [7, 6, 5, 4, 3, 2, 1], inplace=True)

9. Merging datasets

# DataFrame.append was removed in pandas 2.0; on older versions this was
# allData = trainData.append(testData)
allData = pd.concat([trainData, testData], axis=0, ignore_index=True)

10. Splitting strings

termData = termData.str.split(' ', n=2, expand=True)[1]
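Items 7 and 10 fit together naturally: split a text column, then convert the extracted piece to integers. The sketch below assumes values shaped like " 36 months" with a leading space, which is a guess about the original data that would explain why column 1 is selected.

```python
import pandas as pd

term = pd.Series([" 36 months", " 60 months"])  # hypothetical values

# Splitting on a single space gives ['', '36', 'months'], so the number is in column 1
parts = term.str.split(" ", n=2, expand=True)
months = list(map(int, parts[1]))
print(months)
```

If the real values have no leading space, the number would be in column 0 instead.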

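11. ~where: the equivalent of the ternary operator (? :)

Modifies each value based on a condition evaluated on that value, much like the ternary operator (? :).

# Cap income_cat at 5: keep values below 5 and replace the rest with 5.0.
# Assigning the result back is more reliable than inplace=True on a selected column.
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

cond : where True, the original value is kept; where False, it is replaced with the second argument, other.
other : the replacement value.
inplace : whether to perform the operation on the data in place.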

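12. np.ceil(x): rounding each element up

x : the input data
The result holds, for each element, the smallest integer greater than or equal to it, returned as a float. np.ceil does not cap values; its optional second positional argument is an output array, not an upper limit.

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)  # divide each element by 1.5, then round up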
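To make the behavior of np.ceil concrete, and to show how it combines with the where call from item 11, here is a sketch on made-up median_income values.

```python
import numpy as np
import pandas as pd

median_income = pd.Series([0.9, 1.5, 3.2, 8.3, 15.0])  # hypothetical incomes

# Step 1: scale and round up, giving categories 1, 1, 3, 6, 10
income_cat = np.ceil(median_income / 1.5)
# Step 2: keep values below 5 and replace everything else with 5.0
income_cat = income_cat.where(income_cat < 5, 5.0)
print(income_cat.tolist())
```

Rounding up creates the categories; the cap is applied by where, not by np.ceil.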

13. ~loc[]: a purely label-based indexer

strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]

14. ~dropna: return the data that remains after dropping rows with missing values

sample_incomplete_rows.dropna(subset=["total_bedrooms"])

15. ~fillna: fill missing values using the specified method

# Fill with the median
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)
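The two options for missing values, dropping rows versus filling with the median, can be compared side by side. This sketch uses a made-up rooms frame; the column names echo the housing example but the values are invented.

```python
import numpy as np
import pandas as pd

rooms = pd.DataFrame({
    "total_bedrooms": [2.0, np.nan, 4.0, 9.0],
    "households": [1, 2, 3, 4],
})

# Option 1: drop rows where total_bedrooms is missing
dropped = rooms.dropna(subset=["total_bedrooms"])

# Option 2: fill with the median of the observed values (NaN is ignored)
median = rooms["total_bedrooms"].median()
filled = rooms["total_bedrooms"].fillna(median)
print(len(dropped), median, filled.tolist())
```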

16. Resetting the index

# The old index is kept as a new column; pass drop=True to discard it
allData = subTrain.reset_index()

IV. Scikit-learn

1. Standardizing data

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(mthsMajorTrain)  # learn the mean and standard deviation from the training data only
mthsMajorTrain_d = ss.transform(mthsMajorTrain)
mthsMajorTest_d = ss.transform(mthsMajorTest)
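The fit-on-train, transform-both pattern above can be verified on a tiny made-up array: the training columns end up with zero mean and unit variance, while the test row is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # hypothetical data
test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(train)   # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and std

print(scaler.mean_, train_scaled.std(axis=0), test_scaled)
```

Fitting the scaler on the test data as well would leak information from the test set into preprocessing.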

2. Predicting missing values

from sklearn import linear_model

lin = linear_model.BayesianRidge()
lin.fit(mthsMajorTrain_d, mthsMajorTrainLabel)
# Fill the missing entries with the model's predictions
trainData.loc[trainData['mths_since_last_major_derog'].isnull(), 'mths_since_last_major_derog'] = lin.predict(mthsMajorTest_d)

3. Feature importance from LightGBM

import lightgbm as lgb

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'auc'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
# Use the first 400,000 rows for training and the rest for validation
lgb_train = lgb.Dataset(totTrain[:400000], totLabel[:400000])
lgb_eval = lgb.Dataset(totTrain[400000:], totLabel[400000:])
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                # on LightGBM versions before 4.0 this was early_stopping_rounds=5
                callbacks=[lgb.early_stopping(stopping_rounds=5)])
lgb.plot_importance(gbm, figsize=(10, 10))

For missing values, a common approach is to hand-pick a few important features and use them to predict the missing entries.

upFeatures = ['revol_util', 'revol_bal', 'annual_inc']  # features selected in the previous step
totTrain = totTrain[upFeatures]
# Rows whose total_rev_hi_lim is missing; copy so the fill below does not touch trainData
totTest = trainData.loc[trainData['total_rev_hi_lim'].isnull()][upFeatures].copy()
totTest['annual_inc'] = totTest['annual_inc'].fillna(-9999)

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(totTrain)
train_d = ss.transform(totTrain)
test_d = ss.transform(totTest)

from sklearn import linear_model

lin = linear_model.BayesianRidge()
lin.fit(train_d, totLabel)
trainData.loc[trainData['total_rev_hi_lim'].isnull(), 'total_rev_hi_lim'] = lin.predict(test_d)

4. Filling with the median

trainData['total_acc'] = trainData['total_acc'].fillna(trainData['total_acc'].median())

5. Filling with the mean

trainData['total_acc'] = trainData['total_acc'].fillna(trainData['total_acc'].mean())

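6. Handling missing values with SimpleImputer (formerly Imputer)

All attributes must be numeric.

# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

# Choose the value used to replace missing entries, here the median
imputer = SimpleImputer(strategy="median")
# Fit the imputer to the data
imputer.fit(housing_num)
# The learned medians are stored in statistics_
imputer.statistics_
# Replace the missing values
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=list(housing.index.values))
# Preview
housing_tr.loc[sample_incomplete_rows.index.values]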
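A self-contained sketch of the same median imputation on a made-up numeric frame, using SimpleImputer from sklearn.impute (the replacement for the older Imputer class). The column values and index labels are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing_num = pd.DataFrame(
    {
        "total_bedrooms": [2.0, np.nan, 4.0, 9.0],
        "population": [100.0, 300.0, np.nan, 200.0],
    },
    index=[10, 11, 12, 13],
)

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)  # returns a plain NumPy array

# Wrap the array back into a DataFrame, keeping column names and index
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
print(imputer.statistics_)
print(housing_tr)
```

The medians are computed per column, so each missing cell receives the median of its own attribute.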

V. Handling text data

1. pandas.factorize: encode input values as an enumerated type or categorical variable

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# Output
# 17606     <1H OCEAN
# 18632     <1H OCEAN
# 14650    NEAR OCEAN
# 3230         INLAND
# 3555      <1H OCEAN
# 19480        INLAND
# 8879      <1H OCEAN
# 13685        INLAND
# 4937      <1H OCEAN
# 4861      <1H OCEAN
# Name: ocean_proximity, dtype: object
housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# Output
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

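2. Parameters

values : ndarray (1-d); the sequence to encode.
sort : boolean, default False; whether to sort the unique values.
na_sentinel : int, default -1; the code assigned to values that are not found (missing). Newer pandas releases replace it with use_na_sentinel.
size_hint : a hint to the hashtable sizer.

3. Return values

labels : the indexer to the original array.
uniques : ndarray (1-d) or Index; the unique values. An Index is returned when the input is an Index or a Series.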

4. OneHotEncoder: encode integer features as one-hot vectors

The return value is a sparse matrix.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot

Note that fit_transform expects a two-dimensional array, which is why the data is reshaped here.
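The two steps, factorize and then one-hot encode, can be followed end to end on a four-row made-up Series. The toarray() call is only there to make the sparse result visible; for large data the sparse form is usually preferable.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat = pd.Series(["<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"])

# Step 1: integer codes in order of first appearance, plus the unique values
codes, categories = cat.factorize()

# Step 2: one column per code; reshape(-1, 1) turns the 1-D codes into a 2-D column
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1))
dense = one_hot.toarray()

print(codes, list(categories))
print(dense)
```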

5. A complete example of handling a text feature

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 17606     <1H OCEAN
# 18632     <1H OCEAN
# 14650    NEAR OCEAN
# 3230         INLAND
# 3555      <1H OCEAN
# 19480        INLAND
# 8879      <1H OCEAN
# 13685        INLAND
# 4937      <1H OCEAN
# 4861      <1H OCEAN
# Name: ocean_proximity, dtype: object
housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)
housing_categories
# Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
print(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot
# [[0]
#  [0]
#  [1]
#  ...,
#  [2]
#  [0]
#  [3]]
# <16512x5 sparse matrix of type '<class 'numpy.float64'>'
#  with 16512 stored elements in Compressed Sparse Row format>

6. LabelEncoder: label encoding

LabelEncoder is a utility class for normalizing labels so that their encoded values fall in the range [0, n_classes-1]. Put simply, it assigns consecutive integer codes to non-consecutive numbers or to text.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also convert non-numerical labels to numerical labels, as long as they are hashable and comparable:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

7. LabelBinarizer: label binarization

LabelBinarizer is a utility class for creating a label indicator matrix from a list of multi-class labels:

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])

For instances that carry multiple labels, use MultiLabelBinarizer:

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])

Bookmark this article after reading so you can come back to these techniques whenever you need them.
