[Kaggle Feature Engineering] Python basic code
1. Feature Engineering
# Feature Engineering
# Example: improve performance through feature engineering
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X = df.copy()
y = X.pop("CompressiveStrength")

# Train and score baseline model
# (note: criterion="mae" was renamed to "absolute_error" in scikit-learn >= 1.0)
baseline = RandomForestRegressor(criterion="mae", random_state=0)
baseline_score = cross_val_score(baseline, X, y, cv=5, scoring="neg_mean_absolute_error")
baseline_score = -1 * baseline_score.mean()
print(f"MAE Baseline Score: {baseline_score:.4}") # = 8.232
# Create synthetic features
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
X["WtrCmtRatio"] = X["Water"] / X["Cement"]
# Train and score model on dataset with additional ratio features
model = RandomForestRegressor(criterion="mae", random_state=0)
score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
score = -1 * score.mean()
print(f"MAE Score with Ratio Features: {score:.4}") # = 7.948
2. Mutual Information (MI)
# Mutual Information (a lot like correlation)
# 1. Construct a ranking with a feature utility metric
# - The Mutual Information between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other.
# - When MI is zero, the quantities are independent.
# - Conversely, in theory there's no upper bound to what MI can be. In practice though values above 2.0 or so are uncommon. (Mutual information is a logarithmic quantity, so it increases very slowly.)
# The scikit-learn algorithm for MI treats discrete features differently from continuous features.
# Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding.
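# A standalone illustration (not from the course): MI is near zero for independent
# variables and clearly positive when one variable determines the other.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
noise = rng.normal(size=1000)
y_dependent = x[:, 0] ** 2 + 0.1 * noise   # y depends (non-linearly) on x
y_independent = noise                       # y has nothing to do with x

print(mutual_info_regression(x, y_dependent))    # noticeably > 0
print(mutual_info_regression(x, y_independent))  # close to 0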
X = df.copy()
y = X.pop("price")
y[y == "?"] = 0

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

# Double-check: all discrete features should now have integer dtypes
discrete_features = X.dtypes == int
print(discrete_features)
# Scikit-learn has two mutual information metrics in its feature_selection module:
# - one for real-valued targets (mutual_info_regression) and
# - one for categorical targets (mutual_info_classif). Our target, price, is real-valued.
from sklearn.feature_selection import mutual_info_regression
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
mi_scores = make_mi_scores(X, y, discrete_features)
print(mi_scores[::3])
import seaborn as sns
sns.relplot(x='horsepower', y='price', data=df)
# Before deciding a feature is unimportant from its MI score, it's good to investigate any possible interaction effects -- domain knowledge can offer a lot of guidance here.
df2 = df.copy()
df2['fuel-type'] = df2['fuel-type'].astype('str')
print(df2['fuel-type'].dtype)
#sns.lmplot(x="horsepower", y="price", hue="fuel-type", data=df2)
#features = ["YearBuilt", "MoSold", "ScreenPorch"]
#sns.relplot(x="value", y="SalePrice", col="variable", data=df.melt(id_vars="SalePrice", value_vars=features), facet_kws=dict(sharex=False))
#sns.catplot(x="BldgType", y="SalePrice", data=df, kind="boxen")
#feature = "GrLivArea"
#sns.lmplot(x=feature, y="SalePrice", hue="BldgType", col="BldgType",
# data=df, scatter_kws={"edgecolor": 'w'}, col_wrap=3, height=4)
3. Creating Features
# Creating Features
import numpy as np

# 1. Mathematical Transforms
autos["stroke_ratio"] = autos.stroke / autos.bore
autos["displacement"] = (np.pi * ((0.5 * autos.bore) ** 2) * autos.stroke * autos.num_of_cylinders)
# If the feature has 0.0 values, use np.log1p (log(1+x)) instead of np.log
accidents["LogWindSpeed"] = accidents.WindSpeed.apply(np.log1p) # Apply logarithm to highly skewed data
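# Quick check of why log1p matters when zeros are present: np.log(0) is -inf,
# while np.log1p(0) is exactly 0 (log1p(x) computes log(1 + x)).
print(np.log1p(0.0))               # 0.0
print(np.log1p([0.0, 1.0, 10.0]))  # approx. [0.0, 0.693, 2.398]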
# 2. Counts
roadway_features = ["Amenity", "Bump", "Crossing", "GiveWay",
                    "Junction", "NoExit", "Railway", "Roundabout", "Station", "Stop",
                    "TrafficCalming", "TrafficSignal"]
accidents["RoadwayFeatures"] = accidents[roadway_features].sum(axis=1)

components = ["Cement", "BlastFurnaceSlag", "FlyAsh", "Water",
              "Superplasticizer", "CoarseAggregate", "FineAggregate"]
concrete["Components"] = concrete[components].gt(0).sum(axis=1)
# 3. Building-Up and Breaking-Down Features
customer[["Type", "Level"]] = (  # Create two new features
    customer["Policy"]           # from the Policy feature
    .str                         # through the string accessor
    .split(" ", expand=True)     # by splitting on " "
                                 # and expanding the result into separate columns
)
autos["make_and_style"] = autos["make"] + "_" + autos["body_style"]
# 4. Group Transforms
customer["AverageIncome"] = (
    customer.groupby("State")  # for each state
    ["Income"]                 # select the income
    .transform("mean")         # and compute its mean
)
customer["StateFreq"] = (
    customer.groupby("State")
    ["State"]
    .transform("count")
    / customer.State.count()
)
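# Caveat worth noting (an extension sketch, not part of the snippet above): when you
# have train/validation splits, it is safer to learn the group statistic on the
# training split only and merge it into the validation split, so the encoding does
# not leak validation information. Reusing the State/Income columns from above:
df_train = customer.sample(frac=0.5, random_state=0)
df_valid = customer.drop(df_train.index)

df_train["AverageIncome"] = df_train.groupby("State")["Income"].transform("mean")
df_valid = df_valid.merge(
    df_train[["State", "AverageIncome"]].drop_duplicates(),
    on="State",
    how="left",
)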
# Clustering With K-Means
# Create cluster feature (X here is assumed to already hold the numeric columns
# used for clustering, e.g. the coordinate features)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=6)
X["Cluster"] = kmeans.fit_predict(X)
X["Cluster"] = X["Cluster"].astype("category")
sns.relplot(x="Longitude", y="Latitude", hue="Cluster", data=X, height=6)
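# Related trick (an extension sketch, not part of the snippet above): the distance
# from each row to every centroid can also serve as features. KMeans.transform
# returns these distances; drop the categorical "Cluster" column first so the
# same numeric columns used at fit time are passed in.
X_cd = kmeans.transform(X.drop(columns="Cluster"))
X_cd = pd.DataFrame(X_cd, columns=[f"Centroid_{i}" for i in range(X_cd.shape[1])], index=X.index)
X = X.join(X_cd)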
4. PCA, Smoothing, Encoding
# PCA (Principal Component Analysis)
# - Purpose: Dimensionality Reduction, Anomaly Detection, Noise Reduction, Decorrelation
features = ["highway_mpg", "engine_size", "horsepower", "curb_weight"]
X = df.copy()
y = X.pop('price')
X = X.loc[:, features]
# Standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
from sklearn.decomposition import PCA
# Create principal components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Convert to dataframe
component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)
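# plot_variance below is a helper defined in the Kaggle course notebook, not part of
# scikit-learn. A minimal sketch of what such a helper might look like, assuming it
# plots per-component and cumulative explained variance:
import matplotlib.pyplot as plt

def plot_variance(pca, width=8, dpi=100):
    evr = pca.explained_variance_ratio_
    grid = np.arange(1, len(evr) + 1)
    fig, axs = plt.subplots(1, 2, figsize=(width, 4), dpi=dpi)
    axs[0].bar(grid, evr)  # variance explained by each component
    axs[0].set(xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0))
    axs[1].plot(np.r_[0, grid], np.r_[0, np.cumsum(evr)], "o-")  # cumulative variance
    axs[1].set(xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0))
    fig.tight_layout()
    return axs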
# Look at explained variance
plot_variance(pca)
mi_scores = make_mi_scores(X_pca, y, discrete_features=False)
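# The component loadings (not shown in the original snippet) reveal which original
# features each principal component weights, which is how ratio features like the
# one below can be motivated:
loadings = pd.DataFrame(
    pca.components_.T,   # rows = original features, columns = components
    columns=component_names,
    index=X.columns,
)
print(loadings)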
df["sports_or_wagon"] = X.curb_weight / X.horsepower
sns.regplot(x="sports_or_wagon", y='price', data=df, order=2)
# Target Encoding: Replaces a feature's categories with some number derived from the target
autos["make_encoded"] = autos.groupby("make")["price"].transform("mean")
# Smoothing: blend the in-category estimate with the overall mean to avoid
# overfitting on rare categories (categories that occur only a few times, or
# values unseen at encoding time)
weight = n / (n + m)
# where n is the total number of times that category occurs in the data, and the parameter m
# determines the "smoothing factor". Larger values of m put more weight on the overall estimate.
encoding = weight * in_category + (1 - weight) * overall
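# A hand-worked illustration of the blend above (hypothetical numbers, not from the
# course): a category seen n=3 times with an in-category mean of 1.0, an overall
# mean of 6.0, and smoothing factor m=2.
n, m = 3, 2.0
in_category, overall = 1.0, 6.0
weight = n / (n + m)                                      # 0.6
encoding = weight * in_category + (1 - weight) * overall  # 0.6*1.0 + 0.4*6.0
print(encoding)                                           # 3.0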
'''Use Cases for Target Encoding
- Target encoding is great for:
- High-cardinality features: A feature with a large number of categories can be troublesome to
encode: a one-hot encoding would generate too many features and alternatives, like a label
encoding, might not be appropriate for that feature. A target encoding derives numbers for the
categories using the feature's most important property: its relationship with the target.
- Domain-motivated features: From prior experience, you might suspect that a categorical feature
should be important even if it scored poorly with a feature metric. A target encoding can help
reveal a feature's true informativeness.
'''
# Example
#With over 3000 categories, the Zipcode feature makes a good candidate for target encoding, and the size of this dataset (over one-million rows) means we can spare some data to create the encoding. We'll start by creating a 25% split to train the target encoder.
X = df.copy()
y = X.pop('Rating')
X_encode = X.sample(frac=0.25)
y_encode = y[X_encode.index]
X_pretrain = X.drop(X_encode.index)
y_train = y[X_pretrain.index]
#The category_encoders package in scikit-learn-contrib implements an m-estimate encoder, which we'll use to encode our Zipcode feature.
from category_encoders import MEstimateEncoder
# Create the encoder instance. Choose m to control noise.
encoder = MEstimateEncoder(cols=["Zipcode"], m=5.0)
# Fit the encoder on the encoding split.
encoder.fit(X_encode, y_encode)
# Encode the Zipcode column to create the final training data
X_train = encoder.transform(X_pretrain)
#Let's compare the encoded values to the target to see how informative our encoding might be.
import matplotlib.pyplot as plt

plt.figure(dpi=90)
# note: distplot is deprecated in newer seaborn; sns.histplot(y, stat="density") is the modern equivalent
ax = sns.distplot(y, kde=False, norm_hist=True)
ax = sns.kdeplot(X_train.Zipcode, color='r', ax=ax)
ax.set_xlabel("Rating")
ax.legend(labels=['Zipcode', 'Rating'])
#The distribution of the encoded Zipcode feature roughly follows the distribution of the actual ratings, meaning that movie-watchers differed enough in their ratings from zipcode to zipcode that our target encoding was able to capture useful information.