
[Kaggle Feature Engineering] Python basic code

heave_17 2021. 4. 28. 22:00

1. Feature Engineering

# Feature Engineering

# Example: improve model performance with a few engineered ratio features
# (df is assumed to hold the Concrete Compressive Strength dataset)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X = df.copy()
y = X.pop("CompressiveStrength")

# Train and score baseline model
baseline = RandomForestRegressor(criterion="mae", random_state=0)  # on scikit-learn >= 1.2 use criterion="absolute_error"
baseline_score = cross_val_score(baseline, X, y, cv=5, scoring="neg_mean_absolute_error")
baseline_score = -1 * baseline_score.mean()

print(f"MAE Baseline Score: {baseline_score:.4}") # = 8.232

# Create synthetic features
X["FCRatio"] = X["FineAggregate"] / X["CoarseAggregate"]
X["AggCmtRatio"] = (X["CoarseAggregate"] + X["FineAggregate"]) / X["Cement"]
X["WtrCmtRatio"] = X["Water"] / X["Cement"]

# Train and score model on dataset with additional ratio features
model = RandomForestRegressor(criterion="mae", random_state=0)
score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
score = -1 * score.mean()

print(f"MAE Score with Ratio Features: {score:.4}") # = 7.948

2. Mutual Information (MI)

# Mutual Information (like correlation, but it can detect any kind of relationship, not just linear ones)

# 1. Construct a ranking with a feature utility metric
#  - The Mutual Information between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other.
#  - When MI is zero, the quantities are independent.
#  - Conversely, in theory there's no upper bound to what MI can be. In practice though values above 2.0 or so are uncommon. (Mutual information is a logarithmic quantity, so it increases very slowly.)

# The scikit-learn algorithm for MI treats discrete features differently from continuous features.
# Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding.

import pandas as pd

X = df.copy()
y = X.pop("price")
y = pd.to_numeric(y.replace("?", 0))  # replace the "?" missing-value marker and make sure the target is numeric

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

# Double-check: all discrete features should now have integer dtypes
discrete_features = X.dtypes == int
print(discrete_features)

# Scikit-learn has two mutual information metrics in its feature_selection module: 
#  - one for real-valued targets (mutual_info_regression) and 
#  - one for categorical targets (mutual_info_classif). Our target, price, is real-valued.

from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
print(mi_scores[::3])
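
# To visualize the ranking as a bar chart, a small helper can be used.
# (A minimal sketch: the name plot_mi_scores and its styling are illustrative, not defined elsewhere in this post.)
import matplotlib.pyplot as plt

def plot_mi_scores(scores):
    # Horizontal bar chart of MI scores, highest score at the top
    scores = scores.sort_values(ascending=True)
    plt.figure(dpi=100, figsize=(8, 5))
    plt.barh(scores.index, scores.values)
    plt.title("Mutual Information Scores")

plot_mi_scores(mi_scores)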

import seaborn as sns

sns.relplot(x='horsepower', y='price', data=df)

# Before deciding a feature is unimportant from its MI score, it's good to investigate possible interaction effects; domain knowledge can offer a lot of guidance here.
df2 = df.copy()
df2['fuel-type'] = df2['fuel-type'].astype('str')
print(df2['fuel-type'].dtype)
#sns.lmplot(x="horsepower", y="price", hue="fuel-type", data=df2)
#features = ["YearBuilt", "MoSold", "ScreenPorch"]
#sns.relplot(x="value", y="SalePrice", col="variable", data=df.melt(id_vars="SalePrice", value_vars=features), facet_kws=dict(sharex=False))
#sns.catplot(x="BldgType", y="SalePrice", data=df, kind="boxen")
#feature = "GrLivArea"
#sns.lmplot(x=feature, y="SalePrice", hue="BldgType", col="BldgType",
#           data=df, scatter_kws={"edgecolor": 'w'}, col_wrap=3, height=4)

3. Creating Features

# Creating Features

import numpy as np

# 1. Mathematical Transforms
autos["stroke_ratio"] = autos.stroke / autos.bore
autos["displacement"] = (np.pi * ((0.5 * autos.bore) ** 2) * autos.stroke * autos.num_of_cylinders)
# If the feature has 0.0 values, use np.log1p (log(1+x)) instead of np.log
accidents["LogWindSpeed"] = accidents.WindSpeed.apply(np.log1p) # Apply logarithm to highly skewed data

# 2. Counts
roadway_features = ["Amenity", "Bump", "Crossing", "GiveWay",
                    "Junction", "NoExit", "Railway", "Roundabout", "Station", "Stop",
                    "TrafficCalming", "TrafficSignal"]
accidents["RoadwayFeatures"] = accidents[roadway_features].sum(axis=1)
components = [ "Cement", "BlastFurnaceSlag", "FlyAsh", "Water",
               "Superplasticizer", "CoarseAggregate", "FineAggregate"]
concrete["Components"] = concrete[components].gt(0).sum(axis=1)

# 3. Building-Up and Breaking-Down Features
customer[["Type", "Level"]] = (  # Create two new features
    customer["Policy"]           # from the Policy feature
    .str                         # through the string accessor
    .split(" ", expand=True)     # by splitting on " "
                                 # and expanding the result into separate columns
)
autos["make_and_style"] = autos["make"] + "_" + autos["body_style"]

# 4. Group Transforms
customer["AverageIncome"] = (
    customer.groupby("State")  # for each state
    ["Income"]                 # select the income
    .transform("mean")         # and compute its mean
)
customer["StateFreq"] = (
    customer.groupby("State")
    ["State"]
    .transform("count")
    / customer.State.count()
)

# Clustering With K-Means

from sklearn.cluster import KMeans

# Create cluster feature (X is assumed to contain the Longitude/Latitude columns plotted below)
kmeans = KMeans(n_clusters=6)
X["Cluster"] = kmeans.fit_predict(X)
X["Cluster"] = X["Cluster"].astype("category")
sns.relplot(x="Longitude", y="Latitude", hue="Cluster", data=X, height=6)

4. PCA, Smoothing, Encoding

# PCA (Principal Component Analysis)
#  - Purpose: Dimensionality Reduction, Anomaly Detection, Noise Reduction, Decorrelation

features = ["highway_mpg", "engine_size", "horsepower", "curb_weight"]
X = df.copy()
y = X.pop('price')
X = X.loc[:, features]

# Standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

from sklearn.decomposition import PCA

# Create principal components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Convert to dataframe
component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)
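
# It can also help to inspect the loadings (each original feature's contribution to each component);
# contrasts in the loadings can suggest ratio features like the "sports_or_wagon" ratio used below.
# Minimal sketch, reusing pca, component_names, and X from above:
loadings = pd.DataFrame(
    pca.components_.T,        # transpose so rows correspond to features
    columns=component_names,  # one column per principal component
    index=X.columns,
)
print(loadings)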

# Look at explained variance (plot_variance is a plotting helper; a sketch is given below)
plot_variance(pca)

# MI scores of the principal components (reuses make_mi_scores from the MI section)
mi_scores = make_mi_scores(X_pca, y, discrete_features=False)

# A ratio feature suggested by the component loadings
df["sports_or_wagon"] = X.curb_weight / X.horsepower
sns.regplot(x="sports_or_wagon", y='price', data=df, order=2)
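
# plot_variance (used above) is not defined in this post; it is a helper that plots the
# explained-variance ratios. A minimal sketch of such a helper could look like this:
import numpy as np
import matplotlib.pyplot as plt

def plot_variance(pca, width=8, dpi=100):
    # Plot explained variance and cumulative variance of a fitted PCA
    fig, axs = plt.subplots(1, 2, figsize=(width, width / 2), dpi=dpi)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0))
    axs[1].plot(np.r_[0, grid], np.r_[0, np.cumsum(evr)], "o-")
    axs[1].set(xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0))
    return axs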

# Target Encoding: Replaces a feature's categories with some number derived from the target

autos["make_encoded"] = autos.groupby("make")["price"].transform("mean")

# Smoothing: blend the in-category mean with the overall mean to prevent overfitting
# on rare categories (and to handle categories that are missing at prediction time):
#   weight = n / (n + m)
#   encoding = weight * in_category + (1 - weight) * overall
# where n is the number of times the category occurs in the data and m is the
# "smoothing factor"; larger values of m put more weight on the overall estimate.
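
# As a rough illustration of the formula above, a hand-rolled smoothed mean encoding in pandas
# might look like this (the function name and the autos/make/price usage are just for illustration):
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, m=5.0):
    # Blend each category's target mean with the overall target mean
    overall = df[target_col].mean()                              # overall estimate
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + m)               # weight = n / (n + m)
    encoding = weight * stats["mean"] + (1 - weight) * overall   # per-category encoded value
    return df[cat_col].map(encoding)

# e.g. autos["make_encoded_smooth"] = smoothed_target_encode(autos, "make", "price", m=5.0)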

'''Use Cases for Target Encoding
 - Target encoding is great for:
  - High-cardinality features: A feature with a large number of categories can be troublesome to 
  encode: a one-hot encoding would generate too many features and alternatives, like a label 
  encoding, might not be appropriate for that feature. A target encoding derives numbers for the 
  categories using the feature's most important property: its relationship with the target.
  
  - Domain-motivated features: From prior experience, you might suspect that a categorical feature 
  should be important even if it scored poorly with a feature metric. A target encoding can help 
  reveal a feature's true informativeness.
'''

# Example

# With over 3000 categories, the Zipcode feature makes a good candidate for target encoding,
# and the size of this dataset (over one million rows) means we can spare some data to create
# the encoding. We'll start by creating a 25% split to train the target encoder.

X = df.copy()
y = X.pop('Rating')

X_encode = X.sample(frac=0.25)
y_encode = y[X_encode.index]
X_pretrain = X.drop(X_encode.index)
y_train = y[X_pretrain.index]

# The category_encoders package in scikit-learn-contrib implements an m-estimate encoder, which we'll use to encode our Zipcode feature.
from category_encoders import MEstimateEncoder

# Create the encoder instance. Choose m to control noise.
encoder = MEstimateEncoder(cols=["Zipcode"], m=5.0)

# Fit the encoder on the encoding split.
encoder.fit(X_encode, y_encode)

# Encode the Zipcode column to create the final training data
X_train = encoder.transform(X_pretrain)

# Let's compare the encoded values to the target to see how informative our encoding might be.
import matplotlib.pyplot as plt

plt.figure(dpi=90)
ax = sns.distplot(y, kde=False, norm_hist=True)  # distplot is deprecated in newer seaborn; sns.histplot(y, stat="density") is the replacement
ax = sns.kdeplot(X_train.Zipcode, color='r', ax=ax)
ax.set_xlabel("Rating")
ax.legend(labels=['Zipcode', 'Rating'])

# The distribution of the encoded Zipcode feature roughly follows the distribution of the actual
# ratings, meaning that movie-watchers differed enough in their ratings from zipcode to zipcode
# that our target encoding was able to capture useful information.