Altyn Baigaliyeva

AD654: Marketing Analytics

Boston University

Assignment IV: Classification: Will this Passholder Renew for Next Season?

Part I: Logistic Regression Model:

A. Bring the dataset lobsterland_passholders_dataset.csv into your environment, and use the head() function to explore the variables.

In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
df = pd.read_csv('lobsterland_passholders_dataset.csv')

# Display the first few rows of the dataset to explore the variables
df.head()
Out[2]:
Age Previous_Visits Total_Spend_2024 Feedback_Score Gold_Zone_Visits Email_Engagement_Score Distance_From_Park_Miles Home_State Preferred_Attraction Referral_Source Dining_Plan Renewed_Pass
0 56 4 263.74 3.341462 2 94.9 13.9 VT Thrill Social Media NaN 1
1 69 2 541.82 2.581981 1 28.2 28.5 NY Other Friend Upgraded 1
2 46 3 231.59 3.592377 3 46.3 41.2 MA Other Ad/Other NaN 1
3 32 5 136.98 1.935378 0 56.7 20.7 NH Thrill Friend Upgraded 1
4 60 3 277.30 3.643427 4 95.6 45.3 ME Thrill Social Media Upgraded 1

B. Take a look at the dataset description, along with the dataset itself. Which of the variables here are categorical? Which are numerical?

In [3]:
# Identify numerical and categorical columns

# Numerical columns
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()

# Also include binary variables (like Renewed_Pass) as categorical
binary_columns = [col for col in df.columns if df[col].nunique() == 2 and col not in categorical_columns]
categorical_columns += binary_columns

# Display the results
print("Numerical Variables:")
print(numerical_columns)

print("\nCategorical Variables:")
print(categorical_columns)
Numerical Variables:
['Age', 'Previous_Visits', 'Total_Spend_2024', 'Feedback_Score', 'Gold_Zone_Visits', 'Email_Engagement_Score', 'Distance_From_Park_Miles', 'Renewed_Pass']

Categorical Variables:
['Home_State', 'Preferred_Attraction', 'Referral_Source', 'Dining_Plan', 'Renewed_Pass']

Variable Types Summary

Based on the dataset, we can categorize the variables as follows:

Numerical Variables: These are variables that represent quantitative data and can be used in mathematical operations:

  • Age
  • Previous_Visits
  • Total_Spend_2024
  • Feedback_Score
  • Gold_Zone_Visits
  • Email_Engagement_Score
  • Distance_From_Park_Miles
  • Renewed_Pass (although binary, it is technically numeric)

Categorical Variables: These variables represent qualitative data and consist of categories or labels:

  • Home_State
  • Preferred_Attraction
  • Referral_Source
  • Dining_Plan
  • Renewed_Pass (considered categorical due to its binary nature in analysis)

Understanding the types of variables helps guide appropriate data preprocessing and analysis steps.

C. Use the value_counts() function from pandas to learn more about the outcome variable, ‘Renewed_Pass’.

a. Describe your findings -- what are the different outcome classes here, and how common are each of them in the dataset?

In [4]:
# Use value_counts() to examine the distribution of the outcome variable 'Renewed_Pass'
renewed_counts = df['Renewed_Pass'].value_counts()

# Display the counts
print("Value counts for 'Renewed_Pass':")
print(renewed_counts)

# Also display relative frequencies (percentages)
renewed_percentages = df['Renewed_Pass'].value_counts(normalize=True) * 100
print("\nPercentage distribution:")
print(renewed_percentages)
Value counts for 'Renewed_Pass':
Renewed_Pass
1    971
0     29
Name: count, dtype: int64

Percentage distribution:
Renewed_Pass
1    97.1
0     2.9
Name: proportion, dtype: float64

Findings: Outcome Classes in 'Renewed_Pass'

There are two outcome classes in the Renewed_Pass variable:

  • 1 represents passholders who renewed their pass.
  • 0 represents passholders who did not renew their pass.

Based on the value counts:

  • 971 passholders (97.1%) renewed their pass.
  • 29 passholders (2.9%) did not renew.

This indicates a significant class imbalance, with the vast majority of customers renewing their pass.

D. Missing values. Are there any variables in this dataset with missing values? If so, which variable(s) and how considerable is the issue of missingness?

a. Handle the issue of missingness in any way that you see fit, given the data available to you here. Why did you choose this course of action?

In [5]:
# Check for missing values in the dataset
missing_values = df.isnull().sum()

# Display only the columns with missing values
missing_values = missing_values[missing_values > 0]
print("Missing values in the dataset:")
print(missing_values)

# Optional: show percentage of missing values
missing_percentage = (df.isnull().mean() * 100).round(2)
missing_percentage = missing_percentage[missing_percentage > 0]
print("\nPercentage of missing values:")
print(missing_percentage)

# Handling missing values: Let's fill missing 'Dining_Plan' with a new category 'Unknown'
df['Dining_Plan'] = df['Dining_Plan'].fillna('Unknown')
Missing values in the dataset:
Dining_Plan    323
dtype: int64

Percentage of missing values:
Dining_Plan    32.3
dtype: float64

Missing Values Analysis

We identified that the variable Dining_Plan contains missing values:

  • Total missing: 323 records
  • Percentage of total: 32.3%

No other variables in the dataset have missing values.

Handling Missingness

We handled the missing values in the Dining_Plan column by replacing them with the category "Unknown". This approach was chosen because:

  • Dining_Plan is a categorical variable.
  • A missing value might indicate that the customer did not select or was not offered a dining plan.
  • Replacing missing values with "Unknown" avoids dropping a large portion of data (over 30%), which is especially important considering the dataset's class imbalance on the target variable Renewed_Pass.

This method allows us to retain all records for modeling and analysis, while still flagging the missing information.

E. Impossible values. Are there any values in this dataset that appear to be impossible? If so, why? If not, why not?

a. If some values look impossible to you, use your judgement to determine a suitable way to handle the issue. Why did you take this approach?

In [6]:
# Check for impossible values in numerical columns

# Age should be > 0 and reasonable (e.g., less than 100)
print("Invalid ages:")
print(df[df['Age'] <= 0])

# Previous_Visits should be >= 0
print("\nInvalid Previous_Visits:")
print(df[df['Previous_Visits'] < 0])

# Total_Spend_2024 should be >= 0
print("\nInvalid Total_Spend_2024:")
print(df[df['Total_Spend_2024'] < 0])

# Feedback_Score should be in range 1 to 5
print("\nInvalid Feedback_Score:")
print(df[(df['Feedback_Score'] < 1) | (df['Feedback_Score'] > 5)])

# Gold_Zone_Visits should be >= 0
print("\nInvalid Gold_Zone_Visits:")
print(df[df['Gold_Zone_Visits'] < 0])

# Email_Engagement_Score should be >= 0
print("\nInvalid Email_Engagement_Score:")
print(df[df['Email_Engagement_Score'] < 0])

# Distance_From_Park_Miles should be >= 0
print("\nInvalid Distance_From_Park_Miles:")
print(df[df['Distance_From_Park_Miles'] < 0])
Invalid ages:
Empty DataFrame
Columns: [Age, Previous_Visits, Total_Spend_2024, Feedback_Score, Gold_Zone_Visits, Email_Engagement_Score, Distance_From_Park_Miles, Home_State, Preferred_Attraction, Referral_Source, Dining_Plan, Renewed_Pass]
Index: []

Invalid Previous_Visits:
Empty DataFrame
Columns: [Age, Previous_Visits, Total_Spend_2024, Feedback_Score, Gold_Zone_Visits, Email_Engagement_Score, Distance_From_Park_Miles, Home_State, Preferred_Attraction, Referral_Source, Dining_Plan, Renewed_Pass]
Index: []

Invalid Total_Spend_2024:
     Age  Previous_Visits  Total_Spend_2024  Feedback_Score  Gold_Zone_Visits  \
351   52                4             -3.18        3.991413                 4   
431   42                4            -23.91        3.239964                 6   
459   72                4             -5.35        3.498286                 2   
486   68                5            -33.30        2.646505                 3   
635   34                3             -0.66        3.046109                 1   
745   26                5             -2.23        3.555003                 2   
751   54                6            -39.95        3.142010                 4   

     Email_Engagement_Score  Distance_From_Park_Miles Home_State  \
351                    66.7                      51.2         NJ   
431                    66.8                      32.5         MA   
459                    78.3                      28.7         NH   
486                    42.3                      37.5         NY   
635                    52.9                      53.3         ME   
745                    55.6                      27.8         VT   
751                    58.9                      10.3         NY   

    Preferred_Attraction Referral_Source Dining_Plan  Renewed_Pass  
351        Entertainment        Ad/Other    Upgraded             1  
431                Other    Social Media    Upgraded             1  
459               Thrill        Ad/Other     Unknown             1  
486        Entertainment          Friend    Upgraded             1  
635                Other        Ad/Other    Upgraded             0  
745               Thrill        Ad/Other    Upgraded             1  
751                Other        Ad/Other    Upgraded             1  

Invalid Feedback_Score:
Empty DataFrame
Columns: [Age, Previous_Visits, Total_Spend_2024, Feedback_Score, Gold_Zone_Visits, Email_Engagement_Score, Distance_From_Park_Miles, Home_State, Preferred_Attraction, Referral_Source, Dining_Plan, Renewed_Pass]
Index: []

Invalid Gold_Zone_Visits:
Empty DataFrame
Columns: [Age, Previous_Visits, Total_Spend_2024, Feedback_Score, Gold_Zone_Visits, Email_Engagement_Score, Distance_From_Park_Miles, Home_State, Preferred_Attraction, Referral_Source, Dining_Plan, Renewed_Pass]
Index: []

Invalid Email_Engagement_Score:
     Age  Previous_Visits  Total_Spend_2024  Feedback_Score  Gold_Zone_Visits  \
73    21                4            345.79        2.868348                 1   
200   47                6             86.80        5.000000                 4   

     Email_Engagement_Score  Distance_From_Park_Miles Home_State  \
73                     -8.1                      39.3         ME   
200                   -10.4                       4.5         NH   

    Preferred_Attraction Referral_Source Dining_Plan  Renewed_Pass  
73                Thrill        Ad/Other     Unknown             1  
200                Other        Ad/Other    Upgraded             0  

Invalid Distance_From_Park_Miles:
Empty DataFrame
Columns: [Age, Previous_Visits, Total_Spend_2024, Feedback_Score, Gold_Zone_Visits, Email_Engagement_Score, Distance_From_Park_Miles, Home_State, Preferred_Attraction, Referral_Source, Dining_Plan, Renewed_Pass]
Index: []
In [7]:
# Remove rows with impossible (negative) values in 'Total_Spend_2024'
df = df[df['Total_Spend_2024'] >= 0]

Impossible Values Analysis

Yes, we did find impossible values in the dataset.

Specifically, 7 records in the Total_Spend_2024 column contained negative values, which are impossible in this context because customers cannot spend a negative amount of money. This suggests data entry errors or anomalies.

The validity check also flagged one other column: Email_Engagement_Score contained 2 negative values (-8.1 and -10.4). The remaining columns were clean:

  • Age and Previous_Visits were all non-negative and realistic.
  • Feedback_Score stayed within the plausible 1-5 range.
  • Distance_From_Park_Miles and Gold_Zone_Visits were logical as well.

Handling Strategy

We decided to remove the 7 invalid records with negative spending.
Why?

  • These values cannot be logically corrected (e.g., we don’t know what the real spend should have been).
  • Imputing a value could introduce bias, especially in financial analysis.
  • The number of affected rows is very small (less than 1% of the dataset), so removing them has minimal impact on the overall data.

The two negative Email_Engagement_Score values were left in place in this analysis; with only two affected rows, dropping them would change little, though they could reasonably be removed as well. With the negative-spend records removed, the data is clean and trustworthy for the analysis going forward.

F. Examining correlations

a. Build a correlation table to examine the correlations among your numeric independent variables.

i. Are there any correlations here that are so high as to present a likely problem with multicollinearity? If so, remove one member of any highly-correlated pair. If not, keep rolling on.

In [8]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select only numeric independent variables (excluding the target 'Renewed_Pass')
numeric_vars = df.drop(columns=['Renewed_Pass']).select_dtypes(include=['int64', 'float64'])

# Compute the correlation matrix
corr_matrix = numeric_vars.corr()

# Display the correlation matrix
print("Correlation matrix:")
print(corr_matrix)

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title('Correlation Matrix of Numeric Variables')
plt.show()
Correlation matrix:
                               Age  Previous_Visits  Total_Spend_2024  \
Age                       1.000000         0.004119          0.044761   
Previous_Visits           0.004119         1.000000          0.010654   
Total_Spend_2024          0.044761         0.010654          1.000000   
Feedback_Score           -0.009485         0.015656          0.040455   
Gold_Zone_Visits          0.022811        -0.008493         -0.011877   
Email_Engagement_Score   -0.019586        -0.004299          0.004505   
Distance_From_Park_Miles  0.030491        -0.008424          0.061097   

                          Feedback_Score  Gold_Zone_Visits  \
Age                            -0.009485          0.022811   
Previous_Visits                 0.015656         -0.008493   
Total_Spend_2024                0.040455         -0.011877   
Feedback_Score                  1.000000         -0.046864   
Gold_Zone_Visits               -0.046864          1.000000   
Email_Engagement_Score         -0.039351         -0.012413   
Distance_From_Park_Miles       -0.041179          0.008047   

                          Email_Engagement_Score  Distance_From_Park_Miles  
Age                                    -0.019586                  0.030491  
Previous_Visits                        -0.004299                 -0.008424  
Total_Spend_2024                        0.004505                  0.061097  
Feedback_Score                         -0.039351                 -0.041179  
Gold_Zone_Visits                       -0.012413                  0.008047  
Email_Engagement_Score                  1.000000                 -0.085527  
Distance_From_Park_Miles               -0.085527                  1.000000  
[Figure: seaborn heatmap of the correlation matrix of numeric variables]

Correlation Analysis

We examined the correlation matrix of all numeric independent variables (excluding the target Renewed_Pass).

Key Findings:

  • All correlation values are relatively low, ranging between -0.09 and 0.06.
  • This indicates that there is no strong linear relationship between any pair of numeric variables.

Multicollinearity:

  • No correlations exceed the common threshold of 0.8, which suggests that multicollinearity is not a concern in this dataset.

Action Taken:

  • As a result, no variables were removed. We retain all numeric features for future analysis and modeling.
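
The same conclusion can be verified programmatically by scanning the correlation matrix for any pair above the 0.8 threshold. A minimal sketch, using the corr_matrix computed above:

# Flag any variable pairs whose absolute correlation exceeds 0.8
threshold = 0.8
high_pairs = [
    (corr_matrix.index[i], corr_matrix.columns[j], round(corr_matrix.iloc[i, j], 3))
    for i in range(len(corr_matrix))
    for j in range(i + 1, len(corr_matrix))
    if abs(corr_matrix.iloc[i, j]) > threshold
]
print(high_pairs if high_pairs else "No pairs exceed the 0.8 threshold")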

G. For any variables that need to be dummified, dummify them, being sure to drop one level as you do.

In [9]:
# Identify categorical variables (excluding the target variable 'Renewed_Pass')
categorical_vars = ['Home_State', 'Preferred_Attraction', 'Referral_Source', 'Dining_Plan']

# Perform one-hot encoding with drop_first=True to avoid dummy variable trap
df_encoded = pd.get_dummies(df, columns=categorical_vars, drop_first=True)

# Display first few rows of the encoded dataset
df_encoded.head()
Out[9]:
Age Previous_Visits Total_Spend_2024 Feedback_Score Gold_Zone_Visits Email_Engagement_Score Distance_From_Park_Miles Renewed_Pass Home_State_ME Home_State_NH Home_State_NJ Home_State_NY Home_State_VT Preferred_Attraction_Other Preferred_Attraction_Thrill Referral_Source_Friend Referral_Source_Social Media Dining_Plan_Upgraded
0 56 4 263.74 3.341462 2 94.9 13.9 1 False False False False True False True False True False
1 69 2 541.82 2.581981 1 28.2 28.5 1 False False False True False True False True False True
2 46 3 231.59 3.592377 3 46.3 41.2 1 False False False False False True False False False False
3 32 5 136.98 1.935378 0 56.7 20.7 1 False True False False False False True True False True
4 60 3 277.30 3.643427 4 95.6 45.3 1 True False False False False False True False True True

Dummification of Categorical Variables

We applied one-hot encoding to the following categorical variables:

  • Home_State
  • Preferred_Attraction
  • Referral_Source
  • Dining_Plan

To avoid the dummy variable trap and ensure model interpretability, we set drop_first=True, which removes one category from each variable as a baseline.

This process transforms categorical features into binary (0/1) columns, making them suitable for use in predictive modeling.

H. Create a data partition. For your random_state value, use a number based on either your work, home, or school address, or just a number that you like (For example, I live at 201 Canal Street, I work at 1010 Commonwealth Avenue, and my lucky number is 80, so I could use either 201, 1010, or 80). Assign 40% of your rows to your test set, and 60% to your training set.

a. How did you pick your seed value?

In [10]:
from sklearn.model_selection import train_test_split

# Use 29 as the seed — meaningful personal date
random_seed = 29

# 60% training, 40% test
train_df, test_df = train_test_split(df_encoded, test_size=0.4, random_state=random_seed)

# Display the sizes
print(f"Training set size: {len(train_df)} rows")
print(f"Test set size: {len(test_df)} rows")
Training set size: 595 rows
Test set size: 398 rows

Data Partitioning

We split the dataset into:

  • 60% training set
  • 40% test set

For reproducibility, we used a random_state value of 29, which holds personal significance — it represents the date I met my significant other.

Using a fixed, memorable seed keeps the partition reproducible across different runs.

I. Compare the mean values of the variables in the dataset after grouping by Renewed_Pass.

a. From the results you see here, choose any THREE independent variables from the dataset, and speculate about their likely impact on the result – do you think this variable will be strongly impactful? Why or why not?

(This is not a formal statistical test - the goal here is to look at your results and start to speculate about variables that might be impactful).

In [11]:
# Compare mean values of all features grouped by Renewed_Pass
grouped_means = df_encoded.groupby('Renewed_Pass').mean(numeric_only=True)

# Display the result
grouped_means.T.sort_index()
Out[11]:
Renewed_Pass 0 1
Age 44.678571 46.267358
Dining_Plan_Upgraded 0.642857 0.676684
Distance_From_Park_Miles 29.428571 31.264041
Email_Engagement_Score 50.589286 49.206114
Feedback_Score 2.835381 3.448301
Gold_Zone_Visits 2.107143 2.029016
Home_State_ME 0.107143 0.157513
Home_State_NH 0.107143 0.153368
Home_State_NJ 0.142857 0.163731
Home_State_NY 0.214286 0.169948
Home_State_VT 0.392857 0.165803
Preferred_Attraction_Other 0.571429 0.425907
Preferred_Attraction_Thrill 0.250000 0.377202
Previous_Visits 4.642857 4.977202
Referral_Source_Friend 0.142857 0.174093
Referral_Source_Social Media 0.214286 0.203109
Total_Spend_2024 161.072857 250.191492

Group Comparison by Renewed_Pass

We compared the mean values of all numeric features, grouped by the Renewed_Pass outcome (0 = did not renew, 1 = renewed).

Observations & Speculation:

  1. Total_Spend_2024

    • Mean spend for renewers: 250.19
    • Mean spend for non-renewers: 161.07
    • Speculation: Users who spend more are likely more satisfied with the park experience and more invested in it, making them more likely to renew. This is likely to be a strong predictor.
  2. Feedback_Score

    • Mean for renewers: 3.45
    • Mean for non-renewers: 2.84
    • Speculation: More satisfied users (those who leave higher feedback) are more inclined to renew their pass. This variable also shows a clear difference and may be a moderately strong predictor.
  3. Email_Engagement_Score

    • Mean for renewers: 49.21
    • Mean for non-renewers: 50.59
    • Speculation: Interestingly, users who renewed were slightly less engaged with email. This is a surprising result and may imply that email engagement is not a strong predictor on its own — possibly a weak or misleading predictor without further context.

These insights help form early hypotheses about which variables may influence pass renewal, though a formal model would be required to confirm these patterns.
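
Although the assignment stresses that this step is not a formal statistical test, a quick two-sample comparison can show whether these group differences are larger than chance would suggest. A minimal, optional sketch using scipy's Welch t-test on the encoded data from above:

from scipy.stats import ttest_ind

# Welch's t-test comparing group means for the three candidate predictors
for col in ['Total_Spend_2024', 'Feedback_Score', 'Email_Engagement_Score']:
    renewed = df_encoded.loc[df_encoded['Renewed_Pass'] == 1, col]
    not_renewed = df_encoded.loc[df_encoded['Renewed_Pass'] == 0, col]
    t_stat, p_val = ttest_ind(renewed, not_renewed, equal_var=False)
    print(f"{col}: t = {t_stat:.2f}, p = {p_val:.4f}")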

Iteration #1

J. Build a logistic regression model using statsmodels, with the outcome variable ‘Renewed_Pass’. Use the rest of the remaining variables from the dataset as inputs. Remember to use only your training data to build this model.

In [12]:
import statsmodels.api as sm

# Separate features and target
X_train = train_df.drop(columns=['Renewed_Pass'])
y_train = train_df['Renewed_Pass']

# Add constant (intercept)
X_train_const = sm.add_constant(X_train)

# Fix data types
X_train_const = X_train_const.astype(float)

# Fit the model
logit_model = sm.Logit(y_train, X_train_const)
result = logit_model.fit()

# Show results
result.summary()
Optimization terminated successfully.
         Current function value: 0.093290
         Iterations 9
Out[12]:
Logit Regression Results
Dep. Variable: Renewed_Pass No. Observations: 595
Model: Logit Df Residuals: 577
Method: MLE Df Model: 17
Date: Sat, 29 Mar 2025 Pseudo R-squ.: 0.2462
Time: 12:02:31 Log-Likelihood: -55.508
converged: True LL-Null: -73.638
Covariance Type: nonrobust LLR p-value: 0.004232
coef std err z P>|z| [0.025 0.975]
const -1.8133 2.312 -0.784 0.433 -6.344 2.717
Age 0.0005 0.017 0.027 0.978 -0.032 0.033
Previous_Visits 0.0515 0.122 0.423 0.672 -0.187 0.290
Total_Spend_2024 0.0147 0.004 3.756 0.000 0.007 0.022
Feedback_Score 0.7384 0.315 2.346 0.019 0.122 1.355
Gold_Zone_Visits -0.1032 0.185 -0.559 0.576 -0.465 0.259
Email_Engagement_Score 0.0054 0.015 0.357 0.721 -0.024 0.035
Distance_From_Park_Miles 0.0097 0.017 0.573 0.566 -0.023 0.043
Home_State_ME -0.2749 1.455 -0.189 0.850 -3.126 2.577
Home_State_NH -0.3142 1.454 -0.216 0.829 -3.164 2.535
Home_State_NJ -1.2240 1.263 -0.969 0.333 -3.700 1.252
Home_State_NY -1.2570 1.165 -1.079 0.281 -3.540 1.026
Home_State_VT -2.4701 1.123 -2.200 0.028 -4.671 -0.270
Preferred_Attraction_Other 0.1490 0.658 0.227 0.821 -1.140 1.438
Preferred_Attraction_Thrill 1.3398 0.818 1.638 0.101 -0.264 2.943
Referral_Source_Friend -0.0265 0.754 -0.035 0.972 -1.503 1.450
Referral_Source_Social Media -0.5496 0.770 -0.714 0.475 -2.058 0.959
Dining_Plan_Upgraded 0.7144 0.636 1.123 0.262 -0.533 1.962

Logistic Regression – Iteration #1 Summary

We trained a logistic regression model using all available features on the training set. Key observations from the output:

Statistically Significant Variables (p < 0.05):

  • Total_Spend_2024 (p = 0.000): Strong positive predictor. Higher spending is associated with increased likelihood of renewal.
  • Feedback_Score (p = 0.019): Positive predictor. Satisfied customers are more likely to renew.
  • Home_State_VT (p = 0.028): Negative predictor. Residents of Vermont are significantly less likely to renew.

Model Fit:

  • Pseudo R-squared: 0.2462 — indicating moderate explanatory power.
  • Model converged successfully and included 595 observations.

Next Steps: In the next iteration, we may consider:

  • Removing non-significant variables (e.g., Age, Gold_Zone_Visits).
  • Refining the model to improve interpretability and reduce noise.

This model serves as a solid foundation for evaluating feature importance and optimizing predictive performance.

K. Show the summary of your model with log_reg.summary(). (Note: If you named your model something else, e.g. mymodel, you can just use mymodel.summary() here).

a. Which of your numeric variables here are showing high p-values?

b. For your categorical variables, which ones are showing high p-values for ALL of the levels in the model summary?

Model Summary Analysis

a. Numeric variables with high p-values (> 0.05):

  • Age (p = 0.978)
  • Previous_Visits (p = 0.672)
  • Gold_Zone_Visits (p = 0.576)
  • Email_Engagement_Score (p = 0.721)
  • Distance_From_Park_Miles (p = 0.566)

These variables did not demonstrate statistical significance and may not contribute meaningfully to the model.

b. Categorical variables where all levels show high p-values:

  • Referral_Source: both levels (Friend and Social Media) have p-values above 0.4, suggesting no significant impact at any level.
  • Preferred_Attraction: both levels exceed 0.05, though Thrill is borderline (p ≈ 0.10).
  • Dining_Plan: its single dummy level (Upgraded) has p = 0.262, not significant.

Conclusion: We may consider dropping or simplifying these features in the next iteration to improve model clarity and performance.
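
These lists can also be pulled programmatically from the fitted statsmodels result, which is convenient when deciding what to drop. A minimal sketch using the result object from Iteration #1:

# Coefficients from Iteration #1 with p-values above 0.05 (intercept excluded)
high_p = result.pvalues.drop('const')
print(high_p[high_p > 0.05].sort_values(ascending=False))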

Iteration #2

L. Now, build yet another model. Again use statsmodels, and again, use your training set only. Start with the variables you used in Iteration #1 but drop the ones you identified in the previous step, for parts (a) and (b).

a. Show the results of this 2nd model with log_reg.summary().

In [13]:
import statsmodels.api as sm

# Define reduced feature set for Iteration #2
reduced_columns = [
    'Total_Spend_2024',
    'Feedback_Score',
    'Home_State_VT',
    'Preferred_Attraction_Other',
    'Preferred_Attraction_Thrill',
    'Dining_Plan_Upgraded'
]

# Prepare training data
X_train_reduced = train_df[reduced_columns]
y_train = train_df['Renewed_Pass']

# Add constant
X_train_reduced_const = sm.add_constant(X_train_reduced)
X_train_reduced_const = X_train_reduced_const.astype(float)

# Fit model
logit_model_2 = sm.Logit(y_train, X_train_reduced_const)
result_2 = logit_model_2.fit()

# Show summary
result_2.summary()
Optimization terminated successfully.
         Current function value: 0.096400
         Iterations 9
Out[13]:
Logit Regression Results
Dep. Variable: Renewed_Pass No. Observations: 595
Model: Logit Df Residuals: 588
Method: MLE Df Model: 6
Date: Sat, 29 Mar 2025 Pseudo R-squ.: 0.2211
Time: 12:02:31 Log-Likelihood: -57.358
converged: True LL-Null: -73.638
Covariance Type: nonrobust LLR p-value: 1.273e-05
coef std err z P>|z| [0.025 0.975]
const -1.7172 1.389 -1.236 0.216 -4.439 1.005
Total_Spend_2024 0.0141 0.004 3.769 0.000 0.007 0.021
Feedback_Score 0.6804 0.288 2.362 0.018 0.116 1.245
Home_State_VT -1.6841 0.578 -2.916 0.004 -2.816 -0.552
Preferred_Attraction_Other 0.1913 0.638 0.300 0.764 -1.059 1.442
Preferred_Attraction_Thrill 1.3741 0.794 1.730 0.084 -0.182 2.931
Dining_Plan_Upgraded 0.5579 0.580 0.962 0.336 -0.579 1.695

Iteration #2 – Logistic Regression Summary

In this second model, we excluded variables with high p-values from Iteration #1 to simplify the model and focus on significant predictors.

Significant Variables (p < 0.05):

  • Total_Spend_2024: Strong positive predictor. Higher spending increases the likelihood of renewal.
  • Feedback_Score: Positive predictor. More satisfied customers are more likely to renew.
  • Home_State_VT: Negative predictor. Residents of Vermont are significantly less likely to renew.

Model Fit:

  • Pseudo R-squared: 0.2211 — slightly lower than the full model, but the model is more concise.
  • LLR p-value: 1.27e-05 — overall model is statistically significant.

Next Steps: We may consider:

  • Dropping or combining additional variables like Preferred_Attraction_* and Dining_Plan_Upgraded if they continue to show weak statistical contribution.
  • Moving toward model validation and interpretation on the test set.

This iteration helps clarify which variables meaningfully influence pass renewal.

M. Using scikit-learn, build another version of your model, using your remaining variables. You will use this version of the model for all remaining steps.

In [14]:
selected_features = [
    'Total_Spend_2024',
    'Feedback_Score',
    'Home_State_VT',
    'Preferred_Attraction_Other',
    'Preferred_Attraction_Thrill',
    'Dining_Plan_Upgraded'
]
In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Define X and y for train and test sets
X_train_final = train_df[selected_features]
y_train_final = train_df['Renewed_Pass']

X_test_final = test_df[selected_features]
y_test_final = test_df['Renewed_Pass']

# Optional: scale features (especially useful for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_final)
X_test_scaled = scaler.transform(X_test_final)

# Build logistic regression model
clf = LogisticRegression(random_state=29)
clf.fit(X_train_scaled, y_train_final)
Out[15]:
LogisticRegression(random_state=29)

Logistic Regression (scikit-learn version)

We recreated our logistic regression model using scikit-learn, based on the final list of selected features:

  • Total_Spend_2024
  • Feedback_Score
  • Home_State_VT
  • Preferred_Attraction_Other
  • Preferred_Attraction_Thrill
  • Dining_Plan_Upgraded

We also applied feature scaling using StandardScaler, which puts the predictors on a comparable scale and helps the solver converge.

This model will be used for all future evaluations, including accuracy, confusion matrix, and ROC curve.

N. Assess the performance of your model against the test set. Build a confusion matrix, and answer the following questions about your model. You can use Python functions to answer any of these questions or you can use your confusion matrix to determine the answers in a slightly more manual way. The ‘positive’ class in this model is represented by the “1” outcome.

a. What is your model’s accuracy rate?

b. What is your model’s sensitivity rate?

c. What is your model’s specificity rate?

d. What is your model’s precision?

e. What is your model’s balanced accuracy?

In [16]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, balanced_accuracy_score

# Predict on the test set
y_pred = clf.predict(X_test_scaled)

# Build confusion matrix
cm = confusion_matrix(y_test_final, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Confusion Matrix:")
print(cm)

# Accuracy
accuracy = accuracy_score(y_test_final, y_pred)

# Sensitivity (Recall for class 1)
sensitivity = recall_score(y_test_final, y_pred)

# Specificity = TN / (TN + FP)
specificity = tn / (tn + fp)

# Precision = TP / (TP + FP)
precision = precision_score(y_test_final, y_pred)

# Balanced accuracy = average of sensitivity and specificity
balanced_acc = balanced_accuracy_score(y_test_final, y_pred)

# Display metrics
print(f"\nAccuracy: {accuracy:.4f}")
print(f"Sensitivity (Recall): {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Balanced Accuracy: {balanced_acc:.4f}")
Confusion Matrix:
[[  0  12]
 [  0 386]]

Accuracy: 0.9698
Sensitivity (Recall): 1.0000
Specificity: 0.0000
Precision: 0.9698
Balanced Accuracy: 0.5000

Model Evaluation on Test Set

We evaluated the performance of the logistic regression model using the test set. The positive class represents users who renewed their pass (Renewed_Pass = 1).

Confusion Matrix:

Predicted 0 Predicted 1
Actual 0 (No) 0 12
Actual 1 (Yes) 0 386

Metrics:

  • Accuracy: 96.98%
  • Sensitivity (Recall): 100.00%
  • Specificity: 0.00%
  • Precision: 96.98%
  • Balanced Accuracy: 50.00%

Interpretation: The model perfectly predicts the positive class, but fails to identify any of the negative cases — it classifies every test observation as “Renewed” (class 1).
This results in high accuracy and sensitivity, but zero specificity, which suggests the model is biased due to strong class imbalance in the dataset (almost all users renewed).

Further improvement might include:

  • Applying class balancing techniques (e.g. SMOTE, class weights)
  • Exploring additional predictors or thresholds
  • Using evaluation metrics suited for imbalanced data (e.g. ROC AUC, F1-score)
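
As one concrete illustration of the class-weighting suggestion, scikit-learn's LogisticRegression accepts class_weight='balanced', which reweights the loss inversely to class frequencies. A sketch using the scaled training data from above (an optional experiment, not a required step):

# Refit with balanced class weights so the rare non-renewers carry more weight
clf_balanced = LogisticRegression(class_weight='balanced', random_state=29)
clf_balanced.fit(X_train_scaled, y_train_final)

y_pred_bal = clf_balanced.predict(X_test_scaled)
print(confusion_matrix(y_test_final, y_pred_bal))
print(f"Balanced accuracy: {balanced_accuracy_score(y_test_final, y_pred_bal):.4f}")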

O. Compare your model’s accuracy against the training set vs. accuracy against the test set (just use accuracy only for this).

a. What is the purpose of comparing those two values?

b. In this case, what does the comparison of those values suggest about the model that you have built?

In [17]:
# Accuracy on training set
train_preds = clf.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train_final, train_preds)

# Accuracy on test set (already computed above)
test_accuracy = accuracy_score(y_test_final, y_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
Training Accuracy: 0.9731
Test Accuracy: 0.9698

Training vs. Test Accuracy Comparison

We compared the model's performance on the training and test sets:

  • Training Accuracy: 97.31%
  • Test Accuracy: 96.98%

a. What is the purpose of comparing those two values?

Comparing training and test accuracy helps evaluate the model's generalization:

  • A big gap may indicate overfitting
  • Very close values suggest that the model behaves similarly on new/unseen data

b. What does this comparison suggest in our case?

Since both accuracies are very high and very close (difference ≈ 0.3%), the model appears to generalize well.

However, accuracy alone may be misleading here because:

  • The dataset is highly imbalanced (most users renewed their pass)
  • The model predicted all test samples as the positive class (Renewed)

So, even though accuracy is high, the model fails to identify non-renewers, which is visible from:

  • Specificity = 0.00
  • Balanced Accuracy = 0.50

This suggests the model may need:

  • Rebalancing techniques (e.g., SMOTE or class weighting)
  • Use of better evaluation metrics (like AUC, F1-score) for imbalanced data

P. Make up a passholder. Assign this customer a value for each predictor variable in this model, and store the results in a new dataframe. Now, put your passholder through this model.

a. What did your model predict -- will this passholder renew?

b. According to your model, what is the probability that the passholder will renew?

In [18]:
selected_features = [
    'Total_Spend_2024',
    'Feedback_Score',
    'Home_State_VT',
    'Preferred_Attraction_Other',
    'Preferred_Attraction_Thrill',
    'Dining_Plan_Upgraded'
]
In [19]:
import pandas as pd

# Create a made-up passholder
new_passholder = pd.DataFrame([{
    'Total_Spend_2024': 275.00,       # high spending
    'Feedback_Score': 4.2,            # good experience
    'Home_State_VT': 0,               # not from Vermont
    'Preferred_Attraction_Other': 0,  # not other
    'Preferred_Attraction_Thrill': 1, # loves thrill
    'Dining_Plan_Upgraded': 1         # upgraded plan
}])

# Scale the input using the same scaler
new_scaled = scaler.transform(new_passholder)

# Predict class
prediction = clf.predict(new_scaled)[0]

# Predict probability
probability = clf.predict_proba(new_scaled)[0][1]  # Probability of class 1

print(f"Prediction (1 = Renew): {prediction}")
print(f"Probability of renewal: {probability:.4f}")
Prediction (1 = Renew): 1
Probability of renewal: 0.9984

Prediction for a Hypothetical Passholder

We created a fictional passholder with the following characteristics:

  • Total_Spend_2024: 275.00
  • Feedback_Score: 4.2
  • Home_State_VT: 0 (not from Vermont)
  • Preferred_Attraction_Thrill: 1
  • Preferred_Attraction_Other: 0
  • Dining_Plan_Upgraded: 1

a. What did the model predict?

  • Prediction (class): 1
    → The model predicts that this passholder will renew their pass.

b. What is the probability of renewal?

  • Predicted Probability: 99.84%
    → According to the model, this passholder has a very high likelihood of renewing.

This prediction aligns with the fact that the customer has high spending, gave strong feedback, and is engaged with premium park features.

Q. When using a logistic regression model to make predictions, why is it important to only use values within the range of the dataset used to build the model?

a. Make a new dataframe, but this time, for the numeric predictor variables, select some numbers that are outside the range of the dataset -- do not use a 400+ year-old vampire named “Mary.” Use your model to make a prediction for this new dataframe. What do you notice about the result? (To answer this, don’t simply state the predicted outcome, but also write 1-2 sentences of explanation for what you see).

In [20]:
# Made-up customer with values far outside the dataset's normal range
extreme_passholder = pd.DataFrame([{
    'Total_Spend_2024': 10000,     # extremely high spend
    'Feedback_Score': -5.0,        # invalid low feedback
    'Home_State_VT': 0,
    'Preferred_Attraction_Other': 1,
    'Preferred_Attraction_Thrill': 0,
    'Dining_Plan_Upgraded': 1
}])

# Scale and predict
extreme_scaled = scaler.transform(extreme_passholder)
extreme_pred = clf.predict(extreme_scaled)[0]
extreme_prob = clf.predict_proba(extreme_scaled)[0][1]

print(f"Prediction (1 = Renew): {extreme_pred}")
print(f"Probability of renewal: {extreme_prob:.4f}")
Prediction (1 = Renew): 1
Probability of renewal: 1.0000

Why is it important to only use values within the range of the training dataset?

Logistic regression models learn patterns from the training data and generalize within that observed range. If we input extreme or unrealistic values, especially for numeric variables, the model can produce misleading or overconfident predictions. This happens because the model tries to extrapolate beyond its knowledge, despite having no real-world basis for doing so.
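
The overconfidence comes from the shape of the logistic function itself: once the linear predictor is pushed far beyond the range seen in training, the sigmoid saturates at 0 or 1. A tiny numeric illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Moderate linear predictors give nuanced probabilities; extreme
# (extrapolated) ones saturate to near-certainty
for z in [1.0, 3.0, 10.0, 50.0]:
    print(f"linear predictor = {z:5.1f} -> P(renew) = {sigmoid(z):.6f}")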


a. What happens when we feed the model out-of-range values?

We created an extreme, unrealistic passholder with the following inputs:

  • Total_Spend_2024: 10,000 (massive outlier)
  • Feedback_Score: -5.0 (invalid negative value)
  • Other features: plausible

Model Output:

  • Predicted class: 1 (will renew)
  • Predicted probability: 100.00%

Interpretation: Although this passholder had nonsensical values (e.g., negative feedback), the model predicted renewal with perfect confidence. This result is not reliable, and shows that the model overconfidently extrapolated based on data it has never seen.

Conclusion: It's essential to keep predictor values within a realistic and observed range to ensure predictions are meaningful and trustworthy.
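
One lightweight safeguard is to compare new inputs against the ranges observed in training before trusting a prediction. A sketch, where check_in_range is a hypothetical helper rather than part of the assignment:

# Hypothetical guard: flag numeric inputs outside the training range
def check_in_range(new_row, train_X, cols):
    for col in cols:
        lo, hi = train_X[col].min(), train_X[col].max()
        val = new_row[col].iloc[0]
        if not lo <= val <= hi:
            print(f"Warning: {col} = {val} outside training range [{lo}, {hi}]")

check_in_range(extreme_passholder, X_train_final, ['Total_Spend_2024', 'Feedback_Score'])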

Part II: Random Forest Model

R. Read the dataset back into Python. For the steps you took in the previous section with regards to missingness and impossible values, repeat those here.

In [21]:
# Step 1: Read the original dataset again
df_raw = pd.read_csv('lobsterland_passholders_dataset.csv')

# Step 2: Handle missing values
# Fill missing Dining_Plan values with 'Unknown'
df_raw['Dining_Plan'] = df_raw['Dining_Plan'].fillna('Unknown')

# Step 3: Handle impossible values
# Remove rows where Total_Spend_2024 is negative
df_cleaned = df_raw[df_raw['Total_Spend_2024'] >= 0]

Re-importing and Re-cleaning the Dataset

To ensure consistency and reproducibility, we reloaded the original dataset and repeated all necessary data cleaning steps:

Missing Values:

  • The column Dining_Plan contained 323 missing values.
  • These were replaced with the category "Unknown" to retain all records while preserving the information.

Impossible Values:

  • 7 records had negative values in the Total_Spend_2024 column.
  • These were removed, as negative spending is not valid and cannot be reliably imputed.

After these steps, the dataset is now clean and ready for further analysis or modeling.

S. Dummify the categorical inputs again, but this time, don’t drop any levels.

In [22]:
# Identify categorical columns (excluding target)
categorical_vars = ['Home_State', 'Preferred_Attraction', 'Referral_Source', 'Dining_Plan']

# Dummify without dropping any levels
df_dummified = pd.get_dummies(df_cleaned, columns=categorical_vars, drop_first=False)

# View the resulting DataFrame
df_dummified.head()
Out[22]:
Age Previous_Visits Total_Spend_2024 Feedback_Score Gold_Zone_Visits Email_Engagement_Score Distance_From_Park_Miles Renewed_Pass Home_State_MA Home_State_ME ... Home_State_NY Home_State_VT Preferred_Attraction_Entertainment Preferred_Attraction_Other Preferred_Attraction_Thrill Referral_Source_Ad/Other Referral_Source_Friend Referral_Source_Social Media Dining_Plan_Unknown Dining_Plan_Upgraded
0 56 4 263.74 3.341462 2 94.9 13.9 1 False False ... False True False False True False False True True False
1 69 2 541.82 2.581981 1 28.2 28.5 1 False False ... True False False True False False True False False True
2 46 3 231.59 3.592377 3 46.3 41.2 1 True False ... False False False True False True False False True False
3 32 5 136.98 1.935378 0 56.7 20.7 1 False False ... False False False False True False True False False True
4 60 3 277.30 3.643427 4 95.6 45.3 1 False True ... False False False False True False False True False True

5 rows × 22 columns

Full Dummification (No Levels Dropped)

We transformed all categorical input variables into dummy (one-hot encoded) variables using pd.get_dummies() with drop_first=False.

This time, we retained all category levels, which means:

  • No reference category was dropped.
  • Each original category is now represented by its own binary (0/1) column.

This approach may introduce multicollinearity in linear models but can be helpful when we want full interpretability or use tree-based models that are not affected by linearly dependent inputs.

T. Re-partition the data, using the same seed value that you used in the previous part of this assignment.

In [23]:
from sklearn.model_selection import train_test_split

# Separate target and features
X = df_dummified.drop(columns=['Renewed_Pass'])
y = df_dummified['Renewed_Pass']

# Re-partition the data (60% train, 40% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=29
)

# Show sizes
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
Training set size: 595
Test set size: 398

Re-Partitioning the Data

We re-partitioned the fully dummified dataset using the same logic as before:

  • Training set size: 595 observations
  • Test set size: 398 observations
  • Random seed used: 29 (symbolic value chosen earlier in the project)

Using the same seed ensures that the split is consistent and reproducible, even though the input features have changed (e.g., now we retained all dummy variable levels).

U. Build a random forest model in Python with your training set. Use the same input variables, and same output variable, as you used in the first logistic regression model (the only difference here is that the categories should not have any levels dropped). Use GridSearchCV to help you determine the best hyperparameter settings for your model.

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Define base model
rf = RandomForestClassifier(random_state=29)

# Define parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2']
}

# Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit on training data
grid_search.fit(X_train, y_train)

# Best model
best_rf = grid_search.best_estimator_

# Predict on test set
y_pred_rf = best_rf.predict(X_test)

# Evaluate accuracy
rf_accuracy = accuracy_score(y_test, y_pred_rf)

# Output
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Test Set Accuracy: {rf_accuracy:.4f}")
Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Test Set Accuracy: 0.9698

Random Forest Model with GridSearchCV

We trained a random forest classifier on the fully dummified dataset using the original input variables from the first logistic regression model. This time, however, we retained all dummy levels for the categorical variables.

Model Setup:

  • Target variable: Renewed_Pass
  • Input variables: All remaining features in the dummified dataset

Hyperparameter Optimization: We used GridSearchCV to tune the following parameters:

  • n_estimators: Number of trees in the forest
  • max_depth: Maximum depth of each tree
  • min_samples_split: Minimum number of samples required to split a node
  • min_samples_leaf: Minimum number of samples required at a leaf node
  • max_features: Number of features to consider when looking for the best split

Best Hyperparameters Found:

{
    'n_estimators': 100,
    'max_depth': None,
    'max_features': 'sqrt',
    'min_samples_split': 2,
    'min_samples_leaf': 1
}

Model Performance:
Test Set Accuracy: 96.98%

The model's test accuracy exactly matches that of the logistic regression model. As the confusion matrix in the next step shows, both models classify every test observation as the majority (renewing) class, so the headline accuracy largely reflects the class imbalance rather than the forest's ability to capture complex interactions and nonlinear relationships.

V. How did your random forest model rank the variables in order of importance, from highest to lowest? For a random forest model, how can you interpret feature importance?

Feature Importance in Random Forest

a. Ranking of Variables (Highest to Lowest)

Based on the trained Random Forest model, here are the top features ranked by importance:

  1. Total_Spend_2024
  2. Feedback_Score
  3. Email_Engagement_Score
  4. Previous_Visits
  5. Dining_Plan_Upgraded
  6. Gold_Zone_Visits
  7. Distance_From_Park_Miles
  8. Preferred_Attraction_Thrill
  9. Referral_Source_Social Media
  10. Home_State_NY
  11. (other features with lower importance...)

These importance values reflect how much each variable contributed to reducing impurity (e.g., Gini index) in the decision trees that make up the forest.
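
A ranking like the one above can be extracted from the fitted model's feature_importances_ attribute. A minimal sketch using best_rf and X_train from the previous cells:

# Impurity-based importances from the fitted forest, highest to lowest
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))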


b. How to Interpret Feature Importance in a Random Forest Model

Random Forest models estimate feature importance by measuring how much each feature improves the quality of the splits across all trees in the forest.

  • A feature is more "important" if it is frequently used for splits and contributes to a significant reduction in node impurity (like Gini or entropy).
  • The values are relative and sum to 1.0 across all features.
  • Higher values mean the feature plays a more critical role in making accurate predictions.

Important Notes:

  • Feature importance does not imply causality — it simply reflects predictive contribution.
  • Correlated features can share importance, which may dilute individual rankings.
  • These values are especially helpful in interpreting black-box models like Random Forests and guiding feature selection in future iterations.
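
Because impurity-based importances can be diluted by correlated features, permutation importance offers a complementary, model-agnostic view. An optional sketch using scikit-learn:

from sklearn.inspection import permutation_importance

# Drop in test-set score when each feature is shuffled, averaged over repeats
perm = permutation_importance(best_rf, X_test, y_test, n_repeats=10, random_state=29)
perm_mean = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_mean.sort_values(ascending=False).head(10))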

W. Assess the performance of your model against the test set. Build a confusion matrix to do this. You can use Python functions to answer any of these questions or you can use your confusion matrix to determine the answers in a slightly more manual way. The ‘positive’ class in this model is represented by the “1” outcome.

a. What is your model’s accuracy rate?

b. What is your model’s sensitivity rate?

c. What is your model’s specificity rate?

d. What is your model’s precision?

e. What is your model’s balanced accuracy?

In [25]:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, balanced_accuracy_score

# Predict using the best Random Forest model
y_pred_rf = best_rf.predict(X_test)

# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
tn, fp, fn, tp = cm_rf.ravel()

print("Confusion Matrix:")
print(cm_rf)

# Accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# Sensitivity (Recall for class 1)
sensitivity_rf = recall_score(y_test, y_pred_rf)

# Specificity = TN / (TN + FP)
specificity_rf = tn / (tn + fp)

# Precision = TP / (TP + FP)
precision_rf = precision_score(y_test, y_pred_rf)

# Balanced accuracy
balanced_acc_rf = balanced_accuracy_score(y_test, y_pred_rf)

# Display all metrics
print(f"\nAccuracy: {accuracy_rf:.4f}")
print(f"Sensitivity (Recall): {sensitivity_rf:.4f}")
print(f"Specificity: {specificity_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Balanced Accuracy: {balanced_acc_rf:.4f}")
Confusion Matrix:
[[  0  12]
 [  0 386]]

Accuracy: 0.9698
Sensitivity (Recall): 1.0000
Specificity: 0.0000
Precision: 0.9698
Balanced Accuracy: 0.5000

Model Evaluation Using Confusion Matrix

We evaluated the performance of the Random Forest model on the test set.

Confusion Matrix:

Predicted 0 Predicted 1
Actual 0 (No) 0 12
Actual 1 (Yes) 0 386

a. Accuracy Rate:

  • Accuracy = (TP + TN) / Total = (386 + 0) / 398 = 96.98%

b. Sensitivity (Recall for class 1):

  • Sensitivity = TP / (TP + FN) = 386 / (386 + 0) = 1.0000 (100%)

c. Specificity (Recall for class 0):

  • Specificity = TN / (TN + FP) = 0 / (0 + 12) = 0.0000 (0%)

d. Precision:

  • Precision = TP / (TP + FP) = 386 / (386 + 12) = 0.9698 (96.98%)

e. Balanced Accuracy:

  • Balanced Accuracy = (Sensitivity + Specificity) / 2 = (1.0 + 0.0) / 2 = 0.5000 (50%)

Interpretation:

While the model shows very high overall accuracy (96.98%) and perfect recall for the positive class, it fails to correctly classify any negative examples — it predicts everyone will renew their pass.
This is likely due to severe class imbalance in the dataset (most users renewed), which leads the model to favor the majority class.

To improve performance on the minority class (non-renewers), we could:

  • Apply class balancing techniques (e.g., SMOTE, class weights)
  • Explore alternative metrics (F1-score, ROC AUC)
  • Adjust the decision threshold for classification
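
As an illustration of the threshold-adjustment idea, the forest's predicted probabilities can be cut at a stricter value than the default 0.5, so that low-confidence cases are classified as non-renewers. A sketch, where the 0.9 cutoff is an arbitrary example that would need tuning:

# Require a very high predicted probability before classifying as "renew"
proba_renew = best_rf.predict_proba(X_test)[:, 1]
y_pred_thresh = (proba_renew >= 0.9).astype(int)

print(confusion_matrix(y_test, y_pred_thresh))
print(f"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_thresh):.4f}")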

X. Compare your model’s accuracy against the training set vs. your model’s accuracy against the test set. How different were these results?

In [26]:
# Accuracy on training set
y_train_pred_rf = best_rf.predict(X_train)
train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)

# Accuracy on test set (already computed above)
print(f"Training Accuracy: {train_accuracy_rf:.4f}")
print(f"Test Accuracy: {accuracy_rf:.4f}")
Training Accuracy: 1.0000
Test Accuracy: 0.9698

Training vs. Test Accuracy Comparison (Random Forest)

Accuracy Results:

  • Training Accuracy: 100.00%
  • Test Accuracy: 96.98%

Interpretation:

The model achieved perfect accuracy on the training set but slightly lower accuracy on the test set.

This gap indicates overfitting: the forest has memorized the training data, including its handful of non-renewers, yet performs worse on new, unseen data.

Moreover, the 96.98% test accuracy is less reassuring than it looks. It equals the share of renewers in the test set, and the confusion matrix from the previous step shows the model classifies every test observation as a renewer, so the apparent generalization owes more to the class imbalance than to a learned signal for the minority class.

To improve:

  • Applying regularization or hyperparameter tuning to prevent overfitting
  • Using other metrics like ROC AUC to evaluate model performance in more detail
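
The ROC AUC suggestion is straightforward to act on, since it evaluates the ranking of predicted probabilities across all thresholds rather than only at 0.5. A minimal sketch:

from sklearn.metrics import roc_auc_score

# AUC scores the ranking of predicted probabilities, not the hard 0/1 labels
auc_rf = roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1])
print(f"Random Forest test ROC AUC: {auc_rf:.4f}")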

Y. Use the predict() function with your model to classify the person who you invented in the previous section. Does the model think that the passholder will renew?

(Note: This question says to “classify the person.” It does not say that the dataframe will be set up in the exact same way).

In [30]:
# Step 1: Create a DataFrame with all columns from training data, filled with zeros
new_passholder_full = pd.DataFrame(0, index=[0], columns=X_train.columns)

# Step 2: Set realistic values for numeric variables
new_passholder_full['Age'] = 35                      # Example age
new_passholder_full['Previous_Visits'] = 3           # Example visits
new_passholder_full['Total_Spend_2024'] = 275.00     # High spending
new_passholder_full['Feedback_Score'] = 4.2          # Positive feedback
new_passholder_full['Gold_Zone_Visits'] = 2          # Example visits
new_passholder_full['Email_Engagement_Score'] = 60   # Example engagement
new_passholder_full['Distance_From_Park_Miles'] = 25 # Example distance

# Step 3: Set categorical variables explicitly
new_passholder_full['Home_State_VT'] = 0                  # Not from Vermont
new_passholder_full['Home_State_MA'] = 1                  # From Massachusetts
new_passholder_full['Preferred_Attraction_Thrill'] = 1    # Prefers thrill attractions
new_passholder_full['Preferred_Attraction_Entertainment'] = 0  # Does not prefer entertainment
new_passholder_full['Preferred_Attraction_Other'] = 0         # Not 'Other'
new_passholder_full['Referral_Source_Social Media'] = 1       # Referred by Social Media
new_passholder_full['Dining_Plan_Upgraded'] = 1               # Dining plan upgraded

# Step 4: Predict using the Random Forest model
prediction_rf = best_rf.predict(new_passholder_full)[0]

# Step 5: Probability of renewal
probability_rf = best_rf.predict_proba(new_passholder_full)[0][1]

# Output results
print(f"Prediction (1 = Renew): {prediction_rf}")
print(f"Probability of renewal: {probability_rf:.4f}")
Prediction (1 = Renew): 1
Probability of renewal: 1.0000

Model Prediction for the Invented Passholder

We classified a fictional passholder using the trained Random Forest model. The passholder had the following characteristics:

  • Age: 35
  • Previous Visits: 3
  • Total Spend (2024): $275.00 (high)
  • Feedback Score: 4.2 (positive)
  • Home State: Massachusetts (not Vermont)
  • Preferred Attraction: Thrill
  • Referral Source: Social Media
  • Dining Plan: Upgraded

Prediction Results:

  • Predicted class (1 = Renew): 1
  • Probability of renewal: 100.00%

Interpretation: The Random Forest model strongly predicts that this passholder will renew their pass, assigning an extremely high confidence (100%) to this prediction. This result aligns with the passholder’s positive attributes, such as high spending, excellent feedback, and engagement with premium offerings at the park.

Z. For this question, no Python code is required -- just use a Markdown cell to answer. Write a 3-5 sentence paragraph that speculates about how Lobster Land might be able to use the results that you’ve obtained from these models (LR and/or RF) for a practical purpose.

Practical Application of the Model Results for Lobster Land

Based on our analysis, Lobster Land can effectively use insights from the predictive models to boost passholder retention. For example, our Random Forest model identified Total_Spend_2024 and Feedback_Score as the two most important factors influencing renewal decisions. Given that a customer spending around $275 with a high feedback score of 4.2 had a nearly 100% probability of renewal, Lobster Land could strategically incentivize moderate-spending customers (for example, those spending around the overall mean of $250) to slightly increase their expenditure, significantly improving their likelihood to renew. Conversely, customers from Vermont (Home_State_VT) showed notably lower renewal rates, with a negative coefficient of approximately -1.68 in the logistic regression model, indicating a targeted regional campaign or special discounts could address this specific geographic disadvantage. Furthermore, our confusion matrix revealed a sensitivity (recall) of 100% but specificity of 0%, indicating the current model tends to overlook non-renewers; thus, introducing more balanced incentives or alternative outreach methods specifically for "at-risk" customers could enhance overall retention. By grounding these strategies directly in our quantified model outcomes, Lobster Land can confidently tailor and prioritize initiatives for maximum effectiveness and measurable impact.

Part III: Using Tableau to Build a Dashboard (1 point):

In [2]:
from IPython.display import Image, display

display(Image(filename='dashboard.png'))
[Image: dashboard.png, the Lobster Land summer 2024 Tableau dashboard]

Dashboard Description

This dashboard presents a multi-faceted view of visitor activity and operational patterns at Lobster Land, based on data from summer 2024. It consists of four distinct visualizations, each created with intention and assembled into a single dashboard to provide a comprehensive snapshot.

The first chart, "Daily Total Visitors Over Time", is a line graph built by placing the Date field on the Columns shelf and Total_Visitors on the Rows shelf. This allowed us to visualize trends across time and quickly identify spikes or drops in attendance. The second chart, "Average Arcade Revenue by Day of Week", was created as a bar chart by placing Day_of_Week on Columns and aggregating Arcade_Revenue (as Average) on Rows. This chart helps compare how weekdays and weekends influence arcade earnings.

For deeper operational insights, we built a scatter plot titled "Relationship between Total Labor Hours and Total Purchases" by placing Total_Purchases on the X-axis, Total_Labor_Hours on the Y-axis, and coloring points by Weather_Type to examine environmental effects on business efficiency. Lastly, the "Customer Complaints by Weather Type" visualization was created using the Treemap/Bubble chart format by placing Weather_Type on Detail and Customer_Complaints (aggregated as Sum) on Size and Color. This highlights which weather types are most associated with guest dissatisfaction.

All four visualizations were then brought together in a single dashboard using Tableau's drag-and-drop dashboard editor. Each plot was sized and positioned carefully for clarity, and individual titles were added to make the purpose of each visualization clear. The dashboard layout was designed to balance trends over time (top half) with cross-variable relationships (bottom half), providing a strategic and readable format for decision-making.