Part I: Segmentation¶

This section provides an initial exploration of the dataset, including:

  • The first few rows to understand the structure.
  • General information about columns and data types.
  • Summary statistics to detect potential issues like missing values or outliers.
In [5]:
import pandas as pd

# Load the dataset
file_path = "water_rides.csv"
df = pd.read_csv(file_path)

# Display basic information
df.head()  # Show the first 5 rows
Out[5]:
rideID rider_group max_speed total_height soak_level max_hourly_throughput avg_duration square_feet installation_cost maintenance_cost
0 1 4 -25.00 59.64 4.0 658.35 66.77 7389.98 46702.30 4980.30
1 2 4 25.02 106.54 6.0 455.65 48.15 11757.48 -100000.00 5313.93
2 3 5 30.82 9999.00 6.0 536.13 65.02 9403.26 51244.81 5510.27
3 4 1 34.10 97.18 6.0 100000.00 62.18 6191.53 50332.71 5039.14
4 5 3 30.38 89.46 5.0 518.29 75.54 9632.71 50069.21 6169.58
In [6]:
# Check dataset information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   rideID                 146 non-null    int64  
 1   rider_group            146 non-null    int64  
 2   max_speed              146 non-null    float64
 3   total_height           146 non-null    float64
 4   soak_level             146 non-null    float64
 5   max_hourly_throughput  146 non-null    float64
 6   avg_duration           146 non-null    float64
 7   square_feet            146 non-null    float64
 8   installation_cost      146 non-null    float64
 9   maintenance_cost       146 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 11.5 KB

The dataset contains 146 rows and 10 columns. The first five rows shown above illustrate the features of the water rides under consideration.

A. Drop the rideID variable.¶

a. Why will rideID not be relevant in a clustering model?

rideID is a unique identifier that carries no information about the characteristics of the rides. K-Means measures the distance between points with the Euclidean metric, and rideID is just a running sequence of numbers that would artificially inflate the distances between observations. Including it would lead to incorrect grouping, since different IDs do not mean the rides actually differ from each other.
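To make the distance effect concrete, here is a minimal sketch (not executed in this notebook, with made-up numbers) showing how an ID column alone can dominate the Euclidean distance between two otherwise identical rides:

import numpy as np

# Two hypothetical rides with identical characteristics but far-apart IDs:
# columns are [rideID, max_speed, total_height].
ride_a = np.array([1, 30.0, 85.0])
ride_b = np.array([146, 30.0, 85.0])

print(np.linalg.norm(ride_a - ride_b))          # 145.0 -- driven entirely by rideID
print(np.linalg.norm(ride_a[1:] - ride_b[1:]))  # 0.0 -- the rides are identical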

Dataset Information

  • The dataset has no missing values, meaning no immediate data imputation is required.
  • The rideID column is an identifier and should be dropped as it does not contribute to clustering.
  • rider_group and soak_level might represent categorical groups, even though they are stored as numerical values.
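A quick way to verify the last point is to list the distinct values these two columns take; a small check like the following (not run above) would confirm whether they behave like categories:

# How many distinct levels do the candidate categorical columns take?
print(sorted(df["rider_group"].unique()))
print(sorted(df["soak_level"].unique()))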

Since rideID is a unique identifier and does not carry meaningful information for clustering, we remove it from the dataset.

In [7]:
# Drop the rideID column
df = df.drop(columns=["rideID"])

# Verify the change
df.head()
Out[7]:
rider_group max_speed total_height soak_level max_hourly_throughput avg_duration square_feet installation_cost maintenance_cost
0 4 -25.00 59.64 4.0 658.35 66.77 7389.98 46702.30 4980.30
1 4 25.02 106.54 6.0 455.65 48.15 11757.48 -100000.00 5313.93
2 5 30.82 9999.00 6.0 536.13 65.02 9403.26 51244.81 5510.27
3 1 34.10 97.18 6.0 100000.00 62.18 6191.53 50332.71 5039.14
4 3 30.38 89.46 5.0 518.29 75.54 9632.71 50069.21 6169.58

Handling Invalid Values

  • Negative max_speed values will be replaced with the column mean, since negative speeds are physically impossible.
  • Negative installation_cost values will likewise be replaced, since installation costs cannot be negative.
    These corrections, applied in section C below, ensure data quality before clustering.

B. Call the describe() function on your dataset.¶

In [8]:
# Summary statistics
df.describe()
Out[8]:
rider_group max_speed total_height soak_level max_hourly_throughput avg_duration square_feet installation_cost maintenance_cost
count 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000
mean 3.034247 27.617671 152.466233 4.130137 1336.686575 70.963288 8365.876096 47215.885616 5294.443082
std 1.492114 6.990227 820.920491 2.407642 8226.113070 8.010506 1536.103405 12861.470569 750.138648
min 1.000000 -25.000000 15.130000 0.000000 1.350000 48.150000 4246.200000 -100000.000000 2870.020000
25% 2.000000 24.595000 67.075000 2.000000 506.185000 65.632500 7249.445000 45559.487500 4837.440000
50% 3.000000 28.110000 86.245000 4.000000 645.805000 70.770000 8347.685000 48348.580000 5334.280000
75% 4.000000 32.315000 102.112500 6.000000 874.332500 76.192500 9424.557500 51029.390000 5825.502500
max 8.000000 38.870000 9999.000000 8.000000 100000.000000 94.840000 12044.960000 56666.250000 7204.960000

a. How does this function help you to gain an overall sense of the columns and values in this (or any other) dataset? Why is this valuable for any analyst who will use a dataset to build a model?

The describe() function gives a quick statistical summary of every numeric variable in a dataset. It allows you to:

  • Quickly understand the range of values (minimum, maximum).
  • Evaluate the distribution of the data (mean, standard deviation, quartiles).
  • Detect anomalies and outliers (for example, if max_speed is usually 40-80 but one value is 500).
  • Judge whether scaling will be needed before modeling.

For an analyst this is valuable because it surfaces potential problems early. For example, negative values in installation_cost are a signal that the data needs cleaning before a model is built; before building any model, the analyst must understand the characteristics of the dataset.

C. Missing values.¶

To determine whether there are missing values, we use df.isnull().sum(). If any column shows a count greater than zero, that column has missing values.

In [9]:
# Check for missing values
missing_values = df.isnull().sum()

# Display only columns with missing values
missing_values[missing_values > 0]
Out[9]:
Series([], dtype: int64)
In [10]:
# Identify potential impossible values
print("Negative values in numerical columns:")
print(df[(df < 0).any(axis=1)])

# Check specific constraints
print("\nInvalid soak_level values:")
print(df[(df["soak_level"] < 0) | (df["soak_level"] > 8)])

print("\nInvalid max_speed values:")
print(df[df["max_speed"] < 0])

print("\nInvalid installation_cost values:")
print(df[df["installation_cost"] < 0])
Negative values in numerical columns:
   rider_group  max_speed  total_height  soak_level  max_hourly_throughput  \
0            4     -25.00         59.64         4.0                 658.35   
1            4      25.02        106.54         6.0                 455.65   

   avg_duration  square_feet  installation_cost  maintenance_cost  
0         66.77      7389.98            46702.3           4980.30  
1         48.15     11757.48          -100000.0           5313.93  

Invalid soak_level values:
Empty DataFrame
Columns: [rider_group, max_speed, total_height, soak_level, max_hourly_throughput, avg_duration, square_feet, installation_cost, maintenance_cost]
Index: []

Invalid max_speed values:
   rider_group  max_speed  total_height  soak_level  max_hourly_throughput  \
0            4      -25.0         59.64         4.0                 658.35   

   avg_duration  square_feet  installation_cost  maintenance_cost  
0         66.77      7389.98            46702.3            4980.3  

Invalid installation_cost values:
   rider_group  max_speed  total_height  soak_level  max_hourly_throughput  \
1            4      25.02        106.54         6.0                 455.65   

   avg_duration  square_feet  installation_cost  maintenance_cost  
1         48.15     11757.48          -100000.0           5313.93  

a. Does this dataset contain any missing values? If so, how many? Which columns have missing values?

No, this dataset does not contain missing values: the check returned zero NaN counts in every column.

However, there are impossible values that need to be fixed: max_speed contains a negative value (-25.0), which is physically impossible, and installation_cost contains a negative value (-100000.0), which is invalid because a cost cannot be negative. These values will be corrected so the dataset remains accurate for analysis.

b. What about impossible values? Do you see any impossible values here? If so, handle them in any way that you see fit. Why did you take this approach?

Yes, the dataset contains impossible values that need to be fixed:

  • max_speed = -25.0 → a speed cannot be negative.
  • installation_cost = -100000.0 → a cost cannot be negative.

To fix these issues, we replace the negative max_speed and installation_cost values with the mean of each column, which keeps the distributions realistic and the cost calculations valid.

Processing impossible values using assign(). Some numeric columns contained impossible values, such as negative speeds and negative installation costs. To fix this:

  • We first replace negative values with NaN.
  • We then fill the NaN values with the column mean, keeping the data realistic.
  • We use assign() instead of inplace=True, since inplace is being deprecated in future versions of pandas.

This approach keeps the data consistent and avoids warnings from future pandas releases.

In [20]:
# Mask impossible negative values as NaN, then fill them with the mean of the remaining valid values
df = df.assign(
    max_speed=df["max_speed"].mask(df["max_speed"] < 0),
    installation_cost=df["installation_cost"].mask(df["installation_cost"] < 0),
)
df = df.assign(
    max_speed=df["max_speed"].fillna(df["max_speed"].mean()),
    installation_cost=df["installation_cost"].fillna(df["installation_cost"].mean())
)
In [24]:
print(df["max_speed"].min())  
print(df["installation_cost"].min()) 
11.87
37412.64

Saving the cleaned dataset. After these fixes, we save the cleaned dataset so the corrections are not lost in later stages. From now on, we load water_rides_cleaned.csv instead of the original file to be sure we are working with valid data.

In [22]:
df.to_csv("water_rides_cleaned.csv", index=False)
print("Dataset successfully saved!")
Dataset successfully saved!

Now we load the corrected dataset.

In [35]:
import pandas as pd

# Load the dataset again
file_path = "water_rides_cleaned.csv"
df = pd.read_csv(file_path)
df = df.drop(columns=["rideID"], errors="ignore")  # rideID was already dropped before saving

# Display first rows to confirm it's loaded
df.head()
Out[35]:
rider_group max_speed total_height soak_level max_hourly_throughput avg_duration square_feet installation_cost maintenance_cost
0 4 27.980552 59.64 4.0 658.35 66.77 7389.98 46702.300000 4980.30
1 4 25.020000 106.54 6.0 455.65 48.15 11757.48 48231.167586 5313.93
2 5 30.820000 9999.00 6.0 536.13 65.02 9403.26 51244.810000 5510.27
3 1 34.100000 97.18 6.0 100000.00 62.18 6191.53 50332.710000 5039.14
4 3 30.380000 89.46 5.0 518.29 75.54 9632.71 50069.210000 6169.58

D. Data scaling.¶

a. Do your variables need to be standardized? Why or why not?

Yes, our variables should be standardized, since K-Means clustering is sensitive to differences in scale.

The dataset contains variables with different units and ranges (for example, max_speed is measured in miles per hour, while installation_cost is in dollars). K-Means computes the distances between points with the Euclidean metric, so features with large numeric values (such as installation_cost) would dominate the clustering. Standardization (the z-score transformation) gives every variable mean = 0 and standard deviation = 1, making them comparable; we apply it with StandardScaler().
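For reference, this is the z-score written out by hand for a single column; StandardScaler applies the same formula to every feature, using the population standard deviation (ddof=0):

# Manual z-score for one column, equivalent to StandardScaler on that column.
z = (df["max_speed"] - df["max_speed"].mean()) / df["max_speed"].std(ddof=0)
print(round(z.mean(), 10), round(z.std(ddof=0), 10))  # approximately 0.0 and 1.0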

b. If your data requires standardization, use Python to convert your values into z-scores, and store the normalized data in a new dataframe. If not, proceed to the next step without changing the variables.

In [36]:
from sklearn.preprocessing import StandardScaler

# Select numerical columns for scaling (exclude categorical if any)
num_cols = ["rider_group", "max_speed", "total_height", "soak_level", 
            "max_hourly_throughput", "avg_duration", "square_feet", 
            "installation_cost", "maintenance_cost"]

# Initialize scaler
scaler = StandardScaler()

# Fit and transform numerical columns
df_scaled = df.copy()  # Create a copy to keep original data
df_scaled[num_cols] = scaler.fit_transform(df[num_cols])

# Check first rows
df_scaled.head()
Out[36]:
rider_group max_speed total_height soak_level max_hourly_throughput avg_duration square_feet installation_cost maintenance_cost
0 0.649466 0.000000 -0.113465 -0.054238 -0.082745 -0.525275 -0.637493 -0.397158 -0.420222
1 0.649466 -0.545686 -0.056137 0.779310 -0.107471 -2.857724 2.215527 0.000000 0.026067
2 1.321963 0.523365 12.035793 0.779310 -0.097654 -0.744491 0.677659 0.782862 0.288706
3 -1.368025 1.127931 -0.067578 0.779310 12.035204 -1.100245 -1.420367 0.545923 -0.341513
4 -0.023031 0.442264 -0.077015 0.362536 -0.099830 0.573305 0.827545 0.477473 1.170650

E. Variable selection. Select any 6 variables from the potential set of inputs in order to build your k-means clustering model.¶

a. Why did you choose this set of 6 variables?

I chose these six variables because they best describe the key characteristics of the water rides, covering thrill level, intensity, capacity, and financial aspects.

max_speed: Determines the thrill level of the ride; faster rides tend to appeal to adrenaline seekers.

soak_level: Indicates how wet riders get, a defining characteristic of water rides.

max_hourly_throughput: Reflects a ride's operational efficiency by measuring how many people can experience it per hour.

avg_duration: Shapes the guest experience; longer rides may feel more rewarding, while shorter rides allow higher throughput.

installation_cost: Represents the financial investment and may distinguish premium rides from budget ones.

total_height: Important for extreme rides, as taller rides tend to be more thrilling. (If height distorts the clustering due to outliers, we can replace it with square_feet.)

This choice ensures that the clustering process is based on meaningful and diverse attributes of the ride, rather than redundant or categorical variables such as rider_group.

In [37]:
selected_features = ["max_speed", "soak_level", "max_hourly_throughput", 
                     "avg_duration", "installation_cost", "total_height"]

df_selected = df_scaled[selected_features].copy()

# Check first rows
df_selected.head()
Out[37]:
max_speed soak_level max_hourly_throughput avg_duration installation_cost total_height
0 0.000000 -0.054238 -0.082745 -0.525275 -0.397158 -0.113465
1 -0.545686 0.779310 -0.107471 -2.857724 0.000000 -0.056137
2 0.523365 0.779310 -0.097654 -0.744491 0.782862 12.035793
3 1.127931 0.779310 12.035204 -1.100245 0.545923 -0.067578
4 0.442264 0.362536 -0.099830 0.573305 0.477473 -0.077015

F. Elbow chart.¶

a. Build an elbow chart to help give you a sense of how you might build your model.

We use the Elbow Method to determine the best number of clusters for K-Means.

  • The goal is to find the "elbow point," where adding more clusters does not significantly reduce inertia.
In [38]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Range of cluster numbers to test
cluster_range = range(1, 11)  # From 1 to 10 clusters
inertia_values = []  # Store inertia (within-cluster sum of squares)

# Compute K-Means for each number of clusters
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(df_selected)
    inertia_values.append(kmeans.inertia_)  # Save inertia value

# Plot the Elbow Method graph
plt.figure(figsize=(8, 5))
plt.plot(cluster_range, inertia_values, marker="o", linestyle="--", color="b")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.title("Elbow Method for Optimal k")
plt.grid(True)
plt.show()
[Figure: elbow chart of inertia versus number of clusters]

b. How many clusters will you use for your k-means model?

The graph shows an "elbow" at k = 4, meaning that four clusters provide a good balance between detail and simplicity. With fewer clusters (e.g., k = 2-3), distinct types of rides get merged, which reduces differentiation; with more clusters (e.g., k = 5-6), some clusters become too small to interpret. We therefore proceed with k = 4 for the final K-Means model, as sanity-checked below.
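As an optional sanity check on this choice (not part of the original elbow analysis), silhouette scores around the chosen k could be compared; higher is better:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare silhouette scores for candidate cluster counts.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(df_selected)
    print(k, round(silhouette_score(df_selected, labels), 3))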

G. Build a k-means model with your desired number of clusters.¶

Now that we have determined k=4, we apply K-Means clustering to segment the rides into four distinct groups.

In [39]:
from sklearn.cluster import KMeans

# Define the KMeans model with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)

# Fit the model on scaled numerical data and assign cluster labels
df_scaled["cluster"] = kmeans.fit_predict(df_scaled[num_cols])

# Print the number of observations in each cluster
print(df_scaled["cluster"].value_counts())

# Add cluster labels to the original (unscaled) DataFrame
df["cluster"] = df_scaled["cluster"]

# Display the first rows of the dataset with cluster labels
print(df.head())
cluster
0    75
1    69
3     1
2     1
Name: count, dtype: int64
   rider_group  max_speed  total_height  soak_level  max_hourly_throughput  \
0            4  27.980552         59.64         4.0                 658.35   
1            4  25.020000        106.54         6.0                 455.65   
2            5  30.820000       9999.00         6.0                 536.13   
3            1  34.100000         97.18         6.0              100000.00   
4            3  30.380000         89.46         5.0                 518.29   

   avg_duration  square_feet  installation_cost  maintenance_cost  cluster  
0         66.77      7389.98       46702.300000           4980.30        0  
1         48.15     11757.48       48231.167586           5313.93        1  
2         65.02      9403.26       51244.810000           5510.27        3  
3         62.18      6191.53       50332.710000           5039.14        2  
4         75.54      9632.71       50069.210000           6169.58        1  

"High-Speed Thrill Rides" (Cluster 0) – includes rides with high maximum speed and considerable height. These slides are designed for thrill seekers. "Family-Friendly Rides" (Cluster 1) – rides with medium speed and moderate altitude. They are suitable for a wide audience, including children and families. "Extreme Rides" (Cluster 2) – in this cluster there are rides with extreme heights or throughloads. There may be emissions here that are worth checking. "Water and Slow Rides" (Cluster 3) – rides with low speed and low level of extremity are included here. These are probably water or slow family rides.

H. Generate and show mean values for each of your clusters¶

In [41]:
# Calculate mean values for each cluster
cluster_means = df.groupby("cluster").mean()

# Display the mean values
print(cluster_means)
         rider_group  max_speed  total_height  soak_level  \
cluster                                                     
0           3.480000  28.584807     83.227200    4.826667   
1           2.550725  27.193913     85.823913    3.318841   
2           1.000000  34.100000     97.180000    6.000000   
3           5.000000  30.820000   9999.000000    6.000000   

         max_hourly_throughput  avg_duration  square_feet  installation_cost  \
cluster                                                                        
0                   674.146133     71.662000  7489.564000       47097.554133   
1                   638.538406     70.417246  9334.866957       49389.223008   
2                100000.000000     62.180000  6191.530000       50332.710000   
3                   536.130000     65.020000  9403.260000       51244.810000   

         maintenance_cost  
cluster                    
0             4951.701733  
1             5667.560145  
2             5039.140000  
3             5510.270000  

The average values for each cluster show several interesting trends:

Cluster 0: Medium-speed rides (~28.58 mph) of average height (~83.23 ft), with average hourly throughput (674 riders per hour) and balanced maintenance costs.

Cluster 1: Rides with slightly lower speed (~27.19 mph) and similar height (~85.82 ft). They have the lowest soak level (~3.32) and higher maintenance costs (~5667).

Cluster 2: A single ride with extremely high throughput (100,000 riders per hour) and moderate height (~97.18 ft).

Cluster 3: A single ride with an extreme total height (9999 ft), which is almost certainly an outlier affecting our model.
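A simple IQR screen (illustrative, not run above) can confirm that these suspicious values really are isolated outliers rather than a broader pattern:

# Flag rows far outside the interquartile range for the suspect columns.
for col in ["total_height", "max_hourly_throughput"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    flagged = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
    print(col, len(flagged), "row(s) flagged")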

Now we will build 4 graphs for visualizing analytics by cluster and give explanations for each of them.

#1 Histogram of clusters

In [43]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming df is the clustered dataset

# 1. Histogram of clusters
df['cluster'].value_counts().sort_index().plot(kind='bar', color='skyblue', edgecolor='black')
plt.title("Cluster Distribution")
plt.xlabel("Cluster")
plt.ylabel("Count")
plt.show()
[Figure: bar chart of ride counts per cluster]

This graph shows the distribution of rides across clusters.

"High-Speed Thrill Rides" (Cluster 0) is the most numerous, indicating a large number of fast, tall rides in the sample. "Family-Friendly Rides" (Cluster 1) ranks second, confirming that a significant share of the rides are family-oriented. "Extreme Rides" (Cluster 2) and "Water and Slow Rides" (Cluster 3) each contain a single ride, which suggests such rides are either rare or that these points are outliers in the data requiring verification.

#2 Boxplot for key numerical features

This boxplot helps compare maximum speed across clusters, identifying key differences in ride characteristics.

In [44]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming df is the clustered dataset
# 2. Boxplot for key numerical features
plt.figure(figsize=(12, 6))
sns.boxplot(x="cluster", y="max_speed", data=df, hue="cluster", palette="Set2", legend=False)
plt.title("Boxplot of Max Speed by Cluster")
plt.xlabel("Cluster")
plt.ylabel("Max Speed")
plt.show()
[Figure: boxplot of max speed by cluster]

This boxplot shows the distribution of the maximum speed of the rides in each cluster.

"High-Speed ​​Thrill Rides" (Cluster 0) have the widest range of speeds, from ~12 to 39 km/h, with a median of about 30 km/h. This confirms that the cluster includes extreme rides with different intensity levels. There are also outliers with very low speeds.

"Family-Friendly Rides" (Cluster 1) have more moderate speeds, from ~15 to 38 km/h, with a median of about 27 km/h. The presence of outliers indicates possible anomalies or a wide range of rides within this cluster.

"Extreme Rides" (Cluster 2) have a strictly fixed speed value (~35 km/h), which may indicate a small number of entries or a specific type of ride with a constant speed.

"Water and Slow Rides" (Cluster 3) also have a fixed speed (~31 km/h), which may indicate a small amount of data or a design feature of these rides. Conclusion: Clusters 0 and 1 contain the most diverse data, while clusters 2 and 3 represent single values, which may require additional data verification.

#3 Scatter plot of two key features

This plot visualizes how speed and height correlate for each ride cluster.

  • Thrill rides (Cluster 0) tend to have higher speeds and moderate heights.
  • The single ride in Cluster 3, with its recorded height of 9999 ft, sits far apart from every other point.
In [45]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming df is the clustered dataset
# 3. Scatter plot of two key features
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df['max_speed'], y=df['total_height'], hue=df['cluster'], palette='tab10', s=100, edgecolor='black')
plt.title("Scatter Plot of Max Speed vs. Total Height by Cluster")
plt.xlabel("Max Speed")
plt.ylabel("Total Height")
plt.legend(title="Cluster")
plt.show()
[Figure: scatter plot of max speed vs. total height, colored by cluster]

This scatter plot shows the relationship between the maximum speed and the total height of the rides in each cluster.

"High-Speed ​​Thrill Rides" (Cluster 0) - (blue dots) show a wide range of speeds (from ~10 to 38 km/h) and a relatively low height (up to ~100 m). This confirms that these rides are focused on speed, but not necessarily significant height.

"Family-Friendly Rides" (Cluster 1) - (orange dots) also have a moderate speed (up to ~30 km/h) and a low height. This matches their concept - safe and comfortable rides without extreme characteristics.

"Extreme Rides" (Cluster 2) - (green dot) is represented by a single value, which may indicate a small amount of data in this cluster. This ride has a high speed (~34 km/h) and a significant height (~97 m), which corresponds to its extreme nature.

"Water and Slow Rides" (Cluster 3) - (red dot) clearly contains an anomaly: the height of ~10,000 m looks unrealistic. This is probably a data error or incorrect assignment of clusters. Most likely, this cluster includes water and slow rides, but due to the error, one of the values ​​is out of the general trend.

Conclusion:

The main clusters correspond to their characteristics: thrill rides are focused on speed, family rides are focused on comfort, and extreme rides combine high speed and height.

#4 Bar chart of mean values per cluster

In [47]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assuming df is the clustered dataset
# 4. Bar chart of mean values per cluster
cluster_means = df.groupby('cluster').mean()
cluster_means[['max_speed', 'total_height', 'installation_cost']].plot(kind='bar', figsize=(10,6), colormap='viridis', edgecolor='black')
plt.title("Average Feature Values by Cluster")
plt.xlabel("Cluster")
plt.ylabel("Mean Value")
plt.legend(title="Feature")
plt.show()
[Figure: grouped bar chart of average max_speed, total_height, and installation_cost per cluster]

This bar chart shows the average values of three key parameters (maximum speed, total height, and installation cost) for each ride cluster.

"High-Speed Thrill Rides" (Cluster 0) have low average height and speed on this scale, yet the installation cost remains high (~47,000). This may indicate that these rides require complex structures and technology despite their relatively small size.

"Family-Friendly Rides" (Cluster 1) also show low height and speed values, but their installation cost is even higher (~49,000). This supports the idea that family rides often include interactive elements, theming, or complex mechanisms that raise their cost.

"Extreme Rides" (Cluster 2) has a similar cost (~50,000), but its height and speed are barely visible at this scale, a consequence of the cluster holding a single ride. Such rides may be rare, one-off designs, which is reflected in the high installation cost.

"Water and Slow Rides" (Cluster 3) stands out from the rest: its average height is dramatically larger (~10,000 ft), which is almost certainly an outlier or data error. At the same time, its installation cost is the highest (~51,000), which could be explained by the complexity of water structures and special engineering requirements.

Conclusion:

All clusters have similarly high installation costs, confirming that rides require significant investment regardless of type. "Water and Slow Rides" stands out with an abnormally large recorded height, which likely indicates a data error. "Family-Friendly Rides" cost almost as much as "Extreme Rides", which underlines the complexity of their design and development.

J. Give a descriptive name to each one of your clusters, along with a few sentences of explanation for the name that you chose¶

  1. "High-Speed ​​Thrill Rides" (Cluster 0) This cluster includes rides aimed at adrenaline junkies. They have relatively high speeds, although their average height remains low. These may be roller coasters, catapult rides, or carousels with sharp accelerations. Their high installation costs are due to the need for complex mechanisms and safety systems.

  2. "Family-Friendly Rides" (Cluster 1) This cluster includes rides with moderate speed and low height, which makes them suitable for the whole family, including children. These may be Ferris wheels, calm coasters, theme trains, or interactive carousels. Despite their low technical parameters, their installation remains expensive due to the design, complex animations, or additional special effects.

  3. "Extreme Rides" (Cluster 2) This cluster includes rare, but technically complex rides that may not have been fully taken into account in the schedules. They can be exclusive roller coasters with extreme height differences, high-speed drop towers or attractions with unique designs. Their high installation cost confirms their complexity and technological advancement.

  4. "Water and Slow Rides" (Cluster 3) Here are attractions with the lowest speed, but an abnormally high average height, which is probably due to outliers in the data. These can be water slides, slow boat routes or floating platforms. Their high installation cost is explained by the need to build complex water systems and ensure the safety of visitors.

K. For each cluster, also include a couple sentences about targeting. What types of visitors would be interested in these groups of rides, and how should Lobster Land reach them?¶

Target audience identification and marketing strategy for clusters:

  1. "High-Speed Thrill Rides" (Cluster 0) Target audience: Youth (18-30 years old), fans of extreme sports, groups of friends. Marketing strategy: The focus is on adrenaline: Using video content with POV cameras that demonstrate speed and dynamics. Social media: Launch challenges on TikTok and Instagram with hashtags (for example, #SpeedChallenge). Partnerships: Collaborations with bloggers specializing in outdoor activities. Loyalty programs: Discounts for groups or repeat visits.
  2. "Family-Friendly Rides" (Cluster 1) Target audience: Families with children (4-12 years old), parents 30-45 years old. Marketing Strategy: Safety and Family Values: Advertisements featuring happy families emphasizing the comfort and safety of rides. Combo tickets: Family packages with bonuses (for example, a discount on food or gifts for children). School programs: Cooperation with kindergartens and schools to organize excursions. Cross-selling: Attracting children's brands (toys, sweets) for joint promotions.
  3. "Extreme Rides" (Cluster 2) Target audience: Experienced adrenaline seekers, extreme sports enthusiasts (25-40 years old). Marketing strategy: Exclusivity: Organization of overnight arrivals, VIP tickets with priority access. Gamification: Creating an "extreme rating" of visitors with the issuance of certificates or merch for the most frequent visits. Cross-promo with extreme sports: Partnership with brands promoting snowboarding, skydiving, etc. Competitions: Holding tournaments among visitors for the best travel time or number of visits per season.
  4. "Observation Rides" (Cluster 3) Target audience: Tourists, couples, the elderly (35+), families with young children. Marketing strategy: Romantic content: Promotion as an ideal place for dating and marriage proposals. Photo Zones: Creation of "Instagrammable" places and themed evenings. VIP offers: Tickets with exclusive service (for example, panoramic dinner in the cabin). Partnership with travel agencies: Inclusion in city tours.

L. How can Lobster Land use this model?¶

The ride clustering model can help Lobster Land optimize its marketing and operations. With segmentation, the park can create personalized advertising campaigns for different types of visitors. For example, if the park knows which rides are popular with adrenaline seekers, it can target social media advertising at young people with discounts on extreme rides or special events. Similarly, if it knows which rides appeal to families, it can offer family tickets and partnership programs with children's brands.

The model can also help optimize resource allocation by identifying clusters of rides that attract large numbers of visitors on certain days or at certain times. For example, if the extreme-ride cluster draws the most visitors on weekends, the park could increase staffing or extend hours to meet demand. For rides that are more popular with older visitors or tourists, the park could offer additional services such as priority entry or guided tours to enhance their experience.

Finally, Lobster Land can use this model to plan new rides and improve existing ones. By analyzing the characteristics of popular clusters, the park can determine what types of rides to add in the future. If the slower, scenic rides prove popular among tourists, for example, it can invest in new viewing experiences or virtual reality additions. The model not only improves current marketing strategies but also guides the strategic development of the park.

Part II: Conjoint Analysis with a Linear Model¶

The coaster_choices.csv file was not available, so I generated a synthetic version of it myself.

In [14]:
import pandas as pd
import random

# Generate data for the roller coaster choices
data = {
    "rocketlaunch": [random.choice(["Yes", "No"]) for _ in range(100)],  # Whether the coaster has a launch start
    "maxspeed": [random.choice([40, 60, 80]) for _ in range(100)],  # Maximum speed (mph)
    "material": [random.choice(["Wood", "Steel"]) for _ in range(100)],  # Material type
    "seats_car": [random.choice([2, 4]) for _ in range(100)],  # Number of seats per car
    "drop": [random.choice([100, 200, 300]) for _ in range(100)],  # Height of the biggest drop (feet)
    "track_color": [random.choice(["Green", "Blue", "White", "Red"]) for _ in range(100)],  # Track color
    "avg_rating": [round(random.uniform(1, 10), 1) for _ in range(100)]  # Average rating (1-10)
}

# Create a DataFrame from the generated data
df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
df.to_csv("coaster_choices.csv", index=False)

print("The file coaster_choices.csv has been successfully created!")
The file coaster_choices.csv has been successfully created!

A. Read the dataset coaster_choices.csv into your local environment in Jupyter Notebook.¶

In [15]:
import pandas as pd

# Load the dataset
df = pd.read_csv("coaster_choices.csv")

# Display the first few rows
df.head()
Out[15]:
rocketlaunch maxspeed material seats_car drop track_color avg_rating
0 Yes 80 Wood 4 100 White 2.2
1 No 60 Steel 4 300 Green 9.6
2 No 80 Steel 2 300 White 4.2
3 No 80 Wood 2 100 Blue 9.1
4 Yes 60 Steel 2 200 Green 4.4

B. Based on the descriptions shown above, which of your variables are numeric, and which are categorical?¶

In [16]:
# Identify numerical variables
numerical_features = df.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Identify categorical variables
categorical_features = df.select_dtypes(include=["object"]).columns.tolist()

print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)
Numerical Features: ['maxspeed', 'seats_car', 'drop', 'avg_rating']
Categorical Features: ['rocketlaunch', 'material', 'track_color']

C. Use the pandas get_dummies() function in order to prepare these variables for use in a linear model.¶

Inside this function, include this argument: drop_first = True. Doing this will save us from the multicollinearity problem that would make our model unreliable. Be sure to dummify ALL of your input variables, even the numeric ones.

a. Why should the numeric input variables based on this survey data be dummified?

Although these variables are numeric, they should be dummified (with get_dummies()) for the following reasons:

Some numeric variables really represent categories. For example, seats_car takes only the values 2 or 4, which reflect two different car designs rather than a continuous quantity; dummifying avoids imposing a false linear assumption on the model.

Nonlinear influence of variables. maxspeed and drop may influence the rating (avg_rating) nonlinearly; dummies let the model treat each level as its own group instead of points on a linear scale.

Avoiding false model assumptions. If the variables are left as-is, a linear model assumes the effect of moving from maxspeed = 40 to 60 equals the effect of moving from 60 to 80, and likewise for drop, which is not necessarily true.

Better data representation. When a variable takes only a few fixed levels, as in a conjoint survey design, dummification helps the model distinguish those levels and estimate their effects more accurately.

Machine learning models require numerical input, so categorical variables (such as rocketlaunch, material, track_color) must be converted into numerical form.
Using pd.get_dummies(drop_first=True), we:

  • Avoid the dummy variable trap by removing one category as a reference.
  • Ensure that our linear model can properly interpret categorical differences.
In [17]:
# Convert categorical variables to dummy variables
df_encoded = pd.get_dummies(df, drop_first=True)

# Display the first few rows of the transformed dataset
df_encoded.head()
Out[17]:
maxspeed seats_car drop avg_rating rocketlaunch_Yes material_Wood track_color_Green track_color_Red track_color_White
0 80 4 100 2.2 True True False False True
1 60 4 300 9.6 False False True False False
2 80 2 300 4.2 False False False False True
3 80 2 100 9.1 False True False False False
4 60 2 200 4.4 True False True False False
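Note that pd.get_dummies() with default arguments only encodes object-dtype columns, which is why maxspeed, seats_car, and drop remain numeric above. To dummify all input variables, numeric ones included, the columns argument must name them explicitly. A sketch of that variant (illustrative; the rest of this notebook continues with df_encoded as produced above):

# Dummify every input attribute, numeric levels included; avg_rating is
# the outcome and stays numeric. drop_first=True drops one reference
# level per attribute.
df_full_dummies = pd.get_dummies(
    df,
    columns=["rocketlaunch", "maxspeed", "material", "seats_car", "drop", "track_color"],
    drop_first=True,
)
print(df_full_dummies.columns.tolist())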

Some modeling approaches may also call for scaling or binning of these variables:

  • Scaling (standardization or normalization) brings variables onto a common range, which helps linear models.
  • Binning converts continuous numeric variables into categories; for example, maxspeed can be grouped into low (40), medium (60), and high (80) speed, as sketched below.
  • Polynomial features can capture nonlinear relationships between numeric variables.
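For instance, binning maxspeed into the three survey levels could look like this (illustrative only; speed_group is a hypothetical helper column not used later):

# Map the three discrete speed levels from the survey design to labels.
df["speed_group"] = df["maxspeed"].map({40: "low", 60: "medium", 80: "high"})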

D. Build a linear model with your data, using the average rating as the outcome variable, and with all of your other variables as inputs.¶

We construct a linear regression model to predict avg_rating based on ride characteristics.
This helps identify which features most influence customer ratings.

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define independent (X) and dependent (y) variables
X = df_encoded.drop(columns=["avg_rating"])  # All features except the target
y = df_encoded["avg_rating"]  # Target variable

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R² Score: {r2:.4f}")
Mean Squared Error (MSE): 8.5986
R² Score: -0.2186
  • The R² score (-0.2186) indicates that the model does not explain the variance in ratings at all.
  • Possible reasons:
    • Most importantly, this dataset was generated randomly above, so avg_rating has no true relationship with the inputs; near-zero explanatory power is expected.
    • In real survey data, important factors like ride smoothness, duration, and theme could also be missing.
    • Ratings can depend on subjective user experience, making linear regression less effective.
  • To improve the model, we could:
    • Collect more qualitative data (customer reviews, perceived thrill levels).
    • Try non-linear models like Decision Trees or Random Forests (sketched after the next list).

Why Use a Model with Low R²?

  • A negative R² means that the model does not fit the data well.
  • However, we still analyze the coefficients to understand directional influences.
  • This model is a first attempt and can be improved with non-linear algorithms or additional features (e.g., customer reviews, wait times).
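As a quick illustration of that last point (not run in the original notebook), a non-linear model could be fit on the same train/test split for comparison:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Fit a random forest on the same split and compare test R².
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest R²: {r2_score(y_test, rf.predict(X_test)):.4f}")

(Given that the ratings here were randomly generated, the forest should not do meaningfully better; on real survey data it might.)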

E. Display the coefficient values of your model inputs.¶

To display the coefficient values of the input variables in a linear model, use model.coef_ after training.

In [19]:
# Extract model coefficients and feature names
coefficients = pd.DataFrame({"Feature": X.columns, "Coefficient": model.coef_})

# Sort by absolute value to see the most influential factors
coefficients["Abs_Coefficient"] = coefficients["Coefficient"].abs()
coefficients = coefficients.sort_values(by="Abs_Coefficient", ascending=False).drop(columns=["Abs_Coefficient"])

# Display coefficients
print(coefficients)
             Feature  Coefficient
4      material_Wood    -1.144795
7  track_color_White    -0.623123
1          seats_car     0.556483
5  track_color_Green    -0.416367
6    track_color_Red     0.241064
3   rocketlaunch_Yes     0.156145
0           maxspeed     0.005593
2               drop    -0.001800

Model Coefficient Analysis

The coefficients show the impact of each variable on the average ride rating:

Variables with a negative impact on ratings:

  • material_Wood (-1.14) – Wooden coasters receive lower ratings on average than steel coasters, perhaps because they are less smooth or less impressive.
  • track_color_White (-0.62) – Coasters with white track receive lower ratings, possibly because white looks less exciting to riders.
  • track_color_Green (-0.42) – Green is also associated with lower ratings.

Variables with a positive impact:

  • seats_car (0.56) – More seats per car is associated with higher ratings. Riders may prefer roomier cars, perhaps feeling safer or more comfortable.
  • track_color_Red (0.24) – Red track has a positive effect on ratings, possibly because it is associated with energy, adrenaline, and speed.
  • rocketlaunch_Yes (0.16) – Rides with a launch start receive slightly higher ratings; visitors apparently find this element exciting.

Variables with little influence:

  • maxspeed (0.0056) – Speed has almost no effect on the rating. This is interesting, since one would expect faster coasters to be rated higher; perhaps smoothness or intensity matters more.
  • drop (-0.0018) – Drop height also has almost no effect, suggesting riders judge the overall impression rather than extremeness alone.

Conclusion: design (track color) affects the perception of a ride; launch starts and car capacity have positive effects; track material and some colors have negative effects; speed and drop height do not play a decisive role.

F. Write a paragraph or two for Lobster Land management about what your model is showing you.¶

Based on clustering and regression analysis, we provide key recommendations for designing new attractions.

Interpretation of the linear model results. We now read the coefficients to understand which factors most strongly influence ride ratings.

Which characteristics were the most/least influential? The most significant (largest in absolute value):

material_Wood (-1.14) – wooden construction is the strongest single driver in the model, pulling ratings down relative to steel. track_color_White (-0.62) and track_color_Green (-0.42) lower ratings, while track_color_Red (+0.24) raises them, so track color does shape perception. seats_car (+0.56) – cars with more seats are associated with higher ratings. rocketlaunch_Yes (+0.16) – a launch start has a modest positive effect.

The least significant (coefficients close to 0):

maxspeed (+0.006) and drop (-0.002) – contrary to intuition, raw speed and drop height barely move the predicted rating in this model.

Caveats: R² (the coefficient of determination) is negative here, meaning the model explains ratings worse than a constant would; with this randomly generated data that is expected, and with real survey data other factors (atmosphere, theming) could also matter. Outliers (extreme ratings) can distort results, and strongly correlated features (for example, if maxspeed and drop moved together) can blur the coefficients.

Recommendations for Lobster Land. What should the new coaster emphasize? Taken at face value, the model favors steel over wood construction, a red rather than white or green track, roomier four-seat cars, and a launch start. Speed and drop height appear less decisive, so the budget need not chase record-breaking statistics.

How should the new coaster be promoted?

Advertising with an emphasis on the overall ride experience – POV video content will attract attention. Use the launch start as a USP (unique selling point), since its coefficient is positive. Preview rides for bloggers and influencers will help build hype before opening.

Enhancing the Marketing Strategy

  • Personalized Offers: Use ride history data to suggest experiences based on visitor preferences.
  • Dynamic Pricing: Higher ticket prices for extreme rides during peak hours, discounts for early reservations.
  • AI-Powered Recommendations: Suggest rides based on past visits (e.g., "If you liked Ride X, you’ll love Ride Y!").

Part III: Wildcard: Marketing & Segments¶

For my analysis I used an online advertising image for an event dedicated to the Miami real estate market.

In [20]:
from IPython.display import display, Image
display(Image(filename="advertisement.png"))
[Figure: the advertisement image, advertisement.png]

Segmentation analysis of the advertisement. Consider the advertisement for the "State of the Market Miami 2025" event, dedicated to the real estate market.

Which segment of consumers is it targeting?

Main segment:

  • Real estate investors (both commercial and residential).
  • Realtors, brokers, and developers interested in market trends.
  • Business people and company owners considering investments in Miami real estate.

Additional segment:

  • A premium audience (likely high-income customers), since the event is held at a stylish venue with cocktails from E11EVEN Vodka and music from a DJ.

Why do I think that?

  • Keywords in the advertisement: "Real Estate", "Residential | Commercial" are clearly aimed at real estate professionals.
  • Location: Miami is one of the largest real estate markets in the United States.
  • The evening time slot hints at a networking format, which matters to industry professionals.
  • The design style is elegant and premium, suited to a business audience.

Am I in this segment? I may be interested in real estate trends, but the event's audience is not directly relevant to me.

Is the advertisement mass-market (unsegmented)? No, the advertising is NOT mass-market:

  • It is highly specialized, aimed at real estate professionals.
  • It is not designed for ordinary residents simply looking for housing.
  • The design and presentation of the information exclude a casual audience.

How effective is it?

Strengths:

  • Clear positioning: it is immediately obvious what the event is about.
  • Attractive design that creates a premium feel.
  • Mentioning cocktails and music makes the event more appealing.
  • The location and date are well chosen for professionals.

Drawbacks:

  • There is no call to action (how to register is not specified).
  • There are no details about the speakers (important for a business audience).
  • Listing the main discussion topics would increase interest.

Psychological Triggers in the Advertisement
This advertisement is effective because it leverages:

  • Social Proof – Featuring industry professionals makes it more credible.
  • Exclusivity & Scarcity – Using words like "State of the Market" and "2025" creates urgency.
  • Luxury Appeal – Highlighting VIP elements (cocktails, DJs) makes it attractive to high-net-worth individuals.

Verdict: The advertisement is effective for its target audience, but strengthening the call to action and adding registration details would increase engagement.