Altyn Baigaliyeva

Part I: Data Exploration and Initial Analysis

Step 1: We use pd.read_csv() and head() to load and preview the data because it allows us to check the structure of the dataset and make sure it was loaded correctly.

Step 2: We use info() and isnull().sum() to analyze data types and missing values because it helps us determine if the data needs to be cleaned before analysis.

Step 3: We use select_dtypes() to separate variables into numeric and categorical because different data types require different analysis and visualization methods.

A. Data Exploration

1. Load the dataset into Python.

2. Call the head() function to display the first few rows of the data

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

file_path = "lobster_loyalty.csv"
df = pd.read_csv(file_path)

print(df.head())

df.info()
print(df.isnull().sum())

categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

print("Categorical columns:", categorical_columns)
print("Numerical columns:", numerical_columns)
   Customer_ID Membership_Tier  Spending_Per_Visit  Visit_Count
0            1          Silver               38.42           17
1            2          Bronze               35.01            1
2            3          Bronze               27.60            6
3            4          Silver               29.23           23
4            5            Gold               36.52           25
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Customer_ID         500 non-null    int64  
 1   Membership_Tier     500 non-null    object 
 2   Spending_Per_Visit  500 non-null    float64
 3   Visit_Count         500 non-null    int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 15.8+ KB
Customer_ID           0
Membership_Tier       0
Spending_Per_Visit    0
Visit_Count           0
dtype: int64
Categorical columns: ['Membership_Tier']
Numerical columns: ['Customer_ID', 'Spending_Per_Visit', 'Visit_Count']

3. Which of the variables in this dataset are categorical, and which are numeric?

Categorical columns: 'Membership_Tier'

Numerical columns: 'Customer_ID', 'Spending_Per_Visit', 'Visit_Count'

4. Check for missing values. Are there any? If so, make a decision regarding removal and/or imputation. If not, drive on to the next step.

There are no missing values, so no data cleaning is required. Categorical variable: Membership_Tier (membership level). Numeric variables: Spending_Per_Visit (average spend per visit), Visit_Count (number of visits), and Customer_ID (an identifier, which does not need to be analyzed).

B. Customer Behavior by Membership Tier

1. Create a bar plot to compare the average amount spent per visit by membership tier. Sort the bars either from tallest to shortest, or shortest to tallest. Fill them with any (non-default) color of your choice. Be sure to include a title, along with axis labels

To complete the next step, we need to:

  • Plot a bar chart of the average spend (Spending_Per_Visit) for each membership tier using barplot(), because it allows us to visually compare the differences between groups.
  • Sort the bars in ascending or descending order to make trends easier to see.
  • Add a title and axis labels to make the graph informative.

In [5]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate average spending per visit for each membership tier
avg_spending = df.groupby("Membership_Tier")["Spending_Per_Visit"].mean().sort_values()

# Define custom colors: fuchsia, peach, and amber
custom_colors = ["#FF00FF", "#FFDAB9", "#FFBF00"]  # Fuchsia, Peach, Amber

# Create the bar plot
plt.figure(figsize=(8, 5))
sns.barplot(x=avg_spending.index, y=avg_spending.values, hue=avg_spending.index, 
            palette=custom_colors, legend=False)  # Explicit hue assignment

# Add labels and title
plt.xlabel("Membership Tier")
plt.ylabel("Average Spending Per Visit ($)")
plt.title("Average Spending Per Visit by Membership Tier")

# Show the plot
plt.show()
[Figure: bar plot of average spending per visit by membership tier]

To visually compare the average spending per visit across membership tiers, we created a bar plot using seaborn.

  • We grouped the data by Membership_Tier and calculated the mean spending per visit.
  • Instead of the default colors, we used custom colors:
    • Fuchsia (#FF00FF) for one tier
    • Peach (#FFDAB9) for another
    • Amber (#FFBF00) for the third
  • To comply with newer seaborn versions (which deprecate passing palette without hue), we assigned hue=avg_spending.index and disabled the legend (legend=False).

The resulting plot provides a clear and visually appealing comparison of spending habits across different membership levels.

2. What does this bar plot suggest about customer spending habits?

Graph analysis (average spending per visit by membership tier): the higher the membership tier, the higher the average spend per visit.

Gold customers spend the most per visit, Silver customers are at the middle level, and Bronze customers spend the least. This suggests the loyalty program is effective.

Higher membership tiers are associated with higher spending, which may indicate that higher status motivates customers to spend more. Recommendations:

Analyze which bonuses and privileges motivate customers to move up a tier. Consider strategies to raise the average spend of Bronze customers (for example, additional discounts or bonus programs).

Part II: Hypothesis Testing – Spending by Membership Tier

A. Formulating Hypotheses

Hypotheses for the one-way ANOVA test:

Null hypothesis (H₀): Average spending per visit does not differ between membership tiers (Bronze, Silver, Gold).

Alternative hypothesis (H₁): Average spending per visit differs for at least one pair of membership tiers.

B. Running the ANOVA Test

1. Perform a one-way ANOVA test comparing spending per visit across the three membership tiers.

2. Report the F-statistic and p-value.

We used a one-way ANOVA (Analysis of Variance) test because we wanted to compare the average spending per visit between three independent membership level groups (Bronze, Silver, Gold). ANOVA is used when there are more than two groups and we want to determine if there are statistically significant differences between them.

In [6]:
from scipy.stats import f_oneway

# Extract spending per visit for each membership tier
bronze_spending = df.loc[df["Membership_Tier"] == "Bronze", "Spending_Per_Visit"]
silver_spending = df.loc[df["Membership_Tier"] == "Silver", "Spending_Per_Visit"]
gold_spending = df.loc[df["Membership_Tier"] == "Gold", "Spending_Per_Visit"]

# Perform one-way ANOVA test
anova_result = f_oneway(bronze_spending, silver_spending, gold_spending)

# Output F-statistic and p-value
print(f"ANOVA Test Results: F-statistic = {anova_result.statistic:.4f}, p-value = {anova_result.pvalue:.4f}")
ANOVA Test Results: F-statistic = 231.9584, p-value = 0.0000

3. Does membership tier significantly impact spending?

ANOVA test results analysis: F-statistic = 231.9584 (a high value indicates large differences between group means); p-value = 0.0000 (p < 0.05).

Conclusion: membership tier significantly affects average spending per visit. Since the p-value is far below 0.05, we reject the null hypothesis and conclude that there are statistically significant differences in spending between membership tiers.
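As a quick sanity check (not part of the original output), the rejection threshold for this F-statistic can be computed from the F distribution with 2 and 497 degrees of freedom:

```python
from scipy.stats import f

# Degrees of freedom: k - 1 = 2 between groups, N - k = 500 - 3 = 497 within
f_crit = f.ppf(0.95, dfn=2, dfd=497)  # critical value at alpha = 0.05
print(f"Critical F value: {f_crit:.4f}")
```

The observed F = 231.9584 far exceeds this critical value (roughly 3), which is consistent with rejecting H₀.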

C. Additional Analysis

If the ANOVA test is significant, conduct pairwise t-tests between membership groups (Gold vs. Silver, Gold vs. Bronze, Silver vs. Bronze) using a Bonferroni correction to adjust for multiple comparisons.

1. What is the new adjusted alpha threshold? What is the purpose of using a Bonferroni correction?

We perform 3 pairwise t-tests, which increases the probability of a false positive (Type I error). The Bonferroni correction reduces this risk by dividing the standard alpha level (0.05) by the number of tests (3), giving an adjusted threshold of 0.05 / 3 ≈ 0.0167.

2. Run the three tests.

In [7]:
from scipy.stats import ttest_ind

# Perform pairwise t-tests with unequal variance (Welch’s t-test)
t_stat1, p_val1 = ttest_ind(gold_spending, silver_spending, equal_var=False)
t_stat2, p_val2 = ttest_ind(gold_spending, bronze_spending, equal_var=False)
t_stat3, p_val3 = ttest_ind(silver_spending, bronze_spending, equal_var=False)

# Apply Bonferroni correction
alpha_corrected = 0.05 / 3

# Print results
print(f"T-test (Gold vs Silver): p-value = {p_val1:.4f} {'(Significant)' if p_val1 < alpha_corrected else '(Not Significant)'}")
print(f"T-test (Gold vs Bronze): p-value = {p_val2:.4f} {'(Significant)' if p_val2 < alpha_corrected else '(Not Significant)'}")
print(f"T-test (Silver vs Bronze): p-value = {p_val3:.4f} {'(Significant)' if p_val3 < alpha_corrected else '(Not Significant)'}")
T-test (Gold vs Silver): p-value = 0.0000 (Significant)
T-test (Gold vs Bronze): p-value = 0.0000 (Significant)
T-test (Silver vs Bronze): p-value = 0.0000 (Significant)

3. What do the pairwise comparisons reveal? Based on the results obtained here, what can you share with Lobster Land management about the three groups’ spending habits?

Pairwise t-test results: Gold vs Silver → p-value = 0.0000 → significant; Gold vs Bronze → p-value = 0.0000 → significant; Silver vs Bronze → p-value = 0.0000 → significant. Since all p-values are below 0.0167 (the Bonferroni-corrected threshold), the differences in average spending between all membership tiers are statistically significant.

Conclusion for Lobster Land management: the higher the membership tier, the higher the average spend per visit, and the differences between all tiers are significant, confirming the loyalty program's impact on spending. Recommendations:

  • Encourage Bronze and Silver customers to upgrade (additional bonuses, discounts, exclusive offers).
  • Research which factors keep customers at Bronze/Silver and work on motivating them to upgrade.

Part III: Chi-Square Goodness of Fit – Visit Frequency by Membership Tier

A. Data Engineering

1. Starting with the visit_count variable, create a new binned variable called “frequency_group.” Designate all visitors as either: frequent visitors, occasional visitors, or rare visitors. Frequent visitors should be those who visited the park 10 or more times in the season. Rare visitors should be those who visited 3 or fewer times, and occasional visitors should be everyone in between.

To perform the next step, we need to:

Create a new categorical variable frequency_group that divides customers by visit frequency. Use the apply() method to classify customers:

  • "Frequent": ≥ 10 visits
  • "Occasional": 4–9 visits
  • "Rare": ≤ 3 visits

In [8]:
# Define a function to categorize visit frequency
def categorize_visits(visits):
    if visits >= 10:
        return "Frequent"
    elif 4 <= visits <= 9:
        return "Occasional"
    else:
        return "Rare"

# Apply function to create a new column
df["frequency_group"] = df["Visit_Count"].apply(categorize_visits)

# Display the first few rows to verify
print(df[["Visit_Count", "frequency_group"]].head())
   Visit_Count frequency_group
0           17        Frequent
1            1            Rare
2            6      Occasional
3           23        Frequent
4           25        Frequent
In [9]:
# Count the number of customers in each frequency group
print(df["frequency_group"].value_counts())
frequency_group
Frequent      371
Occasional     97
Rare           32
Name: count, dtype: int64

Distribution of customers by frequency of visits groups:

Frequent customers: 371 (the largest group). Occasional customers: 97. Rare customers: 32 (the smallest group).

Conclusion: most customers visit the park 10 times or more, which indicates high loyalty among the core user base. However, 32 customers visited the park no more than 3 times, which may indicate weak engagement in this segment.

B. Next, let’s get ready to run a statistical test to explore the relationship between membership tier and frequency_group.

1. Before digging into the data, what is your null hypothesis regarding membership tier and frequency group?

Null hypothesis (H₀): Visit frequency (frequency_group) does not depend on membership tier (Membership_Tier); the proportions of frequent, occasional, and rare visitors are the same across all tiers.

2. What is the alternative hypothesis?

Alternative hypothesis (H₁): Visit frequency (frequency_group) depends on membership tier (Membership_Tier); the proportions of frequent, occasional, and rare visitors differ between tiers.

3. Under the null hypothesis, what are the expected numbers of Frequent, Occasional, and Rare visitors across each membership tier? (For this answer, you should be showing 9 total values). Do not just show the numbers here, but also show how you got them – what calculation generated these expected values?

Method for calculating expected frequencies:

Build a contingency table of observed counts. Calculate expected frequencies using the formula: E_ij = (Row Total × Column Total) / Grand Total

  • E_ij — expected number of customers in group (i, j)
  • Row Total — total number of customers in a given membership level
  • Column Total — total number of customers in a given frequency group
  • Grand Total — total number of customers in the dataset
In [10]:
import numpy as np

# Create a contingency table (actual counts)
contingency_table = pd.crosstab(df["Membership_Tier"], df["frequency_group"])

# Calculate expected frequencies
row_totals = contingency_table.sum(axis=1).values.reshape(-1, 1)  # Row totals
col_totals = contingency_table.sum(axis=0).values  # Column totals
grand_total = contingency_table.values.sum()  # Total number of observations

expected_frequencies = (row_totals @ col_totals.reshape(1, -1)) / grand_total

# Convert to DataFrame for better readability
expected_df = pd.DataFrame(expected_frequencies, index=contingency_table.index, columns=contingency_table.columns)

# Print results
print("Actual Frequencies:\n", contingency_table)
print("\nExpected Frequencies:\n", expected_df)
Actual Frequencies:
 frequency_group  Frequent  Occasional  Rare
Membership_Tier                            
Bronze                 67          55    32
Gold                  154           0     0
Silver                150          42     0

Expected Frequencies:
 frequency_group  Frequent  Occasional    Rare
Membership_Tier                              
Bronze            114.268      29.876   9.856
Gold              114.268      29.876   9.856
Silver            142.464      37.248  12.288

Preliminary findings: Gold and Silver members visit the park more often than expected, while Bronze members visit less often than expected. The differences between observed and expected values are large, suggesting that visit frequency may depend on membership tier.
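The gap between observed and expected counts can be verified directly from the tables above; a minimal sketch using the counts printed earlier:

```python
import numpy as np

# Observed counts from the contingency table above (rows: Bronze, Gold, Silver)
observed = np.array([[67, 55, 32],
                     [154, 0, 0],
                     [150, 42, 0]], dtype=float)

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

print(np.round(expected, 3))             # matches the expected table above
print(np.round(observed - expected, 3))  # positive = more visits than expected
```

The sign of each (O − E) entry shows the direction of the deviation: positive for Gold/Silver in the Frequent column, negative for Bronze.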

C. Running the Chi-Square Test

1. With your expected values, along with the observed values from the actual data, run a chi-square goodness of fit test in Python. What is the chi-square statistic, and what is the p-value, for this test?

To perform the next step, we must:

Use the observed (contingency_table) and expected (expected_df) frequencies and calculate the χ² statistic using chi2_contingency(), because this test determines whether there is a statistically significant relationship between membership tier (Membership_Tier) and visit frequency (frequency_group).

In [11]:
from scipy.stats import chi2_contingency

# Perform Chi-Square test
chi2_stat, p_value, dof, expected_values = chi2_contingency(contingency_table)

# Print results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
Chi-Square Statistic: 157.2728
p-value: 0.0000

Chi-square test results analysis: χ² statistic = 157.2728 (a high value indicates large deviations from the expected counts); p-value = 0.0000 (p < 0.05).

Conclusion: since the p-value is below 0.05, we reject the null hypothesis. Membership tier (Membership_Tier) has a statistically significant association with visit frequency (frequency_group).

Gold and Silver customers visit the park more often than expected. Bronze customers visit the park less often than expected.

D. Interpreting the Results

1. Does membership level have a significant effect on visit frequency?

Yes, membership tier has a significant effect on visit frequency (p-value = 0.0000). We reject the null hypothesis, which means that the distribution of frequent, occasional, and rare visitors differs significantly by membership tier.

2. If there is a significant relationship, which membership tier shows the highest proportion of frequent visitors? To understand the overall impact of the membership tiers to Lobster Land’s profitability, what else would you need to know?

Which membership level has the highest proportion of frequent visitors?

Gold members – 100% Frequent (154 out of 154), far more than expected. Silver members – mostly Frequent (150 out of 192). Bronze members – the highest proportion of Rare visitors (32 out of 154) and fewer Frequent visitors than expected (67 instead of the expected ~114). Gold has the highest proportion of frequent visitors.
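These shares can be computed directly from the observed counts above; a minimal sketch:

```python
import pandas as pd

# Observed counts from the contingency table above
counts = pd.DataFrame(
    {"Frequent": [67, 154, 150], "Occasional": [55, 0, 42], "Rare": [32, 0, 0]},
    index=["Bronze", "Gold", "Silver"],
)

# Share of each frequency group within each tier (rows sum to 1)
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

The same table can be produced straight from df with pd.crosstab(df["Membership_Tier"], df["frequency_group"], normalize="index").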

What additional data is needed to evaluate the impact of membership level on park profitability?

To more accurately assess the impact of membership on revenue, we need:

  • Average spend per visit (Spending_Per_Visit) broken down by visit frequency.
  • Membership duration data (how long customers remain at each tier).
  • Loyalty program costs (which bonuses are provided to customers, and whether they pay off).
  • Revenue from additional services (food, attractions, VIP areas).

Conclusion: Lobster Land management should focus on engaging Bronze customers and motivating them to upgrade to Silver/Gold.

E. Demonstrate where the chi-square number from your test came from.

Formula for the χ² criterion:

χ² = Σ [(O_i - E_i)² / E_i]

where:

  • O_i — observed value (Observed)

  • E_i — expected value (Expected)

  • Σ — sum over all cells of the contingency table

To perform this step we must:

Calculate the difference (O − E) for each cell. Square the difference and divide by the expected value (E). Sum all the resulting values to obtain the overall χ² statistic.

In [1]:
# Observed frequencies (from contingency table)
observed = [
    [67, 55, 32],  # Bronze (Frequent, Occasional, Rare)
    [154, 0, 0],   # Gold (Frequent, Occasional, Rare)
    [150, 42, 0]   # Silver (Frequent, Occasional, Rare)
]

# Expected frequencies (calculated earlier)
expected = [
    [114.268, 29.876, 9.856],   # Bronze
    [114.268, 29.876, 9.856],   # Gold
    [142.464, 37.248, 12.288]   # Silver
]

# Manually calculating chi-square statistic
chi_square = 0

# Loop through observed and expected values to compute chi-square
for i in range(len(observed)):  # Membership tiers
    for j in range(len(observed[i])):  # Frequency groups
        O = observed[i][j]  # Observed value
        E = expected[i][j]  # Expected value
        chi_square += ((O - E) ** 2) / E  # Apply chi-square formula

# Display result
print("Manually Calculated Chi-Square Statistic:", round(chi_square, 4))
Manually Calculated Chi-Square Statistic: 157.2728

The manually calculated χ² = 157.2728 matches the result from chi2_contingency(), which confirms the correctness of the calculations.

We manually input the observed and expected frequency values as lists.

We initialize chi_square = 0 to store the sum.

We iterate through each observed and expected value using nested loops.

We apply the chi-square formula to each pair of values.

Finally, we sum up all the values to get the final chi-square statistic.