
See the attached file, and please use enough metrics to assess bias; this is the key emphasis of this assignment: AI/ML bias techniques. Refer to Disparate Impact, BERT AND OTHER...

Dear Participants,
Please find below the Time Series Forecasting Project instructions:
· You have to submit 2 files:
1. Answer Report: In this, you need to submit the answers to all the questions in a sequential manner. It should include a detailed explanation of the approach used, insights, inferences, and all code outputs such as graphs, tables, etc. Your report should not be filled with code. You will be evaluated based on the business report.
Note: In the business report, there should be a proper interpretation of all the tasks performed along with actionable insights. The presence of model interpretation alone is not sufficient to be eligible for full marks in each of the criteria mentioned in the rubric. Marks will be deducted wherever inferences are not clearly mentioned.
2. Jupyter Notebook file: This is a must and will be used for reference while evaluating.
Any assignment found to be copied or plagiarized from another person will not be graded and will be marked as zero. Please ensure timely submission, as post-deadline submissions will not be accepted.
Problem 1 for the Data Set: Shoesales.csv
You are an analyst at the IJK shoe company, and you are expected to forecast sales of pairs of shoes for the 12 months following the end of the data. The shoe sales data cover January 1980 to July 1995.
Problem 2 for the Data Set: SoftDrink.csv
You are an analyst at the RST soft drink company, and you are expected to forecast soft drink production for the 12 months following the end of the data. The soft drink production data cover January 1980 to July 1995.
Please perform the following tasks on each of these two data sets separately.
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build various exponential smoothing models on the training data and evaluate the model using RMSE on the test data.
Other models such as regression, naïve forecast, and simple average models should also be built on the training data, and their performance checked on the test data using RMSE.
5. Check the stationarity of the data on which the model is being built, using appropriate statistical tests, and state the hypotheses for the test. If the data are found to be non-stationary, take appropriate steps to make them stationary. Check the new data for stationarity and comment.
Note: Stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criterion (AIC) on the training data, and evaluate this model on the test data using RMSE. (An illustrative code sketch for tasks 5 and 6 follows this task list.)
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate this model on the test data using RMSE.
8. Build a table with all the models built along with their corresponding parameters and the respective RMSE values on the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built, report your findings, and suggest measures the company should take for future sales.
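For reference only (not part of the official instructions), here is a minimal sketch of how the stationarity check in task 5 and the AIC-based search in task 6 might look in Python, assuming statsmodels is available and the training portion of the series is held in a pandas Series named train (a placeholder name):

# Illustrative sketch only; `train` is a placeholder for the training series.
import itertools
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Task 5: Augmented Dickey-Fuller test.
# H0: the series has a unit root (non-stationary); H1: the series is stationary.
adf_stat, p_value = adfuller(train)[:2]
print(f'ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}')
if p_value > 0.05:
    # Non-stationary at alpha = 0.05: difference once and re-test.
    print('p-value after first differencing:', adfuller(train.diff().dropna())[1])

# Task 6: choose (p, d, q) by the lowest AIC on the training data.
best_aic, best_order = float('inf'), None
for order in itertools.product(range(3), range(2), range(3)):
    try:
        aic = SARIMAX(train, order=order).fit(disp=False).aic
    except Exception:
        continue
    if aic < best_aic:
        best_aic, best_order = aic, order
print('Lowest-AIC order:', best_order, 'AIC =', round(best_aic, 2))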
Extended Project - Time Series Forecasting Project: Grading Rubric
1. Read the data as an appropriate Time Series data and plot the data. (2.0 pts)
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition. (5.0 pts)
3. Split the data into training and test. The test data should start in 1991. (2.0 pts)
4. Build various exponential smoothing models on the training data and evaluate them using RMSE on the test data. Other models such as regression, naïve forecast, and simple average models should also be built on the training data and their performance checked on the test data using RMSE. (Please try to build as many models, and as many iterations with different parameters, as possible.) (16.0 pts)
5. Check the stationarity of the data on which the model is being built using appropriate statistical tests, and state the hypotheses for the test. If the data are found to be non-stationary, take appropriate steps to make them stationary; check the new data for stationarity and comment. Note: Stationarity should be checked at alpha = 0.05. (3.0 pts)
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criterion (AIC) on the training data, and evaluate this model on the test data using RMSE. (8.0 pts)
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data, and evaluate these models on the test data using RMSE. (8.0 pts)
8. Build a table (create a data frame) with all the models built, their corresponding parameters, and the respective RMSE values on the test data. (2.0 pts)
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands. (3.0 pts)
10. Comment on the model thus built, report your findings, and suggest measures the company should take for future sales. (Please explain and summarise the various steps performed in this project; there should be proper business interpretation and actionable insights.) (5.0 pts)
11. Reflection report: please reflect on all that you learnt and fill in the reflection report by pasting this link into your browser: https://docs.google.com/forms/d/e/1FAIpQLSeBxE1cfP7ugyx8sa1JFGg_Nkv-jlEztsszbc9US911oWo2KQ/viewform (0.0 pts)
12. Quality of Business Report (please refer to the Evaluation Guidelines for the Business Report checklist; marks in this criterion are at the moderator's discretion). (6.0 pts)
Total Points: 60.0
All the very best!
Regards,
Program Office
Answered 21 days After Jun 28, 2024

Solution

Pratibha answered on Jul 20 2024
Bias Mitigation
Identifying and Mitigating Bias in Ad Distribution: A Comprehensive Analysis
The purpose of this project is to analyze the current advertising distribution patterns to identify any biases in ad type, impressions, spending, and geographic targeting, and to propose algorithms to mitigate these biases and ensure fair representation.
About Dataset
1. ad_type: This variable categorizes the types of advertisements used in the Google Ads campaign (Video, Image, Text).
2. impressions: This variable indicates the range of the number of impressions (how many times the ad was displayed).
3. spend_usd: This variable explains the range of money spent in USD on the ads.
4. geo_targeting_included: This variable specifies the geographical regions where the ads were targeted (there are 50 different locations mentioned in the dataset).
Interpretation
This dataset helps analyze various aspects of a Google Ads campaign:
1. Ad Type Distribution: Understanding the proportion of different advertising types (video, text, and Image) can help assess which formats are being utilized most frequently.
2. Impression Ranges: Analyzing the distribution of advertising impressions helps understand the reach of the advertisements.
3. Spending Patterns: Evaluating the spending ranges gives insights into budget allocation and cost efficiency.
4. Geographic Targeting: Examining the geographical distribution can reveal which regions are being targeted more heavily and potentially correlate this with performance metrics.
Use Cases
Performance Analysis: By combining these data points, one can evaluate the performance of different advertisement types across various geographic locations and spending ranges.
Budget Allocation: Understanding which combinations yield the highest impressions and engagement can inform future budget allocation strategies.
Targeting Optimization: Identifying which regions respond best to certain ad types and spending ranges can help optimize targeting strategies for future campaigns.
Objectives
1. Identify Bias in Ad Distribution: Assess whether there are biases in advertisement type, impression distribution, spending, and geographic targeting, to ensure fair representation across all demographics.
2. Enhance Fairness in Targeted Marketing: Develop strategies to ensure advertisements are delivered equitably across different regions and demographic groups.
3. Increase Transparency in Ad Spending: Provide clear insights into how advertising budgets are allocated and spent across different categories and regions.
4. Optimize Ad Performance Without Bias: Ensure that optimizing for performance does not inadvertently introduce or perpetuate biases.
5. Improve Data-Driven Decision Making: Utilize data analytics to make informed, unbiased decisions in advertisement targeting and spending.
Solutions Using Machine Learning
1. Identify Bias in Ad Distribution
Solution: Bias Detection Models
ML Techniques: Use clustering algorithms (e.g., K-means, DBSCAN) and classification models (e.g., logistic regression, decision trees) to detect patterns in ad distribution.
Implementation: Train models on historical ad data to identify discrepancies in how ads are distributed across different ad types, impression ranges, spending categories, and geographic locations.
Outcome: Highlight regions, demographics, or advertisement types where biases exist, enabling targeted interventions.
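As an illustration of this idea (an addition, not part of the original write-up), the sketch below clusters geographies by their average reach and spend using K-means; it assumes the encoded columns impressions_avg and spend_usd_avg that are created later in this notebook.

# Illustrative sketch: K-means on per-geography ad profiles to surface skewed regions.
# Assumes df with the derived columns impressions_avg and spend_usd_avg (created below).
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

geo_profile = df.groupby('geo_targeting_included')[['impressions_avg', 'spend_usd_avg']].mean()
X_scaled = StandardScaler().fit_transform(geo_profile)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
geo_profile['cluster'] = kmeans.fit_predict(X_scaled)

# Geographies in the low-reach / low-spend cluster are candidates for under-served regions.
print(geo_profile.sort_values('cluster'))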
2. Enhance Fairness in Targeted Marketing
Solution: Fairness-Optimized Ad Delivery Algorithms
ML Techniques: Implement fairness-aware algorithms like Fairness Constraints in machine learning models or use post-processing techniques to adjust ad delivery.
Implementation: Adjust existing ad delivery algorithms to ensure that all demographic groups are equally represented. Use techniques such as demographic parity, equalized odds, and disparate impact removal.
Outcome: More equitable ad distribution across diverse user segments.
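To make these fairness notions concrete, here is a minimal sketch of the two metrics emphasised in the brief: the disparate impact ratio (80% rule) and the demographic parity difference. Treating a high-spend ad as the favorable outcome and geography as the protected attribute is an illustrative assumption, as are the exact spend labels used.

# Illustrative sketch of disparate impact and demographic parity on this dataset.
# The favorable-outcome definition and the spend labels below are assumptions.
def fairness_metrics(df, protected_attr, favorable_mask):
    # Selection rate per group = share of rows with the favorable outcome.
    rates = favorable_mask.groupby(df[protected_attr]).mean()
    di_ratio = rates.min() / rates.max()    # disparate impact ratio (80% rule)
    dp_diff = rates.max() - rates.min()     # demographic parity difference
    return rates, di_ratio, dp_diff

favorable = df['spend_usd'].isin(['1k-50k', '50k-100k', '>100k'])  # assumed labels
rates, di, dp = fairness_metrics(df, 'geo_targeting_included', favorable)
print(rates.sort_values())
print(f'Disparate impact ratio: {di:.2f} (below 0.8 suggests adverse impact)')
print(f'Demographic parity difference: {dp:.2f}')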
3. Increase Transparency in Ad Spending
Solution: Spending Transparency Dashboards
ML Techniques: Use data visualization tools and explainable AI (XAI) methods.
Implementation: Develop interactive dashboards that show how ad budgets are allocated and spent. Incorporate XAI techniques to explain ML model decisions regarding budget allocation.
Outcome: Clear, understandable insights into ad spending patterns, promoting trust and accountability.
4. Optimize Ad Performance Without Bias
Solution: Bias-Resistant Performance Models
ML Techniques: Train performance optimization models (e.g., gradient boosting, logistic regression, decision tree) with fairness constraints.
Implementation: Integrate fairness metrics into the loss function during model training to ensure that performance optimization does not favor any particular group.
Outcome: High-performing ads that are also fair and unbiased.
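A minimal, self-contained sketch of this idea (not the solution's actual implementation) is shown below: a logistic regression trained by gradient descent in which a demographic-parity penalty is added to the cross-entropy loss. X, y, and group are placeholder arrays; a fairness library such as fairlearn could be used instead.

# Illustrative sketch: loss = cross-entropy + lam * (demographic parity gap)^2.
# X (n x d features), y (0/1 labels) and group (0/1 protected attribute) are placeholders.
import numpy as np

def fair_logreg(X, y, group, lam=1.0, lr=0.1, epochs=500):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted probabilities
        grad_ce = X.T @ (p - y) / len(y)              # cross-entropy gradient
        # Demographic parity gap between mean scores of the two groups, and its gradient.
        gap = p[group == 1].mean() - p[group == 0].mean()
        d_gap = (X[group == 1] * (p * (1 - p))[group == 1, None]).mean(axis=0) \
              - (X[group == 0] * (p * (1 - p))[group == 0, None]).mean(axis=0)
        w -= lr * (grad_ce + lam * 2 * gap * d_gap)
    return w

Increasing lam shrinks the gap between group-level average scores at some cost in raw accuracy, which is exactly the trade-off described above.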
5. Improve Data-Driven Decision Making
Solution: Bias-Aware Data Analytics Tools
ML Techniques: Use advanced analytics and bias detection algorithms.
Implementation: Develop tools that analyze data for potential biases and provide actionable insights. Integrate these tools into the decision-making workflow to ensure bias-aware decisions.
Outcome: Informed decisions that take potential biases into account, leading to more equitable outcomes.
Motivation
Ensuring fairness in ML is critical because historical data often embed societal biases. If machine learning algorithms are left unchecked, they can unintentionally reproduce these inequalities, leading to unfair treatment of various demographic groups. The issue is particularly acute in high-stakes applications such as credit scoring, law enforcement, and hiring, where biased decisions can greatly affect individuals' lives and opportunities. Addressing these biases is important for creating equitable systems that serve all users and customers fairly.
Literature Review
The literature on fairness in machine learning highlights both the challenges and advancements in this field. Key fairness concepts such as Demographic Parity and Equalized Odds, which aim to mitigate bias by adjusting model outcomes, are introduced by Vaidya et al. (2024). Recent advancements that balance accuracy and fairness are discussed by Hort et al. (2024). These studies underscore the ongoing need for innovative solutions to ensure that machine learning models serve all demographic groups equitably. (For more information, refer to the final References section.)
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
#Load the dataset
df = pd.read_csv('google_ads.csv')
df.head()  # show the first 5 rows
ad_type impressions spend_usd geo_targeting_included
0 Video <=10k 100-1k Alaska
1 Image 10k-100k 100-1k Nebraska
2 Image 100k-1M 100-1k Nebraska
3 Text 10k-100k 100-1k Oregon
4 Image 100k-1M 100-1k Idaho
print('Data Dimension', df.shape) ##Data size
df.isnull().sum() #checking missing values
Data Dimension (164998, 4)
ad_type 0
impressions 0
spend_usd 0
geo_targeting_included 0
dtype: int64
df.spend_usd.value_counts()  # frequency of each spend_usd range
spend_usd
<100        94878
100-1k      41620
1k-50k      27869
50k-100k      422
>100k         209
Name: count, dtype: int64
df.describe(include='all').T #Basic descriptive statistics of data
count unique top freq
ad_type 164998 3 Video 79098
impressions 164998 5 <=10k 105836
spend_usd 164998 5 <100 94878
geo_targeting_included 164998 50 Arizona 12763
Exploratory Data Analysis (Data Processing and Visualization)
1. Data processing
2. Transparency in Ad Spending through visualizations i.e. Dashboard (Increase Transparency in Ad Spending)
3. Enhance Fairness in Targeted Marketing
## Function to clean a range string (e.g. '<=10k', '100-1k') and return its average value
def calculate_average_value(range_str):
    # Remove unwanted characters and expand unit suffixes
    range_str = range_str.replace('<=', '').replace('>', '').replace('k', '000').replace('M', '000000').replace('=', '')

    # Handle open-ended lower bounds like '<100' by halving the bound
    if '<' in range_str:
        return int(range_str.replace('<', '')) / 2

    # For a closed range, return the midpoint; otherwise return the single value
    if '-' in range_str:
        start, end = map(int, range_str.split('-'))
        return (start + end) / 2
    else:
        return int(range_str)
# Apply the function to the columns
df['impressions_avg'] = df['impressions'].apply(calculate_average_value)
df['spend_usd_avg'] = df['spend_usd'].apply(calculate_average_value)
df.head(3)  # show the first 3 rows of the data
ad_type impressions spend_usd geo_targeting_included impressions_avg spend_usd_avg
0 Video <=10k 100-1k Alaska 10000.0 550.0
1 Image 10k-100k 100-1k Nebraska 55000.0 550.0
2 Image 100k-1M 100-1k Nebraska 550000.0 550.0
## Function to calculate the average of a column grouped by another column
def calculate_group_average(df, column_name, group_by_column):
    return df.groupby(group_by_column)[column_name].mean()
# Calculate averages
average_impressions = calculate_group_average(df, 'impressions_avg', 'geo_targeting_included')
average_spend_usd = calculate_group_average(df, 'spend_usd_avg', 'geo_targeting_included')
print("Average Impressions by Geographic Location:")
print(average_impressions)
print("\nAverage Spend USD by Geographic Location:")
print(average_spend_usd)
Average Impressions by Geographic Location:
geo_targeting_included
Alabama 213515.176374
Alaska 164140.000000
Arizona 165583.718561
Arkansas 186717.325228
California 346230.191827
Colorado 79630.303030
Connecticut 34216.494845
Delaware 78465.909091
Florida 273290.953992
Georgia 376932.929093
Hawaii 52277.580071
Idaho 36149.425287
Illinois 236116.317530
Indiana 132328.693790
Iowa 147872.011895
Kansas 375235.640648
Kentucky 547606.382979
Louisiana 137035.123967
Maine 133306.671869
Maryland 102271.986971
Massachusetts 88389.010989
Michigan 159919.384729
Minnesota 85928.502879
Mississippi 119059.689289
Missouri 241528.316524
Montana 162938.269114
Nebraska 187300.000000
Nevada 113962.097840
New Hampshire 118081.014730
New Jersey 288625.410734
New Mexico 96034.172662
New York 84020.291693
North Carolina 219605.014687
North Dakota 63637.532134
Ohio 163419.427288
Oklahoma 88735.332464
Oregon 73222.656250
Pennsylvania 183454.234713
Rhode Island 37024.793388
South Carolina 228787.354902
South Dakota 43549.618321
Tennessee 257537.815126
Texas 180422.473868
Utah 93321.554770
Vermont 39940.944882
Virginia 200566.893424
Washington 96013.779528
West Virginia 103477.751756
Wisconsin 135069.336521
Wyoming 61679.035250
Name: impressions_avg, dtype: float64
Average Spend USD by Geographic Location:
geo_targeting_included
Alabama 4746.595570
Alaska 6050.800000
Arizona 5064.440179
Arkansas 3118.617021
California 6401.681957
Colorado 2976.734007
Connecticut 421.546392
Delaware 2042.424242
Florida 7394.228483
Georgia 7547.896300
Hawaii 707.117438
Idaho 351.532567
Illinois 4677.205072
Indiana 3842.773019
Iowa 4322.166304
Kansas 6699.337261
Kentucky 11247.176759
Louisiana 3975.000000
Maine 6591.728443
Maryland 2790.716612
Massachusetts 2818.791209
Michigan 4837.842847
Minnesota 2740.642994
Mississippi 3349.550286
Missouri 5048.312645
Montana 5743.887432
Nebraska 3258.900000
Nevada 3554.120758
New Hampshire 3102.905074
New Jersey 4222.179628
New Mexico 1801.169065
New York 2572.574509
North Carolina 6337.022661
North Dakota 3731.491003
Ohio 3586.720943
Oklahoma 2372.229465
Oregon 3419.531250
Pennsylvania 5423.067793
Rhode Island 591.735537
South Carolina 5921.721413
South Dakota 1025.318066
Tennessee 5809.159664
Texas 4011.324042
Utah 3029.034158
Vermont 1117.027559
Virginia 4099.281935
Washington 1868.175853
West Virginia 2813.875878
Wisconsin 4520.657501
Wyoming 920.871985
Name: spend_usd_avg, dtype: float64
# 3. Increase Transparency in Ad Spending
# Spending Transparency Dashboard
plt.figure(figsize=(10, 10))
sns.barplot(x='spend_usd_avg', y='geo_targeting_included', data=df)
plt.title('Ad Spending by Geographic Location')
plt.xlabel('Average Spend (USD)')
plt.ylabel('Geographic Location')
plt.show()
Based on the averages above, the locations with the highest average ad spend are Kentucky, Georgia, Florida, and Kansas.
plt.figure(figsize=(10, 10))
sns.barplot(x='impressions_avg', y='geo_targeting_included', data=df)
plt.title('Ad Impressions by Geographic Location')
plt.xlabel('Average Impressions')
plt.ylabel('Geographic Location')
plt.show()
The locations with the highest average ad impressions are Kentucky, Georgia, Kansas, and California.
# df.groupby('geo_targeting_included')['spend_usd_avg'].mean().plot(kind='barh',
# figsize=(10,8), fontsize=10)
# plt.show()
df.groupby('ad_type')['spend_usd_avg'].mean().plot(kind='barh', figsize=(10, 5), fontsize=12)
plt.show()
The highest average USD spend is on the Video ad type, compared to the other ad types.
df.groupby('ad_type')['impressions_avg'].mean().plot(kind='barh', figsize=(10, 5), fontsize=12)
plt.show()
The highest average impressions are for the Video ad type, compared to the other ad types.
# Increase Transparency in Ad Spending
# Create a bar plot using seaborn
plt.figure(figsize=(15, 10))
sns.barplot(data=df, y='geo_targeting_included', x='spend_usd_avg', hue='ad_type', palette='tab10')
plt.title('Average Spend USD by Geo Targeting and Ad Type')
plt.xlabel('Average Spend (USD)')
plt.ylabel('Geographic Location')
plt.legend(title='Ad Type')
plt.show()
# 2. Enhance Fairness in Targeted Marketing
def demographic_parity(df, protected_attr):
    # Down-sample every protected-attribute group to the size of the smallest group
    group_sizes = df[protected_attr].value_counts().min()
    parity_data = df.groupby(protected_attr, group_keys=False).apply(lambda x: x.sample(group_sizes)).reset_index(drop=True)
    return parity_data
# Function to visualize the distribution of a protected attribute
def visualize_distribution(df, protected_attr, title):
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, y=protected_attr)
    plt.title(title)
    plt.xlabel('Count')
    plt.ylabel(protected_attr)
    plt.show()
# Visualize distribution before applying demographic parity
visualize_distribution(df, 'geo_targeting_included', 'Distribution before Demographic Parity')
# Apply demographic parity
df = demographic_parity(df, 'geo_targeting_included')
# Visualize distribution after applying demographic parity
visualize_distribution(df, 'geo_targeting_included', 'Distribution after Demographic Parity')
# Check the resulting dataframe
print(df.head())
ad_type impressions spend_usd geo_targeting_included impressions_avg \
0 Text <=10k <100 Alabama 10000.0
1 Image <=10k <100 Alabama 10000.0
2 Video 10k-100k 100-1k Alabama 55000.0
3 Text <=10k <100 Alabama 10000.0
4 Video <=10k <100 Alabama 10000.0
spend_usd_avg ...
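As a quick follow-up check (an addition, assuming the rebalanced df from the step above is still in scope), the sketch below confirms that every geography now contributes the same number of rows, so group-level rates can be compared on an equal footing.

# Verification sketch: after demographic-parity resampling every group should be the same size.
counts = df['geo_targeting_included'].value_counts()
print('min group size:', counts.min(), '| max group size:', counts.max())  # expected to be equal
print(df.groupby('geo_targeting_included')['spend_usd_avg'].mean().sort_values().head())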