What is your expected salary?

Knowing the salary before going into the interview.

The reason the topic was selected

Why Data Science Salaries?

We selected this topic because it interested all of us, including our classmates. As we graduate from this course and enter the job market, this analysis gives us a frame of reference when negotiating offers.

The questions to answer with this analysis

What is our expected salary?

The analysis shows how different factors, such as location, years of experience, gender, and company, can affect the anticipated salary.


Data Source:

Salary and more-Data Scientist, Analyst, Engineer

This dataset has 62,000 salary records from top companies. It contains information such as company, location, education level, compensation (base salary, bonus, stock grants), race, and other details.


What skills must a data scientist have?


Filter the titles and companies to see what skills are required or highly valued by the current employees.





How do external and internal factors affect data scientist salaries?


What are the external factors:

  • Company Locations (States & Cities)
  • Average salary by state: the West Coast has higher average wages.


  • Opportunities by company and city: the majority of jobs are located on the West and East Coasts, where many large tech companies are based.



  • Company Size (Number of Data Scientist Positions)
  • Average Salary by company and number of Data scientist positions: Companies that have more data scientists do not necessarily offer higher salaries.
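The group averages behind findings like these can be sketched in a few lines. This is only an illustration: the field names (`state`, `total_comp`) and the sample records are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; field names and values are illustrative only.
records = [
    {"state": "CA", "total_comp": 200_000},
    {"state": "CA", "total_comp": 240_000},
    {"state": "WA", "total_comp": 210_000},
    {"state": "TX", "total_comp": 150_000},
    {"state": "TX", "total_comp": 170_000},
]

def average_by(rows, key, value="total_comp"):
    """Group rows by `key` and return {group: mean of `value`}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}

avg_by_state = average_by(records, "state")
# Order the states from highest to lowest average compensation.
ranking = sorted(avg_by_state, key=avg_by_state.get, reverse=True)
```

The same helper would work for any other grouping column, such as company or gender, by changing the `key` argument.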




What are the internal factors:

  • Education level
  • People with higher education levels tend to have higher salaries.

  • Gender
  • Males have a higher average salary than females.

  • Years of experience
  • More years of experience tends to mean a higher salary.

  • Data Cleaning and Feature Selection
  • Remove NaN values. Filter out foreign locations. Remove outliers.




Analysis of Data using Machine Learning Model


Our data cleaning process began by filtering for "Data Scientist" in our data set, which left us with about 2,500 data entries. Data exploration initially sought to define the target and feature variables. Our target was the self-reported total yearly compensation. Our focus features included years of experience, location, company size, and gender, and all columns not related to these features were eliminated. Through data exploration and graphical displays, we were able to determine outlier thresholds. We eliminated these outliers because the number of remaining data points was still sufficient to train a model. After removing the outliers, we encoded the categorical values. This preprocessing allowed us to test different machine learning models.
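The cleaning steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code: the field names, the sample records, and the `MAX_COMP` outlier threshold are all assumptions, and company size is left out to keep the example short.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw = [
    {"title": "Data Scientist", "state": "CA", "years_experience": 3,
     "gender": "Female", "total_comp": 210_000},
    {"title": "Data Scientist", "state": "WA", "years_experience": 5,
     "gender": "Male", "total_comp": 230_000},
    {"title": "Software Engineer", "state": "CA", "years_experience": 4,
     "gender": "Male", "total_comp": 250_000},        # wrong title
    {"title": "Data Scientist", "state": "NY", "years_experience": None,
     "gender": "Female", "total_comp": 190_000},      # missing value
    {"title": "Data Scientist", "state": "CA", "years_experience": 20,
     "gender": "Male", "total_comp": 2_500_000},      # outlier
]

FEATURES = ["state", "years_experience", "gender"]
TARGET = "total_comp"
MAX_COMP = 1_000_000  # assumed outlier threshold, not the project's value

def clean(rows, title="Data Scientist", max_comp=MAX_COMP):
    """Keep one job title, drop unused columns, missing values, and outliers."""
    kept = []
    for row in rows:
        if row["title"] != title:
            continue
        slim = {name: row.get(name) for name in FEATURES + [TARGET]}
        if any(value is None for value in slim.values()):
            continue
        if slim[TARGET] > max_comp:
            continue
        kept.append(slim)
    return kept

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

cleaned = clean(raw)
encoded = one_hot(one_hot(cleaned, "state"), "gender")
```

Here a single upper bound stands in for the outlier thresholds; in practice the bounds would come from the exploration and plots mentioned above.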

We knew that our model needed to be a regression model in order to predict a continuous variable. We began with a multiple linear regression model, whose results indicated that we might need a more sophisticated model. Harnessing the power of decision trees, we then examined the random forest regressor. This allowed us to increase predictive accuracy and control potential overfitting, and adjusting the number of estimators let us tune the model further. Still, we wondered whether another model might work better for our data. We briefly explored a neural network, but this model was prone to overfitting given our smaller sample size and did not perform better than the other models. We then turned to extreme gradient boosting (XGBoost). This ensemble model adds trees sequentially, with each new tree fit to correct the errors of the trees before it, which reduces the loss. This model ultimately proved the best, with predictive accuracy above 65% and a lower mean absolute percentage error (MAPE) of 0.1862.
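To make the boosting idea concrete, here is a toy sketch of gradient boosting with squared-error loss, where each "tree" is a one-split stump fit to the current residuals. It is not XGBoost, which adds regularization and other refinements that this sketch omits, and the data (years of experience against compensation in thousands) is made up for illustration.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Skip the largest x so the right-hand side of a split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fit to the residuals of the ensemble so far."""
    base = mean(ys)
    stumps = []
    predictions = [base] * len(xs)
    losses = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
        losses.append(mean([(y - p) ** 2 for y, p in zip(ys, predictions)]))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses

# Made-up data: years of experience and compensation in thousands of dollars.
years = [0, 1, 3, 5, 7, 10]
comp = [120, 130, 160, 190, 230, 270]
predict, losses = boost(years, comp)
```

The `losses` list records the training mean squared error after each added stump, which shows how every new tree reduces the remaining error.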


Side-by-side comparisons of these models revealed the potential to combine their individual predictions by taking a simple mean. After trying different combinations, we found that the mean of the Random Forest Regressor and XGBoost regressor predictions reduced the MAPE to 0.1779. This mean of the two models is what we used in our forecast simulation below.
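As a rough sketch of this step, the snippet below computes MAPE and averages two sets of predictions. The prediction values are hypothetical stand-ins for the outputs of the two models, chosen so that their errors partly cancel; averaging does not lower the error in every case.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed as a fraction (0.18 = 18%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def mean_ensemble(*prediction_lists):
    """Average the predictions of several models, one row at a time."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]

# Hypothetical numbers, not results from the project.
actual = [200_000, 150_000, 300_000, 250_000]
rf_pred = [220_000, 140_000, 270_000, 265_000]   # stand-in for random forest output
xgb_pred = [190_000, 165_000, 310_000, 230_000]  # stand-in for XGBoost output

combined = mean_ensemble(rf_pred, xgb_pred)
scores = {
    "random_forest": mape(actual, rf_pred),
    "xgboost": mape(actual, xgb_pred),
    "mean_of_both": mape(actual, combined),
}
```

Comparing the entries in `scores` mirrors the side-by-side comparison described above: the combination is kept only if its MAPE is lower than that of each individual model.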



Salary Estimation


Salary predictions from our trained machine learning model can help interviewees negotiate better compensation packages for future job opportunities.

Name          Education Level    Company   State  Years of Experience  Expected Salary
Anna Conda    Other              Netflix   CA     0                    $177,000
Bat Man       Bachelor's Degree  Facebook  TX     3                    $212,000
Crystal Ball  Master's Degree    Facebook  NY     7                    $316,000
Dr. Pandas    Doctoral Degree    Amazon    WA     10                   $272,000