
This is a summary of my work experience, technical skills and subject matter expertise. If you would like a copy of my resume, please email me directly and I will forward it to you.

I am a statistician. After I obtained my doctorate in Statistics, I taught Mathematics and Statistics at Northern Arizona University for five years. I then moved to industry and spent the next twenty-plus years in the healthcare industry, primarily in health plans for Medicare and Medicaid.

I’m passionate about predictive analytics and machine learning, and with my statistical background I strongly believe that I can become a data scientist. So in January 2018 I enrolled in the DevMasters Mastering Applied Data Science program, and I successfully completed it in April 2018. Below I briefly describe the program and the skills I acquired.

Section One : as a Data Scientist

DevMasters is a training company headquartered in Irvine, California. It has been training data scientists since 2012. Its Mastering Applied Data Science course combines a Data Science Bootcamp with project-based learning to give students not only a solid foundation in data science, but also project experience.

The program comprises in-person classes on six weekends, homework assignments, project-based learning projects and presentations, and Kaggle competition participation. It lasts about four months. Before the first session, students complete PreWork modules (Python 101, Statistics 101, Web 101 and SQL 101) to get up to speed.

Session 1: Introduction to Data Science with Python. We learn some intermediate Python functions, the mindset expected of a data scientist, and how to take full advantage of the program by using the Applied labs environments.

Session 2: Visualization & Exploratory Data Analysis. We review some statistical concepts and are introduced to NumPy and Pandas. We learn how to clean and manipulate data, perform exploratory data analysis, do web scraping and extract data from APIs.

Session 3: Data Mining Utilizing NumPy & Pandas. We dive deeper into more advanced statistical techniques for cleaning and munging data with NumPy and Pandas, and learn to use the Matplotlib and Seaborn packages to visualize data and identify trends.

Session 4: Machine Learning. We learn the difference between supervised and unsupervised machine learning. We start with regression model building (linear, ridge, lasso and gradient boosting models), followed by classification model building.

Session 5: Advanced Machine Learning. We dive deep into classification models: Naive Bayes, Logistic Regression, Decision Trees, Random Forest and Support Vector Machines. We also learn to use appropriate metrics to compare model performance: R-squared and Root-Mean-Squared Error for regression models and, for classification models, precision, recall, sensitivity and specificity from a confusion matrix, the accuracy score, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve.
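As a small illustration of the confusion-matrix metrics named in Session 5, here is a minimal sketch in plain Python. The function names and the toy labels are invented for this example and are not taken from the course material.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall (sensitivity) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall and sensitivity are two names for the same quantity.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels the model finds three of the four positives (recall 0.75), but two of its five positive predictions are wrong (precision 0.6), which is why a single accuracy figure rarely tells the whole story.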

Session 6: Hack Day. We learn how to combine the training and test datasets from the Kaggle site to build a model from end to end: data cleaning, feature engineering and model selection. Finally, each of us submits one set of predicted values to see how our model compares with those built by others in the country.

Session 7: Recommendation Systems. We learn what a recommendation system is and how it can help an e-commerce business, including the long-tail phenomenon. We learn such techniques as memory-based and model-based collaborative filtering, user-based and item-based nearest-neighbor methods, the Pearson correlation and cosine similarity measures, the data sparsity problem, and more recent methods based on Singular Value Decomposition (SVD).
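To illustrate two of the similarity measures mentioned in Session 7, here is a minimal sketch of a user-based nearest-neighbor lookup in plain Python. The ratings dictionary, the user names and the helper names are all hypothetical and chosen only for this illustration.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Pearson correlation: cosine similarity of mean-centered ratings."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


def most_similar_user(ratings, target, similarity=pearson_similarity):
    """Return the other user whose co-rated items best match the target's."""
    best_user, best_score = None, float("-inf")
    for user, user_ratings in ratings.items():
        if user == target:
            continue
        # Compare only items that both users have rated.
        shared = sorted(set(ratings[target]) & set(user_ratings))
        if not shared:
            continue
        score = similarity(
            [ratings[target][item] for item in shared],
            [user_ratings[item] for item in shared],
        )
        if score > best_score:
            best_user, best_score = user, score
    return best_user


# Made-up ratings on a 1-5 scale, for illustration only.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 3},    # same taste as alice, stricter scale
    "carol": {"A": 1, "B": 5, "C": 2},  # roughly the opposite taste
}
print(most_similar_user(ratings, "alice"))
```

Pearson correlation removes each user's average rating before comparing, so a consistently stricter rater can still be a close neighbor. Cosine similarity on raw ratings stays positive even for users with opposite tastes, because all ratings are positive numbers.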

Session 8: Natural Language Processing and Sentiment Analysis. We learn to use a Natural Language Processing toolkit to process and extract text data, learn to perform sentiment analysis, and work on Twitter feed and Yelp review projects.

Session 9: Big Data with Spark & Splunk. We are introduced to Big Data, the Hadoop ecosystem and the MapReduce paradigm. The existing MovieLens project is transferred to AWS to see the difference.

Session 10: Deep Learning and Time Series. We are introduced to Deep Learning and to training neural networks, visualize what a neural network is, and explore the TensorFlow Playground.

Session 11: Computer Vision with OpenCV and Hack Project. We are introduced to computer vision fundamentals, using OpenCV to detect faces, people, cars and other objects.

Session 12: Hack Day. In the last session we participate in a Kaggle competition on advanced housing prediction in Ames, Iowa. Every student has to submit one set of predicted values to assess the skills he or she has learned in the class.

Added Value

I found the following experiences the most valuable; they are things one would not have learned from online courses:

  1. Web scraping;
  2. Twitter feed analysis using Natural Language Processing (NLP);
  3. how to participate in a Kaggle competition;
  4. how to create a GitHub page to broadcast my learning experience and projects.

The datasets we worked on include the Titanic dataset, the Iris dataset, King County housing price prediction, the Election Day project, Yelp review votes and stars (NLP), NBA 2014-15 player statistics (KNN prediction) and Glass identification. The six project-based learning projects are service disruption prediction, loan default prediction, advanced housing prediction (Ames, Iowa, a Kaggle competition), device failure prediction, Twitter feeds using NLP, and recommender systems.

Section Two : as a Healthcare Statistician

Senior Survey Data Analyst

  1. Provider Satisfaction Survey
  2. CAHPS - member satisfaction survey

Quality Health Analytics, Director

  1. NCQA/HEDIS
  2. Program effectiveness evaluation
  3. SF-12 analysis for complex case management
  4. Identification of impactable members for care management using predictive models
  5. Social Determinants of Health (SDOH) - Area Deprivation Index (ADI) for Medicaid population

Section Three : Domain Knowledge

  1. Healthcare - Medicare and Medicaid, Marketplace
  2. Financial - Incurred But Not Reported (IBNR) reserve projection; Loan default classification
  3. Time series programming and modeling (SAS and R)
  4. Quality Health Analytics: HEDIS/NCQA - rate projection using time series
  5. Program evaluation: twang and fastDR (R packages) to adjust for selection bias
  6. Natural Language Processing (NLP) and Sentiment Analysis
  7. Medicare Risk Adjustment - CMS-HCC model
  8. Medicaid Risk Adjustment - CDPS, MRx and CDPS+MRx combined model
  9. Marketplace Risk Adjustment - HHS-HCC model
  10. Sample size determination and Power calculation; statistical hypothesis testing
  11. Bayesian statistics
  12. Sampling design and survey data analysis

Section Four : Programming Skills

  1. Python/Jupyter Notebook
  2. R/RStudio - Rmarkdown for reproducible research
  3. SAS
  4. SQL
  5. Stata
  6. Microsoft Office
  7. Healthcare claims data