Econ 382 Econometrics
Instructor: Ma
Fall 2020
Data Project
Due 11:59pm Sunday December 20
Submit your project as a Word document on Blackboard. If you use Pages on a Mac, please export your Pages document to PDF. Only Word or PDF files will be accepted.
This project MUST be done individually and independently. Identical or substantially similar work will result in an F for all authors involved.
This project is designed to give you a flavor of how quantitative research is conducted in the real world, and how some of the econometric techniques we discussed in class are applied. Several statistical packages are available, but you are required to conduct this project using R/RStudio.
This project will involve the following tasks:
1. Find data on the CDC website, find out what variables are available, and how the variables are defined.
2. Formulate an empirical model with six or more variables. What you want to investigate with the model must make sense.
3. Download at least two datasets in SAS transport (XPT) files (the variables you choose must come from two or more datasets).
4. Load the SAS transport data files into RStudio.
5. Merge the datasets into one.
6. Summarize descriptive statistics of the variables of interest.
7. Run a regression or regressions to estimate the coefficients.
8. Interpret the estimated coefficients, and conduct hypothesis tests.
9. Discuss the findings of the results and any problems with the model or the results. For example, are you missing any key variables? Is this likely to lead to omitted variable bias?
10. Write a 5-to-6-page report or memorandum in 12pt Times New Roman, double-spaced, with an appendix that includes all the commands and outputs from RStudio (the page count does not include the appendix). The report should describe in detail what you did, and how you did it, for each and every item listed above. You are expected to use the data to back up your arguments and tell a complete story.
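As a rough illustration of steps 6 through 8, here is a minimal sketch on simulated data (not NHANES data). The variable names bmi, age, and female, the coefficients used to generate the data, and the seed are all made up for this example. Your own model must use the variables you choose from the NHANES datasets.

```r
set.seed(382)  # arbitrary seed so the simulated numbers are reproducible

# Simulated data (NOT NHANES): an outcome and two explanatory variables
n <- 200
age <- round(runif(n, min = 20, max = 70))
female <- rbinom(n, size = 1, prob = 0.5)
bmi <- 22 + 0.08 * age + 1.0 * female + rnorm(n, sd = 3)
dat <- data.frame(bmi, age, female)

# Step 6: descriptive statistics of the variables of interest
summary(dat)

# Step 7: estimate the coefficients by OLS
fit <- lm(bmi ~ age + female, data = dat)

# Step 8: coefficient table with standard errors, t statistics, and p-values
summary(fit)

# 95% confidence intervals for the coefficients
confint(fit, level = 0.95)
```

Each row of the coefficient table in summary(fit) reports the t test of the null hypothesis that the corresponding coefficient equals zero.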
Where to Get R and RStudio?
Download R at http://www.r-project.org/, and install R on your computer. After installing R, download RStudio at http://www.rstudio.com/, and install RStudio on your computer.
Where to Get Data?
Go to the CDC (Centers for Disease Control and Prevention) homepage at https://www.cdc.gov/. Scroll down, and click on “CDC Organization”. On the CDC organization page, click on “National Center for Health Statistics” (NCHS). Next, under “Population Surveys”, click on “National Health and Nutrition Examination Survey” (NHANES). On the NHANES page, click on “Questionnaires, Datasets, and Related Documentation”, then select the “NHANES 2017-2018” data. There, you will see 5 categories of data that you have access to: Demographics Data, Dietary Data, Examination Data, Laboratory Data, and Questionnaire Data. Each category has multiple datasets (except for Demographics Data). You are required to use at least TWO datasets. I do suggest that you include Demographics Data, where you can find most of the common demographic variables such as age, education, income, etc.
How to Download the Datasets in SAS Transport (XPT) Files?
Right-click on the XPT links to the right of the datasets that you choose, then select “Save link as”, and save the files on your flash drive or C: drive. You can rename them as you like.
How to Load the SAS XPT Files into RStudio?
Hint: You will need the package “Hmisc”. I have demonstrated in class quite a few times how to install packages. After installing and calling the package, you will need the command “sasxport.get” to open the XPT files.
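A minimal sketch of the loading step is shown below. The file names DEMO_J.XPT and BMX_J.XPT are only assumptions for this example, so replace them with the names you used when saving your files. The sketch also assumes the files are in your working directory. It checks that the package and the files are available before reading, so it runs without error either way.

```r
# Assumed file names for illustration; replace with your own file names
demo_file <- "DEMO_J.XPT"
exam_file <- "BMX_J.XPT"

# install.packages("Hmisc")  # run once if Hmisc is not installed yet

if (requireNamespace("Hmisc", quietly = TRUE) &&
    file.exists(demo_file) && file.exists(exam_file)) {
  # Equivalent to calling library(Hmisc) and then sasxport.get(...)
  demo <- Hmisc::sasxport.get(demo_file)
  exam <- Hmisc::sasxport.get(exam_file)

  # By default sasxport.get converts variable names to lower case
  print(head(names(demo)))
  print(dim(exam))
} else {
  message("Install Hmisc and place the XPT files in the working directory.")
}
```

Use getwd() to check your working directory, or give the full path to each file instead.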
How to Merge the Datasets?
Each individual in the datasets has an ID number called “seqn” (you can see it in the Variable Lists), and you will merge the datasets by this common variable. Hint: use the “merge” command. Find out by yourself how this command works.
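To get you started, here is a small sketch of “merge” on two made-up data frames that stand in for two NHANES datasets. The values and the variables age and bmi are invented for illustration. Only the key variable seqn comes from the actual data.

```r
# Two made-up data frames sharing the ID variable seqn
demo <- data.frame(seqn = c(1, 2, 3, 4), age = c(34, 51, 27, 62))
exam <- data.frame(seqn = c(2, 3, 4, 5), bmi = c(24.1, 30.5, 27.8, 22.0))

# Default behavior: keep only individuals that appear in BOTH data frames
both <- merge(demo, exam, by = "seqn")

# all = TRUE keeps everyone; values with no match are filled with NA
everyone <- merge(demo, exam, by = "seqn", all = TRUE)

nrow(both)      # 3 (seqn 2, 3, and 4)
nrow(everyone)  # 5 (seqn 1 through 5)
```

Which version is appropriate depends on your model, so check how many observations you keep after merging and report it.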
Other.
For some observations/individuals, some variables may have missing/empty values (shown as NA) or values that do not make sense (for instance, 99). When you compute the summary statistics, you will need to deal with them appropriately (for example, you should change those strange values to NA, and tell RStudio to ignore the NA values when calculating).
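A minimal sketch of this cleanup on a made-up vector is shown below. The variable name educ and its values are invented, and 99 is assumed to be a code that does not represent a real value. Check the documentation of each variable you use to find out which codes apply.

```r
# Made-up values; 99 is assumed to be a code, not a real value
educ <- c(3, 4, 99, 5, NA, 2)

# Recode 99 to NA (%in% works safely even when NA is already present)
educ[educ %in% 99] <- NA

mean(educ)                # NA, because missing values are not ignored
mean(educ, na.rm = TRUE)  # 3.5, computed from 3, 4, 5, and 2

summary(educ)             # also reports the number of NA values
```

Many functions such as mean, sd, min, and max accept na.rm = TRUE, and lm drops observations with NA in the model variables by default.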