Econometric Methods Problem Set 3
For all parts of this problem, use the BCUSE dataset wage2.
Run a simple regression of IQ (as the dependent variable) on educ. Call the coefficient on educ $\tilde{\delta}_1$. What is $\tilde{\delta}_1$?
Run a simple regression of log(wage) (as the dependent variable) on educ. Call the coefficient on educ $\tilde{\beta}_1$. What is $\tilde{\beta}_1$?
Run a multiple regression of log(wage) on educ and IQ. Call the coefficient on educ $\check{\beta}_1$ and the coefficient on IQ $\check{\beta}_2$. What are $\check{\beta}_1$ and $\check{\beta}_2$?
Does the equation we discussed in class on omitted variable bias actually hold in this example? (Hint: Lecture 9 has the equation.)
Does what you found imply that the coefficient on educ from the simple regression of log(wage) on educ (part b) is biased? If so, explain why (i.e., which assumption is violated and why).
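As a reminder for the omitted variable bias check above, the relationship is presumably the standard in-sample identity linking the three regressions. This is a sketch under the assumption that it matches the Lecture 9 equation, which is not reproduced in this problem set. Here $\tilde{\beta}_1$ is the coefficient on educ from the simple regression of log(wage) on educ, $\check{\beta}_1$ and $\check{\beta}_2$ are the coefficients on educ and IQ from the multiple regression, and $\tilde{\delta}_1$ is the coefficient on educ from the regression of IQ on educ.

```latex
% Assumed form of the omitted variable bias identity (standard textbook version).
% It holds exactly in-sample when all three regressions use the same observations.
\[
  \tilde{\beta}_1 \;=\; \check{\beta}_1 \;+\; \check{\beta}_2 \,\tilde{\delta}_1
\]
```

To check it, plug your three estimated coefficients into the right-hand side and compare the result with the simple-regression coefficient. Any small discrepancy should be due to rounding.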
Consider the equation below:
log(salary)=β_0+β_1 log(sales)+β_2 roe+β_3 ros+u
where roe is the return on equity and ros is the return on the firm’s stock.
In terms of the model’s parameters, state the null hypothesis that, after controlling for sales and roe, the variable ros has no effect on salary.
Use the BCUSE dataset ceosal1. Estimate the regression above. By what percent is salary predicted to increase if ros increases by 50 points? Express your answer as a decimal rounded to three places (e.g., 50 percent is 0.500).
Derive (or report from Stata) the t-statistic for a test of the null hypothesis that ros has no effect on log(salary) against the one-sided alternative that its effect is greater than 0.
Would you reject the null at the 10 percent level of significance?
After doing this test, would you continue to include ros in your regression of CEO salary on firm performance? Explain.
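For the prediction and test in this problem, the following generic formulas for a log-level model may be useful. This is a sketch using standard textbook results, not necessarily the notation from lecture. The symbols $y$, $x_j$, $\hat{\beta}_j$, $a$, $n$, and $k$ are placeholders introduced here: $x_j$ is any regressor that enters in levels, $a$ is the value of $\beta_j$ under the null, $n$ is the sample size, and $k$ is the number of regressors.

```latex
% Log-level model: log(y) = ... + beta_j x_j + ... + u
% Approximate and exact predicted percent change in y when x_j changes by Delta x_j
\[
  \%\Delta \hat{y} \;\approx\; 100\,\hat{\beta}_j\,\Delta x_j ,
  \qquad
  \%\Delta \hat{y} \;=\; 100\left[\exp\!\big(\hat{\beta}_j\,\Delta x_j\big) - 1\right]
\]
% t-statistic for H0: beta_j = a, compared with a t distribution with n - k - 1 degrees of freedom
\[
  t \;=\; \frac{\hat{\beta}_j - a}{\operatorname{se}(\hat{\beta}_j)}
\]
```

For a one-sided alternative of the form $\beta_j > a$, the null is rejected when $t$ exceeds the critical value for the chosen significance level. The approximate and exact percent changes are close when $\hat{\beta}_j\,\Delta x_j$ is small.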
Use the BCUSE dataset mlb1 to estimate, in Stata, the equation discussed in lecture:
log(salary)=β_0+β_1 years+β_2 gamesyr+β_3 bavg+β_4 hrunsyr+β_5 rbisyr+u
Use Stata to find the correlation between hrunsyr and rbisyr.
Given this, why is it not particularly surprising that neither coefficient is statistically different from 0 when you run this regression?
Drop the variable rbisyr. What happens to the p-value of hrunsyr?
Why does dropping rbisyr make this kind of difference?
While still leaving out rbisyr, now run the regression adding the variables runsyr, fldperc, and sbasesyr. Are any of these variables statistically significant at the five percent level?
Use an F-test to see if those three variables are jointly significant.
What is the F-statistic?
What is the critical value at the 1 percent significance level for this test (see the table in the book)?
Are the variables jointly significant?
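For the joint test above, the F-statistic can be computed by hand from the restricted regression (without the three added variables) and the unrestricted regression (with them). This sketch uses the standard textbook formula. The symbols $SSR_r$, $SSR_{ur}$, $R^2_r$, $R^2_{ur}$, $q$, $n$, and $k$ are notation introduced here and may differ from the lecture's.

```latex
% q = number of exclusion restrictions being tested
% k = number of regressors in the unrestricted model, n = number of observations
\[
  F \;=\; \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}
    \;=\; \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n-k-1)}
\]
```

The R-squared form applies because both regressions have the same dependent variable. The critical value comes from the F distribution with $q$ and $n-k-1$ degrees of freedom. In Stata, running the test command with the three variable names right after the unrestricted regression should report the same statistic.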
The following model allows the return to education to depend on the total amount of the parents’ education, called pareduc:
log(wage)=β_0+β_1 educ+β_2 educ*pareduc+β_3 exper+β_4 tenure+u
What is the approximate percent change in wage from a one-year increase in educ (expressed in terms of the model parameters, i.e., in symbols)?
Do you have an expectation about the sign of β_2? Explain your intuition.
Use the BCUSE dataset WAGE2 to estimate the equation. You will need to generate the variable pareduc (parents’ education) as the sum of meduc and feduc (mother’s and father’s education, respectively). Whenever you run a regression, you should know whether any observations are left out of it (e.g., you have 500 observations but only 440 show up in the regression). Are any of the observations in the dataset dropped from the regression? Can you find out why? (Hint: the tab command with the “missing” option can be useful.)
Given this regression, what is the return to an extra year of education when pareduc is 24?
Real World Problem – Gender Representation and Pay in Large Companies. You will be using Wharton Research Data Services (WRDS), which hosts a range of financial data. This exercise focuses on the representation of women in executive positions within companies. Follow the steps below to get the data you will need.
Go to the Wharton Research Data Services website at https://wrds-www.wharton.upenn.edu/ and log in using our class’s login, with username “econ331701sprg” and password “olsregression”. [It should not ask for Duo two-factor authentication, but let me know if it does.] Click on the “Compustat – Capital IQ” subscription. Then click on “Execucomp” and then “Annual Compensation.”
For Step 1, select the date range 2010 to 2020 (make sure you stop in 2020 so that you get the same answers I have; do not download 2021 or 2022 data, which may still be updating with late submissions). For Step 2, select “Search the Entire Database.” In Step 3, pick the following variables: Company Name, Industry (NAICS), Gender, Age (it says Director Age, I’m not sure why), CEOANN (variable describing if CEO), CFOANN (variable describing if CFO), TDC1 (total compensation), and Year (the year the observation is from). In Step 4, make sure to select Stata and uncompressed (the date format shouldn’t matter). The dataset should be about 13MB. Save it to your L: drive.
It is always a good idea to know what data you are working with. In 3-4 sentences, describe the Executive Compensation Dataset. Who collects it? Which companies are included? Which executives at those companies?
You are interested in female representation in high-level executives at large companies. What fraction of the executives in the dataset are female? Input your answer as a decimal rounded to two digits (e.g., 12.4 percent is input as 0.12).
You are also interested in how this share has changed over time. Make a line graph of the share of all executives in the dataset who were women, with year on the x-axis. Do not make the graph in Stata. Instead, make it in Excel, following the format provided on the assignment page, using a block of code in a do-file similar to this:
* preserve tells Stata to save the data temporarily
preserve
* collapse takes the mean of female within each year and reduces the data to the yearly level, so your dataset is now very small
collapse (mean) female, by(YEAR)
* replace "Your L:" with the location of the Excel file on your L: drive
export excel using "Your L:", replace firstrow(var)
* now your data is back
restore
Now you should have data in Excel with the share of female executives in the dataset for each year. Make the graph according to the format, and label the first and last points to illustrate the change over the period. Upload your graph as a Microsoft Word document, with an appropriate title and sourcing. (3 points)
The CEO and CFO annual flags indicate whether this particular executive is the CEO or CFO at that time. Generate indicators for the executive being a CEO or a CFO. What share of executives in the data are CEOs? Input your answer as a decimal rounded to two digits (e.g., 12.4 percent is input as 0.12).
Generate variables for the log of TDC1 (total executive compensation) and for age squared. Run a regression of log TDC1 on female, ceo, cfo, age, age squared, and indicators for each year (you can do this simply by typing i.year). According to the regression, what is the percent reduction in compensation for a woman in the same position as a man, of the same age, and in the same year? Input your answer as a decimal rounded to two digits (e.g., 12.4 percent is input as 0.12).
At what age does the regression suggest that an additional year of age starts to be associated with a decrease in compensation? Round to the nearest year.
You wonder whether women executives work in different, lower-paying industries, and whether that might account for the pay difference. Run a regression controlling for industry (hint: using xi: regress and then i.NAICS would be wise). Write 1-2 sentences on whether you think industry accounts for the difference (i.e., does the coefficient on female move much closer to zero?).
Do you think we should interpret this coefficient causally? Are there any variables we might be missing? Write 3-4 sentences.
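For the earlier question on the age at which an additional year is associated with lower pay, the quadratic in age implies a turning point. This sketch uses the standard result for a quadratic specification. The symbols $\beta_a$ and $\beta_{a2}$ are placeholder names introduced here for the coefficients on age and age squared.

```latex
% Regression includes ... + beta_a * age + beta_{a2} * age^2 + ...
% Marginal association of one more year of age with log(TDC1):
\[
  \frac{\partial \log(\mathit{TDC1})}{\partial\, \mathit{age}}
  \;=\; \beta_a + 2\,\beta_{a2}\,\mathit{age}
\]
% This is zero at the turning point:
\[
  \mathit{age}^{*} \;=\; -\,\frac{\beta_a}{2\,\beta_{a2}}
\]
```

If the estimate of $\beta_a$ is positive and the estimate of $\beta_{a2}$ is negative, the association with compensation is positive below $\mathit{age}^{*}$ and negative above it.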