The purpose of the Phase 1 report is to choose a medical condition for your data analyses and to gain a basic understanding of it: its background, causes, treatments, current status, and influential factors. Data analysis requires data, so you also need to find a dataset for your project. I strongly encourage you to use the Medical Expenditure Panel Survey (MEPS) data in this class, although I also list other potential datasets below. Note that I will provide sample code only for the MEPS data. Your analysis needs at least 500 data instances, so after choosing a medical condition of interest, check that the dataset contains at least 500 instances of it. More details are given below.
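The 500-instance check can be sketched in a few lines of Python. This is a minimal sketch rather than the course's sample code: the column names PERSON_ID and ICD9 and the toy rows are hypothetical, and it reports both matching rows and distinct persons because the requirement could be read either way. The code 477 is the ICD-9 category for allergic rhinitis.

```python
import csv
import io

# Hypothetical condition-level extract: one row per reported condition.
# PERSON_ID and ICD9 are illustrative column names, not taken from a codebook.
SAMPLE = """PERSON_ID,ICD9
1001,477
1001,401
1002,477
1003,250
1002,477
"""

MIN_INSTANCES = 500  # threshold stated in the assignment


def count_instances(csv_file, code, id_col="PERSON_ID", code_col="ICD9"):
    """Return (matching rows, distinct persons) for one condition code."""
    rows = 0
    persons = set()
    for record in csv.DictReader(csv_file):
        if record[code_col].strip() == code:
            rows += 1
            persons.add(record[id_col])
    return rows, len(persons)


rows, persons = count_instances(io.StringIO(SAMPLE), "477")
print(f"{rows} condition records from {persons} distinct persons")
print("enough data" if persons >= MIN_INSTANCES else "too few instances")
```

With a real file, replace `io.StringIO(SAMPLE)` with an opened CSV file and set `id_col` and `code_col` to the names given in that dataset's codebook.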
MEPS is a set of large-scale surveys of families and individuals, their medical providers, and employers across the United States. MEPS is the most complete source of data on the cost and use of health care and health insurance coverage in the US. For more details, please visit the official website and GitHub repository, and read the attached appendix document (pp. 1–13).
MEPS data consist of various variables, such as medical conditions, socioeconomic factors (e.g., gender, region, race, and family income), and medical expenditures. The data are also spread across several files: person-level (e.g., health status, demographics, and total cost of care), event-level (e.g., healthcare service use), and condition-level (e.g., medical conditions). For a full review of those variables, please look at the codebooks (person-level and condition-level). I also coded each variable by its usefulness for analysis (included vs. excluded, Heejun_Inclusion field) and its variable type (independent vs. dependent, Heejun_Variable_Type field). You can find my version of the codebook at this link. In particular, there are some dependent variables you can use:
Total health expenditures
Total inpatient expenditures
Total emergency care expenditures
Severity of Illness (attacks/year)
Number of School Days Missed (Children)
Number of Work Days Missed (Adult)
You should explore the dataset in depth to understand what you can do and to decide what you will do. It is a complex dataset, and you need to merge a number of files into one for your project. Do not feel overwhelmed. I will introduce all procedures step by step.
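To give a feel for what merging files involves, here is a minimal sketch that keeps the person-level rows of everyone who reports a given condition in a condition-level file. It is an illustration under assumptions, not the course's sample code: the toy rows and the ICD9, REGION, and TOTEXP column names are hypothetical, and DUPERSID is assumed to be the person identifier shared by both files (please confirm it in the codebooks).

```python
import csv
import io

# Toy stand-ins for a person-level file and a condition-level file.
# All values are made up; ICD9, REGION, and TOTEXP are hypothetical names.
PERSON_CSV = """DUPERSID,REGION,TOTEXP
1001,1,1200
1002,3,450
1003,2,0
"""
CONDITION_CSV = """DUPERSID,ICD9
1001,477
1002,477
1002,401
"""


def merge_on_person(person_file, condition_file, code,
                    key="DUPERSID", code_col="ICD9"):
    """Keep person-level rows for people who report `code` at least once."""
    with_condition = {
        row[key]
        for row in csv.DictReader(condition_file)
        if row[code_col] == code
    }
    return [row for row in csv.DictReader(person_file)
            if row[key] in with_condition]


merged = merge_on_person(io.StringIO(PERSON_CSV),
                         io.StringIO(CONDITION_CSV), "477")
for row in merged:
    print(row["DUPERSID"], row["REGION"], row["TOTEXP"])
```

Collecting the identifiers into a set first means a person who reports the same condition several times still appears only once in the merged result.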
Depending on the medical condition (e.g., allergic rhinitis), the research goals you can set include, but are not limited to:
Predict the yearly medical expenditure of persons with allergic rhinitis
Compare healthcare costs across social determinant factors (e.g., sex, region, family income, and race)
Find relationships between allergic rhinitis and environmental factors
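As an illustration of the comparison goal above, the sketch below averages yearly expenditure within each group of a social determinant factor. The records, region labels, and dollar amounts are invented for the example; a real analysis would use the merged dataset and would likely add a statistical test.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical merged records: (group label, yearly expenditure in dollars).
records = [
    ("Northeast", 1200.0),
    ("South", 450.0),
    ("Northeast", 800.0),
    ("West", 300.0),
    ("South", 550.0),
]


def mean_by_group(pairs):
    """Average the values that share the same group label."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {label: mean(values) for label, values in groups.items()}


summary = mean_by_group(records)
for label in sorted(summary):
    print(f"{label}: {summary[label]:.2f}")
```

The same function works for any categorical factor (sex, race, or an income bracket) as long as each record carries a group label and a numeric cost.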
If you are skilled in Python and want to use another dataset, that choice is up to your group. However, please note that I cannot fully support your group in that case, and you will need to adapt my sample code substantially to account for the differences between your dataset and the MEPS dataset. The following datasets are publicly available and free:
National Health and Nutrition Examination Survey
Early Childhood Longitudinal Studies Program
Add Health
FDA Adverse Event Reporting System (FAERS)
What to Do for the Phase 1 Report
Please find a medical condition that interests your group and look up its ICD-9 code on FINDACODE.COM.
If your group decides to use the MEPS data, go to #3. If not, go to #5.
Download a condition-level file (h128.csv