Costa Rican household poverty level

EDA
data cleaning
surveys
linear regression
multivariate analysis
logistic regression
categorical data
clustering
This dataset is useful for public policy and international development. Using data from a household survey in Costa Rica, you can predict a categorical outcome (level of household poverty) from observable household characteristics. Or, you can come up with your own questions, such as predicting female head of household based on household characteristics.
Author

Selina Carter

Published

July 21, 2025

Data files
Data year

2018

Motivation

The Inter-American Development Bank (IDB), the largest source of development financing for Latin America and the Caribbean, is an inter-governmental institution concerned with improving the quality of life across the region. One key challenge is to identify the families that are most in need of assistance from social programs. However, the world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify. In Latin America, one popular method to estimate income qualification is the Proxy Means Test (PMT). With PMT, assistance-providing organizations use a model that considers a family’s observable household attributes—like the material of their walls and ceiling, or the assets found in the home—to classify their level of need.

While this is an improvement, accuracy remains a problem as Latin America’s population grows.

To improve on PMT, the IDB offered this dataset to Kaggle, hoping that new tools and methods beyond “traditional econometrics” would predict poverty more accurately than PMT does.

This particular dataset is from a household survey in Costa Rica; personally identifiable information (PII) has been removed.

Data

One row represents one respondent (a person) in our data sample, identified by the Id variable. Multiple people can be part of a single household (identified by the idhogar variable).

  • In the train dataset, there are 9,557 rows and 134 columns.
  • In the test dataset, there are 23,856 rows and 133 columns (it’s missing the Target column).

Data preview

costa_rica_train.csv

costa_rica_test.csv

Variable descriptions

Variable Description Level of analysis
Id respondent-level unique identifier respondent
Target (only in costa_rica_train.csv) ordinal variable indicating groups of income levels (1 = extreme poverty, 2 = moderate poverty, 3 = vulnerable households, 4 = non-vulnerable households) household
v2a1 monthly rent payment (in Costa Rican colones; for some year \(\le\) 2018) household
hacdor =1 if overcrowding by bedrooms household
rooms number of all rooms in the house household
hacapo =1 if overcrowding by rooms household
v14a =1 if has bathroom in the household household
refrig =1 if the household has refrigerator household
v18q =1 if owns a tablet respondent
v18q1 number of tablets household owns household
r4h1 number of males younger than 12 years of age household
r4h2 number of males 12 years of age and older household
r4h3 total number of males in the household household
r4m1 number of females younger than 12 years of age household
r4m2 number of females 12 years of age and older household
r4m3 total number of females in the household household
r4t1 number of people younger than 12 years of age household
r4t2 number of people 12 years of age and older household
r4t3 total number of people in the household household
tamhog number of total individuals in the household household
tamviv number of people living in the household household
escolari years of schooling respondent
rez_esc years behind in school respondent
hhsize number of total individuals in the household household
paredblolad =1 if predominant material on the outside wall is block or brick household
paredzocalo =1 if predominant material on the outside wall is wood, metal/zinc, or asbestos household
paredpreb =1 if predominant material on the outside wall is prefabricated or cement household
pareddes =1 if predominant material on the outside wall is waste material household
paredmad =1 if predominant material on the outside wall is wood household
paredzinc =1 if predominant material on the outside wall is metal (zinc) household
paredfibras =1 if predominant material on the outside wall is natural fibers household
paredother =1 if predominant material on the outside wall is other household
pisomoscer =1 if predominant material on the floor is mosaic, ceramic, or terrace material household
pisocemento =1 if predominant material on the floor is cement household
pisoother =1 if predominant material on the floor is other household
pisonatur =1 if predominant material on the floor is natural material household
pisonotiene =1 if no floor at the household household
pisomadera =1 if predominant material on the floor is wood household
techozinc =1 if predominant material on the roof is metal foil or metal (zinc) household
techoentrepiso =1 if predominant material on the roof is fiber cement household
techocane =1 if predominant material on the roof is natural fibers household
techootro =1 if predominant material on the roof is other household
cielorazo =1 if the house has a ceiling (in addition to a roof) household
abastaguadentro =1 if water provision exists inside the dwelling household
abastaguafuera =1 if water provision exists outside the dwelling household
abastaguano =1 if no water provision household
public =1 if electricity from public company (CNFL, ICE, or ESPH/JASEC) household
planpri =1 if electricity from private plant household
noelec =1 if no electricity in the dwelling household
coopele =1 if electricity from cooperative household
sanitario1 =1 if no toilet in the dwelling household
sanitario2 =1 if toilet connected to sewer or cesspool household
sanitario3 =1 if toilet connected to septic tank household
sanitario5 =1 if toilet connected to black hole or latrine household
sanitario6 =1 if toilet connected to other system household
energcocinar1 =1 if no main source of energy used for cooking (no kitchen) household
energcocinar2 =1 if electricity is main source of energy used for cooking household
energcocinar3 =1 if gas is main source of energy used for cooking household
energcocinar4 =1 if wood/charcoal is main source of energy used for cooking household
elimbasu1 =1 if rubbish disposal mainly by tanker truck household
elimbasu2 =1 if rubbish disposal mainly by burying household
elimbasu3 =1 if rubbish disposal mainly by burning household
elimbasu4 =1 if rubbish disposal mainly by dumping in an unoccupied space household
elimbasu5 =1 if rubbish disposal mainly by dumping in river, creek, or sea household
elimbasu6 =1 if rubbish disposal mainly other household
epared1 =1 if walls are bad household
epared2 =1 if walls are regular household
epared3 =1 if walls are good household
etecho1 =1 if roof is bad household
etecho2 =1 if roof is regular household
etecho3 =1 if roof is good household
eviv1 =1 if floor is bad household
eviv2 =1 if floor is regular household
eviv3 =1 if floor is good household
dis =1 if person is disabled respondent
male =1 if male respondent
female =1 if female respondent
estadocivil1 =1 if less than 10 years old respondent
estadocivil2 =1 if in civil union respondent
estadocivil3 =1 if married respondent
estadocivil4 =1 if divorced respondent
estadocivil5 =1 if separated respondent
estadocivil6 =1 if widow/er respondent
estadocivil7 =1 if single (never married) respondent
parentesco1 =1 if household head respondent
parentesco2 =1 if spouse/partner respondent
parentesco3 =1 if son/daughter respondent
parentesco4 =1 if stepson/daughter respondent
parentesco5 =1 if son/daughter-in-law respondent
parentesco6 =1 if grandson/daughter respondent
parentesco7 =1 if mother/father respondent
parentesco8 =1 if father/mother-in-law respondent
parentesco9 =1 if brother/sister respondent
parentesco10 =1 if brother/sister-in-law respondent
parentesco11 =1 if other family member respondent
parentesco12 =1 if other non-family member respondent
idhogar household-level identifier household
hogar_nin number of children 0 to 19 in household household
hogar_adul number of adults in household household
hogar_mayor number of individuals 65+ in the household household
hogar_total number of total individuals in the household household
dependency dependency rate = (number of household members younger than 19 or older than 64)/(number of household members between 19 and 64) household
edjefe years of education of male head of household household
edjefa years of education of female head of household household
meaneduc average years of education for adults (18+) household
instlevel1 =1 if no education respondent
instlevel2 =1 if incomplete primary respondent
instlevel3 =1 if completed primary respondent
instlevel4 =1 if incomplete academic secondary level respondent
instlevel5 =1 if complete academic secondary level respondent
instlevel6 =1 if incomplete technical secondary level respondent
instlevel7 =1 if complete technical secondary level respondent
instlevel8 =1 if undergraduate and higher education respondent
instlevel9 =1 if postgraduate higher education respondent
bedrooms number of bedrooms household
overcrowding number of persons per room household
tipovivi1 =1 if own and fully paid house household
tipovivi2 =1 if own house and paying in installments household
tipovivi3 =1 if rented household
tipovivi4 =1 if precarious ownership of house household
tipovivi5 =1 if other (such as assigned by government or borrowed) household
computer =1 if the household has notebook or desktop computer household
television =1 if the household has a TV household
mobilephone =1 if respondent has a mobile phone respondent
qmobilephone number of mobile phones respondent owns respondent
lugar1 =1 if Central region household
lugar2 =1 if Chorotega region household
lugar3 =1 if Central Pacific region household
lugar4 =1 if Brunca region household
lugar5 =1 if Huetar Atlantic region household
lugar6 =1 if North Huetar region household
area1 =1 if urban area household
area2 =1 if rural area household
age person’s age in years respondent

Questions

  1. This dataset requires some cleaning. First, you may wish to change the variable names (e.g., rename v2a1 to rent_monthly). Some variables are read in as characters but should be numeric. You may also want to combine binary variables into a single factor variable (e.g., combine tipovivi1, …, tipovivi5 into one factor variable). What else is necessary in your opinion?
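The cleaning steps named in Question 1 might look like the minimal sketch below. It runs on a made-up toy data frame rather than the real file. The helper `dummies_to_factor` and the tenure labels are hypothetical names introduced for illustration, and the helper assumes that exactly one dummy equals 1 in every row (it stops otherwise, including when a dummy is missing).

```r
# Toy rows standing in for costa_rica_train.csv (all values are made up)
toy <- data.frame(
  v2a1      = c("100000", NA, "150000", NA),
  tipovivi1 = c(0, 1, 0, 0),
  tipovivi2 = c(0, 0, 0, 0),
  tipovivi3 = c(1, 0, 1, 0),
  tipovivi4 = c(0, 0, 0, 1),
  tipovivi5 = c(0, 0, 0, 0)
)

# Rename a cryptic column, then convert it from character to numeric
names(toy)[names(toy) == "v2a1"] <- "rent_monthly"
toy$rent_monthly <- as.numeric(toy$rent_monthly)

# Hypothetical helper: collapse a set of 0/1 dummy columns into one factor.
# Assumes exactly one dummy equals 1 in each row.
dummies_to_factor <- function(data, cols, labels) {
  stopifnot(all(rowSums(data[cols]) == 1))
  idx <- max.col(as.matrix(data[cols]), ties.method = "first")
  factor(labels[idx], levels = labels)
}

tenure_cols   <- paste0("tipovivi", 1:5)
tenure_labels <- c("owned", "installments", "rented", "precarious", "other")
toy$tenure    <- dummies_to_factor(toy, tenure_cols, tenure_labels)

toy[c("rent_monthly", "tenure")]
```

Keeping the original dummy columns alongside the new factor makes it easy to check the recoding before dropping them.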

  2. The dataset has two levels of analysis: household-level and respondent-level. The type of cleaning you do may depend on the level of analysis you want to study. Are you predicting the Target variable? If so, how would you reshape the data? (Hint: you have Id for respondent-level analysis and idhogar for household-level analysis). For your convenience, below you will find code that identifies the level of analysis of each column.

    df_train <- read.csv("../data/costa_rica_train.csv")
    df_test <- read.csv("../data/costa_rica_test.csv")
    
    # Respondent-level variables (see the "Level of analysis" column above)
    resp_vars <- c("Id", "v18q", "escolari", "rez_esc", "dis", "male", "female",
                   paste0("estadocivil", 1:7), paste0("parentesco", 1:12),
                   paste0("instlevel", 1:9), "mobilephone", "qmobilephone",
                   "age")
    
    # Household-level variables: every remaining column (includes idhogar and Target)
    hh_vars_train <- names(df_train)[!(names(df_train) %in% resp_vars)]
    hh_vars_test <- hh_vars_train[hh_vars_train != "Target"]
    
    # Number of unique respondents in the training set
    length(unique(df_train$Id))
    #> [1] 9557
    # Number of unique households in the training set
    length(unique(df_train$idhogar))
    #> [1] 2988

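One possible way to reshape the data for a household-level analysis, as Question 2 asks, is sketched below on a made-up toy data frame. It takes household-level columns from the head-of-household row (parentesco1 equal to 1) and averages respondent-level columns within each idhogar. The summary column names are hypothetical, and the sketch assumes every household has exactly one head, which should be checked on the real data.

```r
# Toy respondent-level rows (all values are made up)
toy <- data.frame(
  Id          = c("a", "b", "c", "d", "e"),
  idhogar     = c("h1", "h1", "h1", "h2", "h2"),
  parentesco1 = c(1, 0, 0, 0, 1),
  age         = c(40, 38, 10, 15, 52),
  escolari    = c(11, 9, 3, 8, 6),
  rooms       = c(4, 4, 4, 3, 3),
  Target      = c(2, 2, 2, 4, 4)
)

# Household-level columns: read them off the head-of-household row
heads <- toy[toy$parentesco1 == 1, c("idhogar", "rooms", "Target")]

# Respondent-level columns: summarise within each household
# (the formula interface of aggregate() drops rows with missing values)
resp_summary <- aggregate(cbind(age, escolari) ~ idhogar, data = toy, FUN = mean)
names(resp_summary)[-1] <- c("mean_age", "mean_escolari")

# One row per household
hh <- merge(heads, resp_summary, by = "idhogar")
hh
```

Other summaries (maximum schooling, share of members with a disability) can be added the same way; which ones are useful is part of the modelling question.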
  3. This dataset has many missing values. You will have to understand what these missing values mean: are they missing at random, or are they missing because the question does not apply? For example, compare v2a1 (monthly rent payment) with the tipovivi# (living arrangement) variables. The example code below plots the share of missing values in each variable, overall and within each Target (poverty level) category. Can you make the same plot with tipovivi#? (Hint: you have to create a new factor variable indicating the type of living arrangement, i.e., tipovivi1, …, tipovivi5).

    library(dplyr)
    library(tidyr)
    library(ggplot2)
    
    df <- df_train
    
    # Run this line below if you want to analyze all the variables
    vars <- colnames(df)[!(colnames(df)%in% c("Target"))]
    
    # Uncomment and run this line to analyze only the rent and living arrangement-related variables
    # vars = c("v2a1", paste0("tipovivi", 1:5))
    
    df[vars] <- lapply(df[vars], as.character)
    
    # all_of() selects the columns named in the character vector `vars`
    missings_df <- pivot_longer(df, all_of(vars), names_to = "indicator") |>
      mutate(missing = ifelse(is.na(value), 1, 0))
    
    missings_grouped <- missings_df |>
      group_by(indicator) |>
      summarize(count = n(),
                percent_missing = sum(missing) / count) |>
      filter(percent_missing > 0.01) # Threshold: at least 1% of missing cases
    
    missings_grouped_target <- missings_df |>
      group_by(Target, indicator) |>
      summarize(count = n(),
                percent_missing = sum(missing)/count,
                .groups = "drop") |>  # return an ungrouped data frame
      filter(indicator %in% missings_grouped$indicator)
    
    p1 <- ggplot(missings_grouped,
                 aes(x = indicator, y = 100*percent_missing, fill = 100*percent_missing)) +
      geom_bar(stat = "identity", aes(color = I("black"))) +
      geom_text(aes(label = paste0(round(100*percent_missing, 1), "%")),
                hjust = -0.1,  # shift labels slightly to the right of the bar
                size = 4) +
      ylim(0, 100) +
      ylab("% missing") +
      scale_fill_gradient(low = "blue", high = "orange") +
      ggtitle("Missing data (%)") +
      theme_bw() +
      guides(fill = guide_legend("% missing")) +
      theme(legend.position = "left") +
      theme(axis.text.y = element_blank(),
            axis.title = element_text(size = 12, face = "bold")) +
      xlab("") +
      coord_flip() +
      theme(legend.position = "none")
    
    p2 <- ggplot(missings_grouped_target,
                 aes(x = indicator, y = Target, fill= 100*percent_missing)) +
      geom_tile() +
      geom_text(aes(label = paste0(round(100*percent_missing, 1), "%")),
                color = "white", size = 4) +   # label inside each tile
      ylab("Target\n(1 = extreme poverty, 4 = non-vulnerable)") +
      ggtitle("Missing data, by Target group") +
      theme_bw() +
      guides(fill = guide_legend("% missing")) +
      scale_fill_gradient(low = "blue", high = "orange") +
      theme(axis.text = element_text(size = 12),
            axis.text.y = element_text(hjust = 0),
            axis.title = element_text(size = 12,face = "bold")) +
      xlab("") +
      coord_flip()
    
    gridExtra::grid.arrange(p1, p2, ncol=2, widths = c(4,7))
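Before drawing the tipovivi# version of the plot suggested in Question 3, a quick tabulation can show whether missing rent values line up with the living arrangement. The sketch below uses base R on a made-up toy data frame, so its numbers say nothing about the real survey. The column name `tenure` is a hypothetical choice.

```r
# Toy rows (all values are made up)
toy <- data.frame(
  v2a1      = c(NA, NA, 80000, 120000, 90000, NA, NA),
  tipovivi1 = c(1, 1, 0, 0, 0, 0, 0),
  tipovivi2 = c(0, 0, 1, 0, 0, 0, 0),
  tipovivi3 = c(0, 0, 0, 1, 1, 0, 0),
  tipovivi4 = c(0, 0, 0, 0, 0, 1, 0),
  tipovivi5 = c(0, 0, 0, 0, 0, 0, 1)
)

# Build one factor from the five dummies (assumes one dummy equals 1 per row)
tenure_cols <- paste0("tipovivi", 1:5)
toy$tenure  <- factor(
  tenure_cols[max.col(as.matrix(toy[tenure_cols]), ties.method = "first")],
  levels = tenure_cols
)

# Percentage of missing rent values within each living arrangement
pct_missing <- 100 * tapply(is.na(toy$v2a1), toy$tenure, mean)
pct_missing
```

If missingness in v2a1 were concentrated in particular tipovivi# groups in the real data, that would point to "question does not apply" rather than to values missing at random.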

  4. Suppose you want to predict the categorical outcome Target. What variables are the best predictors?
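As one possible starting point for Question 4, the sketch below fits a multinomial logistic regression with `nnet::multinom` and checks it on held-out rows. This is only one of many suitable models, not a recommended answer. The data are simulated: the two predictors borrow the names meaneduc and overcrowding from the variable table, but their values and the coefficients that generate Target are invented for illustration.

```r
library(nnet)  # recommended package, included in standard R installations

set.seed(1)
n <- 400

# Simulated predictors (values are invented, not drawn from the survey)
sim <- data.frame(
  meaneduc     = runif(n, 0, 16),
  overcrowding = runif(n, 0.5, 5)
)

# Invented latent score, cut into four equally sized classes
score <- 0.3 * sim$meaneduc - 0.8 * sim$overcrowding + rnorm(n)
sim$Target <- cut(score, breaks = quantile(score, 0:4 / 4),
                  labels = 1:4, include.lowest = TRUE)

# Hold out 100 rows for evaluation
train_idx <- sample(n, 300)
fit <- multinom(Target ~ meaneduc + overcrowding,
                data = sim[train_idx, ], trace = FALSE)

# Confusion matrix and accuracy on the held-out rows
pred      <- predict(fit, newdata = sim[-train_idx, ], type = "class")
confusion <- table(actual = sim$Target[-train_idx], predicted = pred)
accuracy  <- sum(diag(confusion)) / sum(confusion)

confusion
accuracy
```

With the real data, rows from the same household share a Target value, so splitting by idhogar rather than by row avoids leaking information between the training and evaluation sets.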

  5. Suppose you want to conduct an unsupervised analysis of the test data with \(k=4\) clusters. What type of algorithm could you use? Do these clusters align with the predicted Target groups from your model in Question 4?
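For Question 5, k-means is one algorithm that accepts a fixed number of clusters. The sketch below runs `stats::kmeans` with four centers on simulated, standardized features and cross-tabulates the clusters against predicted labels. Both the feature values and the `pred_target` vector are placeholders: in practice `pred_target` would hold the predictions from your own model, and here it is filled with random labels only so the code runs.

```r
set.seed(42)
n <- 200

# Simulated numeric features (values are invented)
sim <- data.frame(
  meaneduc     = runif(n, 0, 16),
  overcrowding = runif(n, 0.5, 5),
  rooms        = sample(1:8, n, replace = TRUE)
)

# Standardize so that no feature dominates the Euclidean distance
X <- scale(sim)

# k-means with k = 4; several random starts reduce the risk of a poor local optimum
km <- kmeans(X, centers = 4, nstart = 25)

# Placeholder for model predictions (random labels, for illustration only)
pred_target <- factor(sample(1:4, n, replace = TRUE), levels = 1:4)

# Cluster numbers are arbitrary, so look for rows dominated by one column
# rather than for agreement along the diagonal
alignment <- table(cluster = km$cluster, predicted = pred_target)
alignment
```

k-means needs numeric inputs, so the many binary and categorical columns in this dataset may call for a different distance or algorithm; that choice is part of the question.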

  6. What other types of questions could you ask about the data? How might this be useful for public policy and social programs?

References

Fabián Sánchez, Gary Soto, Julia Elliott, Luis Tejerina, and Phil Culliton. Costa Rican Household Poverty Level Prediction. https://kaggle.com/competitions/costa-rican-household-poverty-prediction, 2018. Kaggle. Redistributed with permission from IDB.