Costa Rican household poverty level

EDA

data cleaning

surveys

linear regression

multivariate analysis

logistic regression

categorical data

clustering

This dataset is useful for public policy and international development. Using data from a household survey in Costa Rica, you can predict a categorical outcome (level of household poverty) from observable household characteristics. Or, you can come up with your own questions, such as predicting whether a household is headed by a woman based on household characteristics.

Author

Selina Carter

Published

July 21, 2025

Data files

• costa_rica_train.csv
• costa_rica_test.csv

Data year

2018

Motivation

The Inter-American Development Bank (IDB), the largest source of development financing for Latin America and the Caribbean, is an inter-governmental institution concerned with improving the quality of life across the region. One key challenge is to identify the families that are most in need of assistance from social programs. However, the world’s poorest typically can’t provide the necessary income and expense records to prove that they qualify. In Latin America, one popular method to estimate income qualification is the Proxy Means Test (PMT). With PMT, assistance-providing organizations use a model that considers a family’s observable household attributes—like the material of their walls and ceiling, or the assets found in the home—to classify their level of need.

While this is an improvement, accuracy remains a problem as Latin America’s population grows.

To improve on PMT, the IDB offered this dataset to Kaggle in hopes of garnering new tools and methods for predicting poverty. The hope is that new methods beyond “traditional econometrics” might help improve upon PMT’s performance.

This particular dataset is from a household survey in Costa Rica; personally identifiable information (PII) has been removed.

Data

One row represents one respondent (a person) in our data sample, identified by the Id variable. Multiple people can be part of a single household (identified by the idhogar variable).

In the train dataset, there are 9,557 rows and 134 columns.
In the test dataset, there are 23,856 rows and 133 columns (it’s missing the Target column).

Variable descriptions

Variable	Description	Level of analysis
Id	respondent-level unique identifier	respondent
Target	(only in `costa_rica_train.csv`) ordinal variable indicating groups of income levels:	household
	1 = extreme poverty
	2 = moderate poverty
	3 = vulnerable households
	4 = non vulnerable households
v2a1	monthly rent payment (in Costa Rican colones; for some year \(\le\) 2018)	household
hacdor	=1 if overcrowding by bedrooms	household
rooms	number of all rooms in the house	household
hacapo	=1 if overcrowding by rooms	household
v14a	=1 if has bathroom in the household	household
refrig	=1 if the household has refrigerator	household
v18q	=1 if owns a tablet	respondent
v18q1	number of tablets household owns	household
r4h1	number of males younger than 12 years of age	household
r4h2	number of males 12 years of age and older	household
r4h3	total number of males in the household	household
r4m1	number of females younger than 12 years of age	household
r4m2	number of females 12 years of age and older	household
r4m3	total number of females in the household	household
r4t1	number of people younger than 12 years of age	household
r4t2	number of people 12 years of age and older	household
r4t3	total number of people in the household	household
tamhog	number of total individuals in the household	household
tamviv	number of people living in the household	household
escolari	years of schooling	respondent
rez_esc	years behind in school	respondent
hhsize	number of total individuals in the household	household
paredblolad	=1 if predominant material on the outside wall is block or brick	household
paredzocalo	=1 if predominant material on the outside wall is wood, metal/zinc, or asbestos	household
paredpreb	=1 if predominant material on the outside wall is prefabricated or cement	household
pareddes	=1 if predominant material on the outside wall is waste material	household
paredmad	=1 if predominant material on the outside wall is wood	household
paredzinc	=1 if predominant material on the outside wall is metal (zinc)	household
paredfibras	=1 if predominant material on the outside wall is natural fibers	household
paredother	=1 if predominant material on the outside wall is other	household
pisomoscer	=1 if predominant material on the floor is mosaic, ceramic, or terrace material	household
pisocemento	=1 if predominant material on the floor is cement	household
pisoother	=1 if predominant material on the floor is other	household
pisonatur	=1 if predominant material on the floor is natural material	household
pisonotiene	=1 if no floor at the household	household
pisomadera	=1 if predominant material on the floor is wood	household
techozinc	=1 if predominant material on the roof is metal foil or metal (zinc)	household
techoentrepiso	=1 if predominant material on the roof is fiber cement	household
techocane	=1 if predominant material on the roof is natural fibers	household
techootro	=1 if predominant material on the roof is other	household
cielorazo	=1 if the house has a ceiling (in addition to a roof)	household
abastaguadentro	=1 if water provision exists inside the dwelling	household
abastaguafuera	=1 if water provision exists outside the dwelling	household
abastaguano	=1 if no water provision	household
public	=1 if electricity from public company (CNFL, ICE, or ESPH/JASEC)	household
planpri	=1 if electricity from private plant	household
noelec	=1 if no electricity in the dwelling	household
coopele	=1 if electricity from cooperative	household
sanitario1	=1 if no toilet in the dwelling	household
sanitario2	=1 if toilet connected to sewer or cesspool	household
sanitario3	=1 if toilet connected to septic tank	household
sanitario5	=1 if toilet connected to black hole or latrine	household
sanitario6	=1 if toilet connected to other system	household
energcocinar1	=1 if no main source of energy used for cooking (no kitchen)	household
energcocinar2	=1 if electricity is main source of energy used for cooking	household
energcocinar3	=1 if gas is main source of energy used for cooking	household
energcocinar4	=1 if wood/charcoal is main source of energy used for cooking	household
elimbasu1	=1 if rubbish disposal mainly by tanker truck	household
elimbasu2	=1 if rubbish disposal mainly by burying	household
elimbasu3	=1 if rubbish disposal mainly by burning	household
elimbasu4	=1 if rubbish disposal mainly by dumping in an unoccupied space	household
elimbasu5	=1 if rubbish disposal mainly by dumping in river, creek, or sea	household
elimbasu6	=1 if rubbish disposal mainly other	household
epared1	=1 if walls are bad	household
epared2	=1 if walls are regular	household
epared3	=1 if walls are good	household
etecho1	=1 if roof is bad	household
etecho2	=1 if roof is regular	household
etecho3	=1 if roof is good	household
eviv1	=1 if floor is bad	household
eviv2	=1 if floor is regular	household
eviv3	=1 if floor is good	household
dis	=1 if person is disabled	respondent
male	=1 if male	respondent
female	=1 if female	respondent
estadocivil1	=1 if less than 10 years old	respondent
estadocivil2	=1 if in civil union	respondent
estadocivil3	=1 if married	respondent
estadocivil4	=1 if divorced	respondent
estadocivil5	=1 if separated	respondent
estadocivil6	=1 if widow/er	respondent
estadocivil7	=1 if single (never married)	respondent
parentesco1	=1 if household head	respondent
parentesco2	=1 if spouse/partner	respondent
parentesco3	=1 if son/daughter	respondent
parentesco4	=1 if stepson/daughter	respondent
parentesco5	=1 if son/daughter-in-law	respondent
parentesco6	=1 if grandson/daughter	respondent
parentesco7	=1 if mother/father	respondent
parentesco8	=1 if father/mother-in-law	respondent
parentesco9	=1 if brother/sister	respondent
parentesco10	=1 if brother/sister-in-law	respondent
parentesco11	=1 if other family member	respondent
parentesco12	=1 if other non-family member	respondent
idhogar	household-level identifier	household
hogar_nin	number of children 0 to 19 in household	household
hogar_adul	number of adults in household	household
hogar_mayor	number of individuals 65+ in the household	household
hogar_total	number of total individuals in the household	household
dependency	dependency rate = (number of members of the household younger than 19 or older than 64)/(number of members of the household between 19 and 64)	household
edjefe	years of education of male head of household	household
edjefa	years of education of female head of household	household
meaneduc	average years of education for adults (18+)	household
instlevel1	=1 if no education	respondent
instlevel2	=1 if incomplete primary	respondent
instlevel3	=1 if completed primary	respondent
instlevel4	=1 if incomplete academic secondary level	respondent
instlevel5	=1 if complete academic secondary level	respondent
instlevel6	=1 if incomplete technical secondary level	respondent
instlevel7	=1 if complete technical secondary level	respondent
instlevel8	=1 if undergraduate and higher education	respondent
instlevel9	=1 if postgraduate higher education	respondent
bedrooms	number of bedrooms	household
overcrowding	number of persons per room	household
tipovivi1	=1 if own and fully paid house	household
tipovivi2	=1 if own house and paying in installments	household
tipovivi3	=1 if rented	household
tipovivi4	=1 if precarious ownership of house	household
tipovivi5	=1 if other (such as assigned by government or borrowed)	household
computer	=1 if the household has notebook or desktop computer	household
television	=1 if the household has a TV	household
mobilephone	=1 if respondent has a mobile phone	respondent
qmobilephone	number of mobile phones respondent owns	respondent
lugar1	=1 if Central region	household
lugar2	=1 if Chorotega region	household
lugar3	=1 if Central Pacific region	household
lugar4	=1 if Brunca region	household
lugar5	=1 if Huetar Atlantic region	household
lugar6	=1 if North Huetar region	household
area1	=1 if urban area	household
area2	=1 if rural area	household
age	person’s age in years	respondent
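
The dependency definition in the table can be checked by hand from respondent ages. The following is a minimal base R sketch on invented toy data; the `toy` data frame and the `dependency_rate` helper are hypothetical and not part of the dataset files.

```r
# Toy respondent-level data (invented for illustration): two households
toy <- data.frame(
  idhogar = c("h1", "h1", "h1", "h1", "h2", "h2"),
  age     = c(35, 33, 8, 70, 67, 72)
)

# Hypothetical helper: dependents (younger than 19 or older than 64) divided
# by members aged 19 to 64; returns NA when no member is aged 19 to 64
dependency_rate <- function(age) {
  dependents  <- sum(age < 19 | age > 64)
  working_age <- sum(age >= 19 & age <= 64)
  if (working_age == 0) NA_real_ else dependents / working_age
}

# One value per household
dep <- tapply(toy$age, toy$idhogar, dependency_rate)
dep
```

How the published dependency column treats households with no members aged 19 to 64 is not stated here, so compare against the raw values before relying on a recomputed version.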

Questions

This dataset requires some cleaning. First, you may wish to change the variable names (e.g., rename v2a1 to rent_monthly). Some variables are read in as characters but they should be numeric. You may also want to combine binary variables into a single factor variable (e.g., combine tipovivi1, …, tipovivi5 into one factor variable). What else is necessary in your opinion?
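
The three cleaning steps mentioned here (renaming, converting characters to numbers, collapsing dummies into a factor) could look roughly like the base R sketch below. It runs on an invented toy data frame, and the factor labels in `tenure_labels` are assumptions chosen for readability, not names from the dataset.

```r
# Toy data (invented for illustration) mimicking a few raw columns
toy <- data.frame(
  v2a1      = c("100000", NA, "85000", NA),  # rent read in as character
  tipovivi1 = c(0, 1, 0, 0),
  tipovivi2 = c(0, 0, 0, 0),
  tipovivi3 = c(1, 0, 1, 0),
  tipovivi4 = c(0, 0, 0, 1),
  tipovivi5 = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# 1. Rename a cryptic column
names(toy)[names(toy) == "v2a1"] <- "rent_monthly"

# 2. Convert a character column to numeric
toy$rent_monthly <- as.numeric(toy$rent_monthly)

# 3. Collapse mutually exclusive dummies into one factor
tenure_cols   <- paste0("tipovivi", 1:5)
tenure_labels <- c("owned", "installments", "rented", "precarious", "other")
m   <- as.matrix(toy[tenure_cols])
idx <- max.col(m, ties.method = "first")  # column index of the 1 in each row
idx[rowSums(m) != 1] <- NA                # flag rows that are not one-hot
toy$tenure <- factor(tenure_labels[idx], levels = tenure_labels)

toy[c("rent_monthly", "tenure")]
```

Setting rows that are not exactly one-hot to NA is a design choice for this sketch; on the real data it is worth counting such rows before deciding how to handle them.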

The dataset has two levels of analysis: household-level and respondent-level. The type of cleaning you do may depend on the level of analysis you want to study. Are you predicting the Target variable? If so, how would you reshape the data? (Hint: you have Id for respondent-level analysis and idhogar for household-level analysis). For your convenience, below you will find code that identifies the level of analysis of each column.

df_train <- read.csv("../data/costa_rica_train.csv")
df_test <- read.csv("../data/costa_rica_test.csv")

# Respondent-level variables
resp_vars <- c("Id", "v18q", "escolari", "rez_esc", "male", "female",
               paste0("estadocivil", 1:7), paste0("parentesco", 1:12),
               paste0("instlevel", 1:9), "mobilephone", "qmobilephone",
               "age")

# Household-level variables (everything that is not respondent-level)
hh_vars_train <- names(df_train)[!(names(df_train) %in% resp_vars)]
hh_vars_test <- hh_vars_train[hh_vars_train != "Target"]

# Number of unique respondents in the training set
length(unique(df_train$Id))

[1] 9557

# Number of unique households in the training set
length(unique(df_train$idhogar))

[1] 2988
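
Because Target is defined at the household level, one possible reshaping is to build one row per household: take household-level columns from the head-of-household row and summarize respondent-level columns within each idhogar. The sketch below does this in base R on invented toy data; the summary column names `mean_age` and `mean_escolari` are hypothetical.

```r
# Toy respondent-level data (invented for illustration)
toy <- data.frame(
  Id          = paste0("ID_", 1:5),
  idhogar     = c("h1", "h1", "h1", "h2", "h2"),
  parentesco1 = c(1, 0, 0, 1, 0),   # 1 = household head
  age         = c(40, 38, 10, 65, 30),
  escolari    = c(11, 9, 4, 6, 12),
  rooms       = c(4, 4, 4, 3, 3),
  Target      = c(2, 2, 2, 4, 4)
)

# Household-level columns: take them from the head-of-household row
hh <- toy[toy$parentesco1 == 1, c("idhogar", "rooms", "Target")]

# Respondent-level columns: average them within each household
resp_summary <- aggregate(cbind(age, escolari) ~ idhogar, data = toy, FUN = mean)
names(resp_summary) <- c("idhogar", "mean_age", "mean_escolari")

# One row per household
hh <- merge(hh, resp_summary, by = "idhogar")
hh
```

This approach silently drops any household without a row where parentesco1 equals 1, so on the real data it is worth checking first that every idhogar has exactly one head.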

This dataset has many missing values, and you will need to understand what they mean: are they missing at random, or missing because the question doesn’t apply? For example, compare v2a1 (monthly rent payment) with the tipovivi# (living arrangement) variables. Below is example code for one way to examine missing values by Target (poverty level) category. Can you make the same plot with tipovivi#? (Hint: you have to create a new factor variable indicating the type of living arrangement, i.e., tipovivi1, …, tipovivi5.)

library(dplyr)
library(tidyr)
library(ggplot2)

df <- df_train

# Run this line below if you want to analyze all the variables
vars <- colnames(df)[!(colnames(df)%in% c("Target"))]

# Uncomment and run this line to analyze only the rent and living arrangement-related variables
# vars = c("v2a1", paste0("tipovivi", 1:5))

df[vars] <- lapply(df[vars], as.character)

# all_of() makes explicit that `vars` is a character vector of column names
missings_df <- pivot_longer(df, all_of(vars), names_to = "indicator") |>
  mutate(missing = ifelse(is.na(value), 1, 0))

missings_grouped <- missings_df |>
  group_by(indicator) |>
  summarize(count = n(),
            percent_missing = sum(missing) / count) |>
  filter(percent_missing > 0.01) # Threshold: at least 1% of missing cases

missings_grouped_target <- missings_df |>
  group_by(Target, indicator) |>
  summarize(count = n(),
            percent_missing = sum(missing) / count,
            .groups = "drop") |>  # drop grouping to avoid the regrouping message
  filter(indicator %in% missings_grouped$indicator)

p1 <- ggplot(missings_grouped,
             aes(x = indicator, y = 100*percent_missing, fill = 100*percent_missing)) +
  geom_bar(stat = "identity", aes(color = I("black"))) +
  geom_text(aes(label = paste0(round(100*percent_missing, 1), "%")),
            hjust = -0.1,  # shift labels slightly to the right of the bar
            size = 4) +
  ylim(0, 100) +
  ylab("% missing") +
  scale_fill_gradient(low = "blue", high = "orange") +
  ggtitle("Missing data (%)") +
  theme_bw() +
  theme(axis.text.y = element_blank(),
        axis.title = element_text(size = 12, face = "bold"),
        legend.position = "none") +
  xlab("") +
  coord_flip()

p2 <- ggplot(missings_grouped_target,
             aes(x = indicator, y = Target, fill= 100*percent_missing)) +
  geom_tile() +
  geom_text(aes(label = paste0(round(100*percent_missing, 1), "%")),
            color = "white", size = 4) +   # label inside each tile
  ylab("Target\n(1 = extreme poverty, 4 = non-vulnerable)") +
  ggtitle("Missing data, by Target group") +
  theme_bw() +
  guides(fill = guide_legend("% missing")) +
  scale_fill_gradient(low = "blue", high = "orange") +
  theme(axis.text = element_text(size = 12),
        axis.text.y = element_text(hjust = 0),
        axis.title = element_text(size = 12,face = "bold")) +
  xlab("") +
  coord_flip()

gridExtra::grid.arrange(p1, p2, ncol=2, widths = c(4,7))
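
Before plotting, the same comparison can be made as a small table: the share of missing rent values within each living arrangement. The base R sketch below uses invented toy values, so the pattern it prints is made up and says nothing about the real survey; the `tenure` factor stands in for the variable you would build from tipovivi1, …, tipovivi5.

```r
# Toy data (invented for illustration; NOT real survey values)
toy <- data.frame(
  v2a1   = c(NA, NA, 120000, 90000, NA, NA),
  tenure = factor(c("owned", "owned", "rented", "rented",
                    "precarious", "installments"))
)

# Share of rows with missing rent, within each living arrangement
share_missing <- tapply(is.na(toy$v2a1), toy$tenure, mean)
round(100 * share_missing, 1)
```

If missingness in v2a1 on the real data turns out to be concentrated in particular living arrangements, that would suggest the values are missing because the question does not apply rather than missing at random.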

Suppose you want to predict the categorical outcome Target. What variables are the best predictors?
Suppose you want to conduct an unsupervised analysis of the test data with \(k=4\) clusters. What type of algorithm could you use? Do these clusters align with the predicted Target groups from your model in the previous question?
What other types of questions could you ask about the data? How might this be useful for public policy and social programs?
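
As a starting point for the prediction and clustering questions above, the sketch below fits an ordinal logistic regression and runs k-means with four clusters, then cross-tabulates the two. It is only one possible approach, and it runs on simulated data: `x1` and `x2` are hypothetical stand-ins for cleaned predictors, and the simulated Target is not the survey variable. It assumes the MASS package, which ships with standard R installations.

```r
set.seed(1)
n <- 400

# Simulated stand-ins for two predictors (not real survey columns)
x1 <- rnorm(n)
x2 <- rnorm(n)

# Simulated four-level ordinal outcome built from a latent score
latent <- 1.5 * x1 - 1.0 * x2 + rlogis(n)
Target <- cut(latent, breaks = c(-Inf, -2, 0, 2, Inf),
              labels = 1:4, ordered_result = TRUE)
sim <- data.frame(Target, x1, x2)

# Supervised: ordinal (proportional-odds) logistic regression
fit  <- MASS::polr(Target ~ x1 + x2, data = sim, Hess = TRUE)
pred <- predict(fit, newdata = sim, type = "class")

# Unsupervised: k-means with k = 4 on standardized predictors
km <- kmeans(scale(sim[, c("x1", "x2")]), centers = 4, nstart = 20)

# Cross-tabulate cluster membership against predicted class
agreement <- table(cluster = km$cluster, predicted = pred)
agreement
```

Cluster numbers are arbitrary labels, so agreement shows up as each cluster concentrating in one predicted class rather than as a filled diagonal. Many columns in the real data are binary indicators, for which k-means on Euclidean distance may not be the most suitable choice.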

References

Fabián Sánchez, Gary Soto, Julia Elliott, Luis Tejerina, and Phil Culliton. Costa Rican Household Poverty Level Prediction. https://kaggle.com/competitions/costa-rican-household-poverty-prediction, 2018. Kaggle. Redistributed with permission from IDB.