How to: Handling Missing Values in R by advanced R Dataframes Subsetting

Missing values is one of the most common issues in data-sets. Data scientists, during data wrangling phase, spend significant amount of time in handling the missing values. Though there are a number of approaches to handle the missing values, including imputation techniques, but in many scenarios, it just makes sense to discard those columns from dataframe with high number of missing values. Specifically, when your data-frame has a lot of columns, this issue becomes even more critical as it requires profiling each column to find whether the number of missing values in each column is lesser than a threshold can be quite time consuming. In this post, we are going to demonstrate how you can achieve this task proficiently and quickly.

Subsetting R Dataframe to filter columns with less number of missing values:

Let’s consider airquality dataframe that comes with R. The first task would be to see how many missing columns are there in each column. It can be done by creating a temporary dataframe which contains number of missing values in each column as:

> s1<-sort(sapply(airquality,function(x)sum(is.na(x)))) 

> s1   

 Wind    Temp   Month     Day    Solar.R   Ozone       

   0      0       0        0       7         37    

 

We’ve made use of apply function and passed airquality dataframe as argument. It operated on each column of the dataframe and applied the function specified in its second argument to find the number of missing values.

 

It shows that Solar.R column has 7 missing values and Ozone column has 37!

Lets  say that you have formulated a threshold that if a column has missing values > 10, you will just discard it. Here’s how you can do it:

> feat_almost_full<-names(s1)[s1<10] 

 

we retrieved names of those columns which have missing values less than 10

> feat_almost_full 

[1] "Wind"    "Temp"    "Month"   "Day"     "Solar.R"

as you can see that in this vector, Ozone, the faulty column, has been removed. Now if we subset our dataframe:

airquality_subset<-airquality[,feat_almost_full]

we just get those columns in our main dataframe which have missing values less than our threshold i.e. 10 in this case.

Hope that you will find this technique quite handy in your data analysis tasks. Do check out my R Programming course on Udemy which covers R programming constructs with comprehensive depth.

Stay connected.