问题描述:

I would like to tidy a panel data excluding all observed IDs that do not have valid observations throughout all periods.

Currently I have a panel df like:

dt <- data.frame(ID1=c(rep(520020,5),rep(520030,3),rep(520040,5),rep(520050,2),rep(520060,5)),

ID2=c(rep(11,5),rep(12,3),14,15,rep(13,2),17,16,16,rep(14,4),20),

t=c(rep(2014:2010,4)),

var1=c(runif(n = 5, min = 1, max = 10),NA,NA,NA,runif(n = 12, min = 1, max = 10)),

var2=c(runif(n = 17, min = 1, max = 10),NA,runif(n = 2, min = 1, max = 10)))

Now would like to generate a df that contains only complete observations of 5 years with the same ID1 and ID2 without any missing values.

In this example this would only apply to the subgroup ID2=11 within the group ID1=520020.

Does anyone have a solution that does not make me do that manually?

网友答案:

For this, I have looped through every unique pair of ID1 and ID2. If any NA values were found, or a unique pair of ID's did not have enough rows, or years other than the one in question were present, the rows were removed.

Keep in mind that this code assumes the "year" or "t" field does not have duplicates for any ID pair.

startYear <- 2010
endYear <- 2014

for( i in unique(dt$ID1)){
  for( j in unique(dt$ID2[dt$ID1 == i])){

    hasNA <- any(is.na(dt[dt$ID1 == i & dt$ID2 == j, ]))

    if(nrow(dt[dt$ID1 == i & dt$ID2 == j, ]) != (endYear - startYear + 1)){
      isIncomplete <- TRUE
    } else if ( !all(dt$t[dt$ID1 == i & dt$ID2 == j] %in% startYear:endYear) ) {
      isIncomplete <- TRUE 
    } else {
      isIncomplete <- FALSE
    }

    if( hasNA | isIncomplete ){
    dt <- dt[dt$ID1 != i | dt$ID2 != j, ]
    }
  }
}
相关阅读:
Top