Problem description:

How can I split a data set into training and testing sets whilst ensuring that every column has a value in both?

My idea was to do something like this:

data(iris)

random.sample = function(df){
  repeat {
    # assign each row to subset 1 (~80%) or subset 2 (~20%)
    ind = sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
    # smallest column sum in each subset (df must be all numeric)
    df1 = df[ind == 1, , drop = FALSE]
    b = min(colSums(df1))
    df2 = df[ind == 2, , drop = FALSE]
    c = min(colSums(df2))
    # check for success: every column needs a value in both subsets
    check = min(b, c)
    if (check > 0.01) break
  }
  ind
} # this function makes sure that every trait has a value (could change this to be count = n)

You can check the result using:

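library(magrittr) # provides %>%

tester_1 = function(df){
  ind = random.sample(df)
  data = data.frame(df[ind == 2,])
  # column sums of the ~20% subset
  data.frame(colSums(data))
}

# colSums() needs numeric columns, so drop Species
df = iris[, 1:4]

tester_1(df)

b = replicate(20, tester_1(df))
c = do.call(cbind, b) %>% as.data.frame
str(c)

# smallest column sum in each of the 20 replicates
d <- apply(c, 2, min)
table(d)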

I know this only checks one half of the split, but it already produced errors, which suggests that something is wrong with my original code, probably in random.sample.

Any help greatly appreciated.

I have tagged random forest here because this was a problem when looping through several training data.frames (I wanted to see whether I had randomly chosen a poor test data set, by randomising the split and comparing the OOB and predicted accuracy!)

Perhaps there is also a more elegant solution where one can control, column by column, where the random samples come from. For example, I might want to save 20% of the rows for training while keeping that subset 'representative' along the columns, so that it contains ~20% of the values in each column.
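
One possible reading of the column-wise idea above can be sketched as rejection sampling: draw a random subset of rows, measure what share of each column's non-zero values it contains, and redraw until every share is close to the target. This is only a sketch under assumptions that are not in the post: the helper name representative.sample and its parameters frac, tol and max.tries are hypothetical, "has a value" is taken to mean non-zero, and the data is assumed to be numeric with no NAs.

```r
# Hypothetical helper (not from the original post).
# Draws random row subsets until every column has roughly `frac` of its
# non-zero values inside the subset, within an absolute tolerance `tol`.
representative.sample = function(df, frac = 0.2, tol = 0.1, max.tries = 1000) {
  nonzero = as.matrix(df) != 0        # TRUE where a cell holds a value
  totals = colSums(nonzero)           # number of non-zero cells per column
  stopifnot(all(totals > 0))          # an all-zero column can never be split
  n.sub = round(frac * nrow(df))      # size of the subset in rows
  for (i in seq_len(max.tries)) {
    sub = sample(nrow(df), n.sub)     # candidate row indices
    share = colSums(nonzero[sub, , drop = FALSE]) / totals
    if (all(abs(share - frac) <= tol)) return(sub)
  }
  stop("no acceptable split found; try a larger tol or max.tries")
}

set.seed(1)
df = iris[, 1:4]                      # numeric columns only
sub = representative.sample(df)
small = df[sub, ]                     # ~20% of the rows
large = df[-sub, ]                    # the remaining rows
```

With a tight tol and very sparse columns the loop may never find a split, which is why the sketch gives up after max.tries attempts instead of repeating forever.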
