Problem description:

I am looking to parse a dataset and match it up with a phylogenetic tree I have already made in R. To do that, I need to simplify the tip labels so that they match the labels in my tree.

For instance, I want to go from "gi|399148998|gb|JN638572|" to just "JN638572" (the accession number), and I need to do this 61 times (61 samples). Each of the accession numbers starts at the same position as well.

    ## thanks for the data serban
    set.seed(1)
    mydat <- replicate(61, paste0(paste0(sample(letters,2), collapse=""),"|",
                                  round(runif(1,1e8,1e9-1)),"|",
                                  paste0(sample(letters,2), collapse=""),"|",
                                  paste0(sample(LETTERS,2), collapse=""),
                                  round(runif(1,1e6,1e7-1)),"|"))
    head(mydat)
    # [1] "gj|615568026|xf|XZ6947179|" "qb|285377117|er|JT5479293|" "sy|442031661|ux|FQ2129996|"
    # [4] "gj|112051300|jv|IM6396092|" "me|844635986|rt|CS4701469|" "vq|804639485|on|UA5295070|"

Answer:

I would recommend against using for loops in R when you can avoid them. R can perform whole-vector operations, so one call can process all 61 labels at once. For your particular instance, this ought to do it:

    library(stringr)
    # Generate some data:
    mydat <- replicate(61, paste0(paste0(sample(letters,2), collapse=""),"|",
                                  round(runif(1,1e8,1e9-1)),"|",
                                  paste0(sample(letters,2), collapse=""),"|",
                                  paste0(sample(LETTERS,2), collapse=""),
                                  round(runif(1,1e6,1e7-1)),"|"))
    head(mydat)
    # [1] "pg|451576916|kj|FV9562908|" "dt|707843618|sj|KZ3658708|"
    #     "lb|507989738|lc|ML2309736|" "nb|448725577|fo|DW1950100|"
    # [5] "iv|337265231|us|CR5163970|" "ew|254260770|rw|LB2404167|"
    # The part you actually need:
    results <- str_match(mydat, ".{2}\\|.*\\|.{2}\\|(.*)\\|")[,2]
    # Results:
    head(results)
    # [1] "FV9562908" "KZ3658708" "ML2309736" "DW1950100" "CR5163970" "LB2404167"

I am using regex, which stands for regular expressions. The shorter pattern ".*\\|(.*)\\|" would work just as well, because the * quantifier is "greedy", but I've made the pattern more explicit than it needs to be so that it is easier to explain. .{Nr} matches exactly Nr characters of any kind, and .* matches as many characters as possible while still letting the rest of the pattern, starting with \\|, match. Neither is returned, because both sit outside the parentheses. The | is a special character in regex (it means "or") and has to be "escaped" with a backslash so that the regex processor takes it literally; inside an R string the backslash itself must be doubled, hence \\|. The parentheses are the "capture group", i.e. what you want returned.
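To make the point about the two patterns concrete, here is a minimal sketch that runs both over two labels from the question and checks that they extract the same accession numbers. It uses base R's sub() rather than stringr so it runs without extra packages; the variable names (labels, verbose_pattern, short_pattern, and so on) are made up for illustration.

```r
# Two labels in the question's format (both appear in the question itself).
labels <- c("gi|399148998|gb|JN638572|", "gj|615568026|xf|XZ6947179|")

# The explicit pattern from the answer, and the shorter greedy one.
verbose_pattern <- ".{2}\\|.*\\|.{2}\\|(.*)\\|"
short_pattern   <- ".*\\|(.*)\\|"

# sub() replaces the matched text with the first capture group ("\\1").
# Both patterns match the whole label, so only the accession is left.
acc_verbose <- sub(verbose_pattern, "\\1", labels)
acc_short   <- sub(short_pattern, "\\1", labels)

print(acc_verbose)
print(identical(acc_verbose, acc_short))
```

The shorter pattern works because the leading .* consumes as much as it can and only gives back enough for the final "|...|" to match, which leaves the last field in the capture group.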

str_match is a function in the stringr package (which you may have to install with install.packages("stringr")). It returns a character matrix with one row per input string: the first column holds the whole match, and each following column holds one capture group, so the second column is the first capture group. A string that does not match gives a row of NA. I'm keeping only the second column by using the [,2] notation.
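If stringr is not available, a similar matrix can be built in base R. The following is only a rough analogue of what str_match returns, not its implementation: regexec() and regmatches() are base R functions, while the names labels, pieces, rows, match_matrix and accessions are hypothetical, and the third label is an invented non-matching input included to show the NA row.

```r
# Two real-looking labels plus one string that cannot match the pattern.
labels <- c("gi|399148998|gb|JN638572|", "gj|615568026|xf|XZ6947179|", "no pipes here")
pattern <- ".*\\|(.*)\\|"

# regexec() locates the full match and each capture group;
# regmatches() extracts them as one character vector per input string.
pieces <- regmatches(labels, regexec(pattern, labels))

# A non-matching input yields character(0); pad it with NA so that
# every row has two columns (whole match, first capture group).
rows <- lapply(pieces, function(p) {
  if (length(p) == 0) c(NA_character_, NA_character_) else p
})
match_matrix <- do.call(rbind, rows)

# Column 1 is the whole match, column 2 the first capture group.
accessions <- match_matrix[, 2]
print(match_matrix)
print(accessions)
```

Keeping the NA rows means the result stays the same length as the input, so it can be assigned straight back onto the tip labels without misaligning them.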
