Problem description:

I want to parse a dataset and match it up with a tree I have already made in R. To do that, I am trying to simplify the tip labels so that they match my phylogenetic tree.

For instance, I want to go from "gi|399148998|gb|JN638572|" to just "JN638572" (the accession number), and I need to do this 61 times (61 samples). Each accession number starts at the same position as well.

    ## thanks for the data serban
    set.seed(1)
    mydat <- replicate(61, paste0(paste0(sample(letters,2), collapse=""),"|",
                                  round(runif(1,1e8,1e9-1)),"|",
                                  paste0(sample(letters,2), collapse=""),"|",
                                  paste0(sample(LETTERS,2), collapse=""),
                                  round(runif(1,1e6,1e7-1)),"|"))
    head(mydat)
    # [1] "gj|615568026|xf|XZ6947179|" "qb|285377117|er|JT5479293|" "sy|442031661|ux|FQ2129996|"
    # [4] "gj|112051300|jv|IM6396092|" "me|844635986|rt|CS4701469|" "vq|804639485|on|UA5295070|"

I would recommend against using `for` loops in R when you can avoid them, because R can perform whole-vector operations. For your particular case, this ought to do it:

```
library(stringr)
# Generate some sample data (no seed is set, so your values will differ):
mydat <- replicate(61, paste0(paste0(sample(letters,2), collapse=""),"|",
                              round(runif(1,1e8,1e9-1)),"|",
                              paste0(sample(letters,2), collapse=""),"|",
                              paste0(sample(LETTERS,2), collapse=""),
                              round(runif(1,1e6,1e7-1)),"|"))
head(mydat)
# [1] "pg|451576916|kj|FV9562908|" "dt|707843618|sj|KZ3658708|" "lb|507989738|lc|ML2309736|" "nb|448725577|fo|DW1950100|"
# [5] "iv|337265231|us|CR5163970|" "ew|254260770|rw|LB2404167|"
# The part you actually need: keep only the capture group (column 2)
results <- str_match(mydat, ".{2}\\|.*\\|.{2}\\|(.*)\\|")[,2]
# Results:
head(results)
# [1] "FV9562908" "KZ3658708" "ML2309736" "DW1950100" "CR5163970" "LB2404167"
```

I am using regex, which stands for regular expressions. It would work with just `".*\\|(.*)\\|"`, because the `*` quantifier is "greedy", but I've made the pattern needlessly complicated to make it easier to explain. `.{Nr}` matches (and so skips over) exactly `Nr` characters of any kind, and `.*` matches as many characters as it takes to reach the next part of the pattern, namely `\\|`. The `|` is a special character (it normally means "or") and has to be "escaped" with `\\` so that the regex processor takes it literally instead. The parentheses form the "capture group", i.e. the part you want returned.
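To see how the pieces of the pattern behave on a single label, here is a minimal sketch. It uses base R's `sub()` and `grepl()` rather than `str_match()` (my substitution, so that it runs without extra packages), and the variable names `label`, `full` and `short` are only illustrative.

```r
# Sample label taken from the question
label <- "gi|399148998|gb|JN638572|"

# Full pattern from the answer; "\\1" keeps only the capture group
full <- sub(".{2}\\|.*\\|.{2}\\|(.*)\\|", "\\1", label)

# Shorter pattern: the greedy first .* runs up to the second-to-last pipe,
# leaving only the accession number for the capture group
short <- sub(".*\\|(.*)\\|", "\\1", label)

# An unescaped | means "or"; an escaped \\| is a literal pipe
alternation <- grepl("a|b", "b")    # "a" or "b"
literal     <- grepl("a\\|b", "b")  # the literal text "a|b"

print(c(full, short))
print(c(alternation, literal))
```

Both patterns should give the same accession number here, which is why the shorter one is enough for labels shaped like the ones in the question.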

`str_match` is a function in the `stringr` package (which you may have to install with `install.packages("stringr")`). It returns a character matrix: the first column holds the whole match, if one is found, and the next column holds the first capture group. I'm returning only the second column by using the `[,2]` notation.
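As a rough illustration of that return value, the sketch below runs `str_match` on two labels and inspects the matrix before taking column 2. The second label is made up to show what happens when nothing matches, and the names `labels`, `m` and `accessions` are only illustrative.

```r
library(stringr)

# First label is from the question; the second is a made-up non-matching case
labels <- c("gi|399148998|gb|JN638572|", "no pipes here")

m <- str_match(labels, ".*\\|(.*)\\|")

# m is a character matrix with one row per input string:
#   column 1 = the whole match, column 2 = the first capture group
# Rows without a match are filled with NA
print(m)

# Keep only the capture group
accessions <- m[, 2]
print(accessions)
```

If the tree is an `ape` `phylo` object, the tip labels would normally live in `tree$tip.label`, so the same call could be applied to that vector; that is an assumption about how the tree was built, since the question does not say.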