Question:

I am very weak at regex, and the regex I am using (found on the internet) only partially solves my problem. I need to wrap a URL found in text input in an anchor tag, using Java. Here is my code:

String text = "Hi please visit www.google.com";

String reg = "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^[:punct:]\\s]|/)))";

String s = text.replaceAll(reg, "<a href='$1'>$1</a>");

System.out.println(s);

The output currently is Hi please visit <a href='www.google.c'>www.google.c</a>om. What's wrong with the regex?

I need to parse text entered in a text field and display any URL in it as a hot link in a JSP page. The expected output is:

Hi please visit <a href='www.google.com'>www.google.com</a>

Edit

The following regex

(http(s)?://)?(www(\.\w+)+[^\s.,"']*)

works like a charm for URLs ending in .com but fails for other extensions such as .jsp. Is there any way to make it work for all sorts of extensions?

Answer:

Java recognises POSIX character classes (see the javadoc for java.util.regex.Pattern), but the syntax is a little different. Instead of [:punct:] it looks like this:

\p{Punct}
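As a rough sketch of that substitution, here is the regex from the question with only [:punct:] swapped for \p{Punct}. The class name PunctClassDemo and the method addAnchors are made up for this illustration, and the pattern is otherwise left exactly as posted, so it keeps the question's other quirks.

```java
import java.util.regex.Pattern;

final class PunctClassDemo {
    // The question's pattern, with [:punct:] replaced by Java's \p{Punct}.
    // Nothing else is changed (hypothetical helper, for illustration only).
    private static final Pattern URL = Pattern.compile(
        "\\b(([\\w-]+://?|www[.])[^\\s()<>]+(?:\\([\\w\\d]+\\)|([^\\p{Punct}\\s]|/)))");

    // Wraps every match in an anchor tag, as the question's replaceAll call does.
    static String addAnchors(String text) {
        return URL.matcher(text).replaceAll("<a href='$1'>$1</a>");
    }

    public static void main(String[] args) {
        // prints: Hi please visit <a href='www.google.com'>www.google.com</a>
        System.out.println(addAnchors("Hi please visit www.google.com"));
    }
}
```

With the negated class now meaning "any character that is neither punctuation nor whitespace", the final m of com is accepted and the whole host name ends up inside the link.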

But I would simplify your regex for a URL to:

(?i)(http(s)?://)?((www(\.\w+)+|(\d{1,3}\.){3}\d{1,3})[^\s,"']*(?<!\.))

And elaborate it only if you find a test case that breaks it.

As a line of Java, it would be:

text = text.replaceAll("(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))", "<a href=\"http$2://$3\">$3</a>");

Note the neat capture of the "s" in "https" (if found) that is restored if required.
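To check how this behaves on a few inputs, including the .jsp case from the question's edit, here is a small self-contained sketch. The class name SimpleLinkifier, the method linkify, and the sample strings are invented for the example. It assumes the pattern as written in the Java line above.

```java
import java.util.regex.Pattern;

final class SimpleLinkifier {
    // Optional http/https prefix, then either www.something or a dotted IPv4
    // address, then everything up to whitespace, a comma or a quote.
    // The lookbehind keeps a trailing dot out of the match.
    private static final Pattern URL = Pattern.compile(
        "(?i)(http(s)?://)?((www(\\.\\w+)+|(\\d{1,3}\\.){3}\\d{1,3})[^\\s,\"']*(?<!\\.))");

    // $2 is the optional "s" of https; it expands to nothing when absent.
    static String linkify(String text) {
        return URL.matcher(text).replaceAll("<a href=\"http$2://$3\">$3</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("Hi please visit www.google.com"));
        System.out.println(linkify("See https://www.example.com/index.jsp, thanks"));
        System.out.println(linkify("Go to www.google.com."));
    }
}
```

The second sample shows that the extension does not matter, because the path is consumed by the [^\s,"']* part rather than by the host name part. The third shows the sentence-ending dot being left outside the link.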

Answer:

To answer your question of why the regex doesn't work: it doesn't follow Java's regex syntax rules.

Specifically:

[^[:punct:]\s]

doesn't work as you expect because Java doesn't recognize POSIX bracket expressions like [:punct:]. Instead, it treats that as a nested character class. That in turn makes the ^ illegal in that context, so Java ignores it, leaving you with a character class that matches the same as

[:punct\s]

which matches only the c of com, so your match ends there.

As for your question of how to find URLs in a block of text, I suggest you read Jan Goyvaerts' excellent blog entry "Detecting URLs in a block of text". You'll need to decide for yourself how sensitive and how specific you want your regex to be.

For example, the solution proposed at the end of the post would translate to Java as

String resultString = subjectString.replaceAll(
    "(?imx)\\b(?:(?:https?|ftp|file)://|www\\.|ftp\\.)\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    "      [-A-Z0-9+&@\\#/%=~_|$?!:,.])*\n" +
    "(?:\\([-A-Z0-9+&@\\#/%=~_|$?!:,.]*\\)|\n" +
    "      [A-Z0-9+&@\\#/%=~_|$])", "<a href=\"$0\">$0</a>");