Question:

Brackets are marked as START_PUNCTUATION or END_PUNCTUATION character types in Java.

Given, say, "[", how can I compute the matching "]" (without a hard-coded table)?

Answer:

Given the assumption that every START_PUNCTUATION character with an equivalent END_PUNCTUATION character has it within the next one or two code points (which seems to be true), this snippet lists the pair for every character where one can be found:

public class EndPunct {
    private static final int UNICODE_MAX = Character.MAX_CODE_POINT;

    public static void main(String[] args) {
        // Every int in [0, MAX_CODE_POINT] is a valid code point, so no validity check is needed.
        for (int i = 0; i <= UNICODE_MAX; i++) {
            if (Character.getType(i) == Character.START_PUNCTUATION) {
                Character.UnicodeBlock currentBlock = Character.UnicodeBlock.of(i);
                boolean found = false;
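                // Look only at the next two code points (i+1 and i+2).
                for (int newchar = i + 1; newchar < Math.min(UNICODE_MAX, i + 3); newchar++) {
                    // Stop at a block boundary; of() may return null, so call equals on currentBlock.
                    if (!currentBlock.equals(Character.UnicodeBlock.of(newchar))) {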
                        break;
                    }
                    if (Character.getType(newchar) == Character.END_PUNCTUATION) {
                        System.out.println(toChar(i) + " matches " + toChar(newchar)
                                + " (codepoints u+" + Integer.toHexString(i)
                                + " and u+" + Integer.toHexString(newchar) + ")");
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    System.out.println("NOT FOUND for " + toChar(i) + " [position u+" + Integer.toHexString(i) + "]");
                }
            }

        } 
    }
    public static String toChar(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}

Its output shows that the heuristic finds a match for every START_PUNCTUATION character except two:

( matches ) (codepoints u+28 and u+29)
[ matches ] (codepoints u+5b and u+5d)
{ matches } (codepoints u+7b and u+7d)
༺ matches ༻ (codepoints u+f3a and u+f3b)
༼ matches ༽ (codepoints u+f3c and u+f3d)
᚛ matches ᚜ (codepoints u+169b and u+169c)
NOT FOUND for ‚ [position u+201a]
NOT FOUND for „ [position u+201e]
⁅ matches ⁆ (codepoints u+2045 and u+2046)
⁽ matches ⁾ (codepoints u+207d and u+207e)
₍ matches ₎ (codepoints u+208d and u+208e)
〈 matches 〉 (codepoints u+2329 and u+232a)
❨ matches ❩ (codepoints u+2768 and u+2769)
❪ matches ❫ (codepoints u+276a and u+276b)
❬ matches ❭ (codepoints u+276c and u+276d)
❮ matches ❯ (codepoints u+276e and u+276f)
❰ matches ❱ (codepoints u+2770 and u+2771)
❲ matches ❳ (codepoints u+2772 and u+2773)
❴ matches ❵ (codepoints u+2774 and u+2775)
⟅ matches ⟆ (codepoints u+27c5 and u+27c6)
⟦ matches ⟧ (codepoints u+27e6 and u+27e7)
⟨ matches ⟩ (codepoints u+27e8 and u+27e9)
⟪ matches ⟫ (codepoints u+27ea and u+27eb)
⟬ matches ⟭ (codepoints u+27ec and u+27ed)
⟮ matches ⟯ (codepoints u+27ee and u+27ef)
⦃ matches ⦄ (codepoints u+2983 and u+2984)
⦅ matches ⦆ (codepoints u+2985 and u+2986)
⦇ matches ⦈ (codepoints u+2987 and u+2988)
⦉ matches ⦊ (codepoints u+2989 and u+298a)
⦋ matches ⦌ (codepoints u+298b and u+298c)
⦍ matches ⦎ (codepoints u+298d and u+298e)
⦏ matches ⦐ (codepoints u+298f and u+2990)
⦑ matches ⦒ (codepoints u+2991 and u+2992)
⦓ matches ⦔ (codepoints u+2993 and u+2994)
⦕ matches ⦖ (codepoints u+2995 and u+2996)
⦗ matches ⦘ (codepoints u+2997 and u+2998)
⧘ matches ⧙ (codepoints u+29d8 and u+29d9)
⧚ matches ⧛ (codepoints u+29da and u+29db)
⧼ matches ⧽ (codepoints u+29fc and u+29fd)
⸢ matches ⸣ (codepoints u+2e22 and u+2e23)
⸤ matches ⸥ (codepoints u+2e24 and u+2e25)
⸦ matches ⸧ (codepoints u+2e26 and u+2e27)
⸨ matches ⸩ (codepoints u+2e28 and u+2e29)
〈 matches 〉 (codepoints u+3008 and u+3009)
《 matches 》 (codepoints u+300a and u+300b)
「 matches 」 (codepoints u+300c and u+300d)
『 matches 』 (codepoints u+300e and u+300f)
【 matches 】 (codepoints u+3010 and u+3011)
〔 matches 〕 (codepoints u+3014 and u+3015)
〖 matches 〗 (codepoints u+3016 and u+3017)
〘 matches 〙 (codepoints u+3018 and u+3019)
〚 matches 〛 (codepoints u+301a and u+301b)
〝 matches 〞 (codepoints u+301d and u+301e)
﴾ matches ﴿ (codepoints u+fd3e and u+fd3f)
︗ matches ︘ (codepoints u+fe17 and u+fe18)
︵ matches ︶ (codepoints u+fe35 and u+fe36)
︷ matches ︸ (codepoints u+fe37 and u+fe38)
︹ matches ︺ (codepoints u+fe39 and u+fe3a)
︻ matches ︼ (codepoints u+fe3b and u+fe3c)
︽ matches ︾ (codepoints u+fe3d and u+fe3e)
︿ matches ﹀ (codepoints u+fe3f and u+fe40)
﹁ matches ﹂ (codepoints u+fe41 and u+fe42)
﹃ matches ﹄ (codepoints u+fe43 and u+fe44)
﹇ matches ﹈ (codepoints u+fe47 and u+fe48)
﹙ matches ﹚ (codepoints u+fe59 and u+fe5a)
﹛ matches ﹜ (codepoints u+fe5b and u+fe5c)
﹝ matches ﹞ (codepoints u+fe5d and u+fe5e)
（ matches ） (codepoints u+ff08 and u+ff09)
［ matches ］ (codepoints u+ff3b and u+ff3d)
｛ matches ｝ (codepoints u+ff5b and u+ff5d)
｟ matches ｠ (codepoints u+ff5f and u+ff60)
｢ matches ｣ (codepoints u+ff62 and u+ff63)

U+201A is SINGLE LOW-9 QUOTATION MARK and U+201E is DOUBLE LOW-9 QUOTATION MARK; neither has a matching END_PUNCTUATION character. For all the others the method finds a partner, so it appears to work for every character that has one. However, nothing guarantees that this will hold.
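To answer the original question for a single character, the same heuristic can be wrapped in a lookup method. This is only a sketch under the assumption stated above (the partner sits one or two code points after the opener, inside the same Unicode block); the class name ClosingBracket, the method findClosing, and the constant MAX_DISTANCE are made up for illustration.

```java
final class ClosingBracket {
    // Assumption taken from the output above: the partner is 1 or 2 code points away.
    private static final int MAX_DISTANCE = 2;

    /** Returns the guessed closing code point, or -1 if none is found nearby. */
    static int findClosing(int open) {
        if (Character.getType(open) != Character.START_PUNCTUATION) {
            return -1;
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(open);
        for (int c = open + 1;
                c <= open + MAX_DISTANCE && c <= Character.MAX_CODE_POINT; c++) {
            // Stop at a block boundary (of() returns null outside any block).
            if (!block.equals(Character.UnicodeBlock.of(c))) {
                return -1;
            }
            if (Character.getType(c) == Character.END_PUNCTUATION) {
                return c;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int open : new int[] {'[', '(', '{', 0x201A}) {
            int close = findClosing(open);
            System.out.println(new String(Character.toChars(open)) + " -> "
                    + (close < 0 ? "no match" : new String(Character.toChars(close))));
        }
    }
}
```

Returning -1 rather than throwing keeps the "no partner" case (such as U+201A) cheap to check for; the result is still a guess, not a guarantee.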

Answer:

There is a character property called "bidi mirroring glyph", which gives you the mirror image of a character if one exists. This property is required for correct layout of bidirectional text: in right-to-left languages an opening parenthesis has to open to the left, so the text layout engine has to use the glyph for the closing parenthesis ) instead of the glyph for the character originally in the text.

Unfortunately the standard Java API does not give you access to the mirroring glyph property, but the ICU4J library does, using the UCharacter.getMirror method.
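What the standard API does offer is the boolean Character.isMirrored(int), which reports whether a code point is mirrored in bidirectional text but not which character is its mirror. A small sketch of that check (the class name MirroredCheck and the method describe are made up for illustration); note that a true result does not by itself prove that a mirror character exists.

```java
final class MirroredCheck {
    /** Formats a code point together with its Character.isMirrored result. */
    static String describe(int codePoint) {
        return new String(Character.toChars(codePoint))
                + " mirrored: " + Character.isMirrored(codePoint);
    }

    public static void main(String[] args) {
        // Brackets are mirrored in right-to-left text; ordinary letters are not.
        for (int cp : new int[] {'(', '[', 'a'}) {
            System.out.println(describe(cp));
        }
    }
}
```

This can serve as a cheap pre-filter before a neighbor search, while the actual partner still has to come from ICU4J or from a heuristic.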

A "mostly correct" alternative is to start from the given opening character, check whether one of the next few characters is a closing punctuation character, and assume that it is the correct mirror image. Reading the mirroring data, you can see that the mirrors are usually next to each other, with very few exceptions. One example of an exception: U+2298 CIRCLED DIVISION SLASH is the mirror of U+29B8 CIRCLED REVERSE SOLIDUS (then again, these characters are not in the punctuation category).
