Quantcast
Channel: VBForums - CodeBank - Visual Basic 6 and earlier
Viewing all articles
Browse latest Browse all 1528

String functions for VB6 to handle BSTR strings with surrogate-pairs

$
0
0
Ok, I just knocked this out today. I've now tested fairly well, including edge conditions, so I'm done unless someone finds a bug.

Others are also certainly willing to test and report any problems.

For changes/updates, see "Version" comments in the code.

It'll be curious if anyone actually uses this thing. But it sure creates some discussion when this UCS-2 vs UTF-16 issue comes up.

For a BAS module:
Code:


Option Explicit
'
'  Version 1.00    ' Posted on VBForums CodeBank.
'  Version 1.01    ' Fixed LeftEx edge condition.
'  Version 1.02    ' Fixed math problem in MidEx.
'  Version 1.03    ' Fixed iStart default in InstrEx.
'  Version 1.04    ' Fixed logic problem in InstrEx and InstrRevEx.
'
' Some explanation:
'
'  The UTF-16 character set encoding is made up of the following:
'      The UCS-2 characters, which are always 2-bytes.
'      Surrogate-pair characters, which are always 4-bytes.
'
'  If a character is a surrogate-pair:
'      The low-order-word  is always in the range of &HDC00 to &HDFFF.
'      The high-order-word is always in the range of &HD800 to &HDBFF.
'
'  To avoid any possible confusion, if a character is not a surrogate-pair,
'  it can't be anywhere in the range between &HD800 and &HDFFF.
'  That's part of the UTF-16 specifications.
'
'  The built-in VB6 functions always assume the characters are UCS-2
'  characters, i.e., 2-bytes long.  Therefore, we need a special set
'  of functions to deal with strings that may contain surrogate-pairs.
'
'  Just as a note, the above does provide an opportunity for "garbage"
'  to be in a string.  For instance, if a word in a string is in the
'  range of >=&HD800 And <=&HDBFF, but the next word isn't in the
'  range of >=&HDC00 And <=&HDFFF, this would be garbage.  In the same
'  vane, if a word is in the range of >=&HDC00 And <=&HDFFF, but the
'  prior word isn't in the range of >=&HD800 And <=&HDBFF, this would
'  be garbage as well.  And the following don't check for such garbage,
'  and may return inaccurate results if a string has such garbage.

'
'  Functions reworked:
'      AscWEx
'      ChrWEx
'      InStrEx
'      InStrRevEx
'      LeftEx
'      LenEx
'      MidEx
'      RightEx
'
'      Split and Join should work just fine as they are.
'
'  Extra "helper" functions (that can be used by anyone):
'      IsUcs2Char
'      IsLowSurrogate
'      IsHighSurrogate
'      IsSurrogatePair
'      HasSurrogatePair
'      SurrogatePairCount
'
Private Declare Function GetMem4 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
Private Declare Function GetMem2 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
'


Public Function AscWEx(sChar As String) As Long
    ' Returns double-word, so surrogate-pairs can be represented.
    ' Examines only the first character of the string.
    '
    If IsSurrogatePair(sChar) Then  ' Words are swapped to accomodate Little Endian (LE).
        GetMem2 ByVal StrPtr(sChar) + 2&, AscWEx
        GetMem2 ByVal StrPtr(sChar), ByVal VarPtr(AscWEx) + 2&
    Else
        GetMem2 ByVal StrPtr(sChar), AscWEx
    End If
End Function

Public Function ChrWEx(ByVal iChar As Long) As String
    ' Ok, let's check for a bit of garbage.
    If iChar >= &HDC000000 And iChar <= &HDFFFFFFF Then Err.Raise 5&
    ' Now, let's just decide if we're dealing with a surrogate-pair or not.
    If iChar >= &HD8000000 And iChar <= &HDBFFFFFF Then
        ChrWEx = "  "
        GetMem2 iChar, ByVal StrPtr(ChrWEx) + 2&                ' Still must deal with LE going back in.
        GetMem2 ByVal VarPtr(iChar) + 2&, ByVal StrPtr(ChrWEx)
    Else
        ' If it's not a surrogate pair, we're going to ignore the high word.
        ChrWEx = ChrW$(CInt(iChar And &HFFFF))
    End If
End Function

Public Function InStrEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = 1&) As Long
    ' The optional iStart is at the end, as VB6 doesn't provide the kind of overloading the InStr() function does.
    ' All the InStrEx searches as done as vbBinaryCompare (the Instr() default) as that's the only want that really makes sense when searching for surrogate-pairs.
    ' iStart counts surrogate-pair characters only once, to respect them as characters.
    '
    ' The only real issue with this one (as opposed to Instr) is correctly handling iStart.
    '
    If iStart < 1& Then Err.Raise 5&        ' Same way Instr() handles it.
    Dim iPreCnt As Long
    iPreCnt = SurrogatePairCount(LeftEx(sHay, iStart - 1&))
    iStart = iStart + iPreCnt
    InStrEx = InStr(iStart, sHay, sNeedle, vbBinaryCompare) - iPreCnt
End Function

Public Function InStrRevEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = -1&) As Long
    ' All the InStrEx searches as done as vbBinaryCompare (the InstrRev() default) as that's the only want that really makes sense when searching for surrogate-pairs.
    ' iStart counts surrogate-pair characters only once, to respect them as characters.
    '
    Dim iPairCount As Long
    iPairCount = SurrogatePairCount(sHay)                      ' We'll need this a couple of times.
    Dim iLenEx As Long
    iLenEx = Len(sHay) - iPairCount                            ' We'll need this a couple of times.
    '
    If iStart = -1& Then iStart = iLenEx
    If iStart < 1& Then Err.Raise 5&                            ' Same way InStrRev() handles it.
    If iLenEx < iStart Then Exit Function                      ' Same way InStrRev() does it, even though it doesn't really make sense.
    '
    iStart = iStart + SurrogatePairCount(LeftEx(sHay, iStart))  ' Make it a no-surrogate-pair version, so we can use it in InStrRev().
    InStrRevEx = InStrRev(sHay, sNeedle, iStart)                ' But we haven't correctly handled surrogate-pairs, yet.
    If InStrRevEx > 1& Then                                    ' If it's one, we're good to go, either way.
        InStrRevEx = InStrRevEx - SurrogatePairCount(Left$(sHay, InStrRevEx - 1&))
    End If                                                      ' Above, adjust for surrogate pairs prior to our "find".
End Function

Public Function LeftEx(sStr As String, ByVal iLength As Long) As String
    ' We assume that iLength is characters, including surrogate-pairs (counted once each).
    '
    If iLength = 0& Then Exit Function                          ' Easy.
    If iLength < 0& Then Err.Raise 5&                          ' Same way Left$() handles it.
    '
    LeftEx = Left$(sStr, iLength + SurrogatePairCount(sStr))    ' Start by assuming they're all in the piece we want.
    Do                                                          ' Loop until we've trimmed to correct length.
        If LenEx(LeftEx) <= iLength Then Exit Function          ' Return when we've got the correct length (or asked for more than there are).
        '                                                      ' This test works even if there aren't surrogate-pairs in the tested piece.
        If IsLowSurrogate(Right$(LeftEx, 1&)) Then              ' Is the right-most word a high of a surrogate-pair?
            LeftEx = Left$(LeftEx, Len(LeftEx) - 2&)            ' Trim surrogate-pair.
        Else
            LeftEx = Left$(LeftEx, Len(LeftEx) - 1&)            ' Trim UCS-2 character.
        End If
    Loop
End Function

Public Function LenEx(sStr As String) As Long
    LenEx = Len(sStr) - SurrogatePairCount(sStr)
End Function

Public Function MidEx(sStr As String, ByVal iStart As Long, Optional ByVal iLength As Long) As String
    ' We assume that iStart and iLength is characters, including surrogate-pairs (counted once each).
    '
    If iStart < 1& Then Err.Raise 5&                            ' Same way Mid$() handles it.
    iStart = iStart - 1&                                        ' Make iStart 0 based, it's just easier.
    If iLength < 0& Then Err.Raise 5&                          ' Same way Mid$() handles it.
    If iLength = 0& Then iLength = &H7FFFFFFF                  ' Just makes it easy.  We want all that's remaining.
    Dim iLenEx As Long
    iLenEx = LenEx(sStr)
    If iLength > iLenEx - iStart Then                          ' Adjust length to be exactly what we want.
        iLength = iLenEx - iStart
    End If
    If iLength <= 0& Then Exit Function                        ' Return empty string, same way Mid$() does it.
    MidEx = LeftEx(RightEx(sStr, iLenEx - iStart), iLength)
End Function

Public Function RightEx(sStr As String, ByVal iLength As Long) As String
    ' We assume that iLength is characters, including surrogate-pairs (counted once each).
    '
    If iLength = 0& Then Exit Function                          ' Easy.
    If iLength < 0& Then Err.Raise 5&                          ' Same way Right$() handles it.
    '
    RightEx = Right$(sStr, iLength + SurrogatePairCount(sStr))  ' Start by assuming they're all in the piece we want.
    Do                                                          ' Loop until we've trimmed to correct length.
        If LenEx(RightEx) <= iLength Then Exit Function        ' Return when we've got the correct length (or asked for more than there are).
        '                                                      ' This test works even if there aren't surrogate-pairs in the tested piece.
        If IsSurrogatePair(RightEx) Then                        ' Is the left-most character a surrogate-pair?
            RightEx = Right$(RightEx, Len(RightEx) - 2&)        ' Trim surrogate-pair.
        Else
            RightEx = Right$(RightEx, Len(RightEx) - 1&)        ' Trim UCS-2 character.
        End If
    Loop
End Function



Public Function IsUcs2Char(sChar As String) As Boolean
    ' Only tests the first character of sChar.
    ' Just say "Not IsUcs2Char" to see if a word of a surrogate-pair.
    If Len(sChar) = 0& Then Exit Function
    Dim i As Integer:  i = AscW(sChar)
    IsUcs2Char = i < &HD800 Or i > &HDFFF
End Function

Public Function IsLowSurrogate(sChar As String) As Boolean
    ' Only tests the first character of sChar.
    If Len(sChar) = 0& Then Exit Function
    Dim i As Integer:  i = AscW(sChar)
    IsLowSurrogate = i >= &HDC00 And i <= &HDFFF
End Function

Public Function IsHighSurrogate(sChar As String) As Boolean
    ' Only tests the first character of sChar.
    If Len(sChar) = 0& Then Exit Function
    Dim i As Integer:  i = AscW(sChar)
    IsHighSurrogate = i >= &HD800 And i <= &HDBFF
End Function

Public Function IsSurrogatePair(sChar As String) As Boolean
    ' Looks precisely at the first FOUR bytes in sChar.
    If Len(sChar) = 0& Then Exit Function
    Static i(1&) As Integer
    GetMem4 ByVal StrPtr(sChar), i(0&)  ' This is safe because, if we're at the last character, we'll just get the null terminator.
    IsSurrogatePair = i(1&) >= &HDC00 And i(1&) <= &HDFFF And i(0&) >= &HD800 And i(0&) <= &HDBFF
End Function

Public Function HasSurrogatePair(sStr As String) As Boolean
    Dim bb() As Byte:  bb = sStr
    Dim i As Long
    For i = 1& To UBound(bb) Step 2&    ' We're looking only at the high-bytes.
        If bb(i) >= &HD8 And bb(i) <= &HDB Then
            If i + 2& <= UBound(bb) Then
                HasSurrogatePair = bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF
            End If
            Exit Function      ' If we return False here, the string has garbage.
        End If
    Next
End Function

Public Function SurrogatePairCount(sStr As String) As Long
    Dim bb() As Byte:  bb = sStr
    Dim i As Long
    For i = 1& To UBound(bb) Step 2&    ' We're looking only at the high-bytes.
        If bb(i) >= &HD8 And bb(i) <= &HDB Then
            If i + 2& <= UBound(bb) Then
                If bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF Then
                    SurrogatePairCount = SurrogatePairCount + 1&
                    i = i + 2&
                End If
            End If
        End If
    Next
End Function


There's some testing code in the next post.

Enjoy,
Elroy

Viewing all articles
Browse latest Browse all 1528

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>