Ok, I just knocked this out today. I've now tested fairly well, including edge conditions, so I'm done unless someone finds a bug.
Others are also certainly willing to test and report any problems.
For changes/updates, see "Version" comments in the code.
It'll be curious if anyone actually uses this thing. But it sure creates some discussion when this UCS-2 vs UTF-16 issue comes up.
For a BAS module:
There's some testing code in the next post.
Enjoy,
Elroy
Others are also certainly willing to test and report any problems.
For changes/updates, see "Version" comments in the code.
It'll be curious if anyone actually uses this thing. But it sure creates some discussion when this UCS-2 vs UTF-16 issue comes up.
For a BAS module:
Code:
Option Explicit
'
' Version 1.00 ' Posted on VBForums CodeBank.
' Version 1.01 ' Fixed LeftEx edge condition.
' Version 1.02 ' Fixed math problem in MidEx.
' Version 1.03 ' Fixed iStart default in InstrEx.
' Version 1.04 ' Fixed logic problem in InstrEx and InstrRevEx.
'
' Some explanation:
'
' The UTF-16 character set encoding is made up of the following:
' The UCS-2 characters, which are always 2-bytes.
' Surrogate-pair characters, which are always 4-bytes.
'
' If a character is a surrogate-pair:
' The low-order-word is always in the range of &HDC00 to &HDFFF.
' The high-order-word is always in the range of &HD800 to &HDBFF.
'
' To avoid any possible confusion, if a character is not a surrogate-pair,
' it can't be anywhere in the range between &HD800 and &HDFFF.
' That's part of the UTF-16 specifications.
'
' The built-in VB6 functions always assume the characters are UCS-2
' characters, i.e., 2-bytes long. Therefore, we need a special set
' of functions to deal with strings that may contain surrogate-pairs.
'
' Just as a note, the above does provide an opportunity for "garbage"
' to be in a string. For instance, if a word in a string is in the
' range of >=&HD800 And <=&HDBFF, but the next word isn't in the
' range of >=&HDC00 And <=&HDFFF, this would be garbage. In the same
' vane, if a word is in the range of >=&HDC00 And <=&HDFFF, but the
' prior word isn't in the range of >=&HD800 And <=&HDBFF, this would
' be garbage as well. And the following don't check for such garbage,
' and may return inaccurate results if a string has such garbage.
'
' Functions reworked:
' AscWEx
' ChrWEx
' InStrEx
' InStrRevEx
' LeftEx
' LenEx
' MidEx
' RightEx
'
' Split and Join should work just fine as they are.
'
' Extra "helper" functions (that can be used by anyone):
' IsUcs2Char
' IsLowSurrogate
' IsHighSurrogate
' IsSurrogatePair
' HasSurrogatePair
' SurrogatePairCount
'
Private Declare Function GetMem4 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
Private Declare Function GetMem2 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
'
Public Function AscWEx(sChar As String) As Long
' Returns double-word, so surrogate-pairs can be represented.
' Examines only the first character of the string.
'
If IsSurrogatePair(sChar) Then ' Words are swapped to accomodate Little Endian (LE).
GetMem2 ByVal StrPtr(sChar) + 2&, AscWEx
GetMem2 ByVal StrPtr(sChar), ByVal VarPtr(AscWEx) + 2&
Else
GetMem2 ByVal StrPtr(sChar), AscWEx
End If
End Function
Public Function ChrWEx(ByVal iChar As Long) As String
' Ok, let's check for a bit of garbage.
If iChar >= &HDC000000 And iChar <= &HDFFFFFFF Then Err.Raise 5&
' Now, let's just decide if we're dealing with a surrogate-pair or not.
If iChar >= &HD8000000 And iChar <= &HDBFFFFFF Then
ChrWEx = " "
GetMem2 iChar, ByVal StrPtr(ChrWEx) + 2& ' Still must deal with LE going back in.
GetMem2 ByVal VarPtr(iChar) + 2&, ByVal StrPtr(ChrWEx)
Else
' If it's not a surrogate pair, we're going to ignore the high word.
ChrWEx = ChrW$(CInt(iChar And &HFFFF))
End If
End Function
Public Function InStrEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = 1&) As Long
' The optional iStart is at the end, as VB6 doesn't provide the kind of overloading the InStr() function does.
' All the InStrEx searches as done as vbBinaryCompare (the Instr() default) as that's the only want that really makes sense when searching for surrogate-pairs.
' iStart counts surrogate-pair characters only once, to respect them as characters.
'
' The only real issue with this one (as opposed to Instr) is correctly handling iStart.
'
If iStart < 1& Then Err.Raise 5& ' Same way Instr() handles it.
Dim iPreCnt As Long
iPreCnt = SurrogatePairCount(LeftEx(sHay, iStart - 1&))
iStart = iStart + iPreCnt
InStrEx = InStr(iStart, sHay, sNeedle, vbBinaryCompare) - iPreCnt
End Function
Public Function InStrRevEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = -1&) As Long
' All the InStrEx searches as done as vbBinaryCompare (the InstrRev() default) as that's the only want that really makes sense when searching for surrogate-pairs.
' iStart counts surrogate-pair characters only once, to respect them as characters.
'
Dim iPairCount As Long
iPairCount = SurrogatePairCount(sHay) ' We'll need this a couple of times.
Dim iLenEx As Long
iLenEx = Len(sHay) - iPairCount ' We'll need this a couple of times.
'
If iStart = -1& Then iStart = iLenEx
If iStart < 1& Then Err.Raise 5& ' Same way InStrRev() handles it.
If iLenEx < iStart Then Exit Function ' Same way InStrRev() does it, even though it doesn't really make sense.
'
iStart = iStart + SurrogatePairCount(LeftEx(sHay, iStart)) ' Make it a no-surrogate-pair version, so we can use it in InStrRev().
InStrRevEx = InStrRev(sHay, sNeedle, iStart) ' But we haven't correctly handled surrogate-pairs, yet.
If InStrRevEx > 1& Then ' If it's one, we're good to go, either way.
InStrRevEx = InStrRevEx - SurrogatePairCount(Left$(sHay, InStrRevEx - 1&))
End If ' Above, adjust for surrogate pairs prior to our "find".
End Function
Public Function LeftEx(sStr As String, ByVal iLength As Long) As String
' We assume that iLength is characters, including surrogate-pairs (counted once each).
'
If iLength = 0& Then Exit Function ' Easy.
If iLength < 0& Then Err.Raise 5& ' Same way Left$() handles it.
'
LeftEx = Left$(sStr, iLength + SurrogatePairCount(sStr)) ' Start by assuming they're all in the piece we want.
Do ' Loop until we've trimmed to correct length.
If LenEx(LeftEx) <= iLength Then Exit Function ' Return when we've got the correct length (or asked for more than there are).
' ' This test works even if there aren't surrogate-pairs in the tested piece.
If IsLowSurrogate(Right$(LeftEx, 1&)) Then ' Is the right-most word a high of a surrogate-pair?
LeftEx = Left$(LeftEx, Len(LeftEx) - 2&) ' Trim surrogate-pair.
Else
LeftEx = Left$(LeftEx, Len(LeftEx) - 1&) ' Trim UCS-2 character.
End If
Loop
End Function
Public Function LenEx(sStr As String) As Long
LenEx = Len(sStr) - SurrogatePairCount(sStr)
End Function
Public Function MidEx(sStr As String, ByVal iStart As Long, Optional ByVal iLength As Long) As String
' We assume that iStart and iLength is characters, including surrogate-pairs (counted once each).
'
If iStart < 1& Then Err.Raise 5& ' Same way Mid$() handles it.
iStart = iStart - 1& ' Make iStart 0 based, it's just easier.
If iLength < 0& Then Err.Raise 5& ' Same way Mid$() handles it.
If iLength = 0& Then iLength = &H7FFFFFFF ' Just makes it easy. We want all that's remaining.
Dim iLenEx As Long
iLenEx = LenEx(sStr)
If iLength > iLenEx - iStart Then ' Adjust length to be exactly what we want.
iLength = iLenEx - iStart
End If
If iLength <= 0& Then Exit Function ' Return empty string, same way Mid$() does it.
MidEx = LeftEx(RightEx(sStr, iLenEx - iStart), iLength)
End Function
Public Function RightEx(sStr As String, ByVal iLength As Long) As String
' We assume that iLength is characters, including surrogate-pairs (counted once each).
'
If iLength = 0& Then Exit Function ' Easy.
If iLength < 0& Then Err.Raise 5& ' Same way Right$() handles it.
'
RightEx = Right$(sStr, iLength + SurrogatePairCount(sStr)) ' Start by assuming they're all in the piece we want.
Do ' Loop until we've trimmed to correct length.
If LenEx(RightEx) <= iLength Then Exit Function ' Return when we've got the correct length (or asked for more than there are).
' ' This test works even if there aren't surrogate-pairs in the tested piece.
If IsSurrogatePair(RightEx) Then ' Is the left-most character a surrogate-pair?
RightEx = Right$(RightEx, Len(RightEx) - 2&) ' Trim surrogate-pair.
Else
RightEx = Right$(RightEx, Len(RightEx) - 1&) ' Trim UCS-2 character.
End If
Loop
End Function
Public Function IsUcs2Char(sChar As String) As Boolean
' Only tests the first character of sChar.
' Just say "Not IsUcs2Char" to see if a word of a surrogate-pair.
If Len(sChar) = 0& Then Exit Function
Dim i As Integer: i = AscW(sChar)
IsUcs2Char = i < &HD800 Or i > &HDFFF
End Function
Public Function IsLowSurrogate(sChar As String) As Boolean
' Only tests the first character of sChar.
If Len(sChar) = 0& Then Exit Function
Dim i As Integer: i = AscW(sChar)
IsLowSurrogate = i >= &HDC00 And i <= &HDFFF
End Function
Public Function IsHighSurrogate(sChar As String) As Boolean
' Only tests the first character of sChar.
If Len(sChar) = 0& Then Exit Function
Dim i As Integer: i = AscW(sChar)
IsHighSurrogate = i >= &HD800 And i <= &HDBFF
End Function
Public Function IsSurrogatePair(sChar As String) As Boolean
' Looks precisely at the first FOUR bytes in sChar.
If Len(sChar) = 0& Then Exit Function
Static i(1&) As Integer
GetMem4 ByVal StrPtr(sChar), i(0&) ' This is safe because, if we're at the last character, we'll just get the null terminator.
IsSurrogatePair = i(1&) >= &HDC00 And i(1&) <= &HDFFF And i(0&) >= &HD800 And i(0&) <= &HDBFF
End Function
Public Function HasSurrogatePair(sStr As String) As Boolean
Dim bb() As Byte: bb = sStr
Dim i As Long
For i = 1& To UBound(bb) Step 2& ' We're looking only at the high-bytes.
If bb(i) >= &HD8 And bb(i) <= &HDB Then
If i + 2& <= UBound(bb) Then
HasSurrogatePair = bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF
End If
Exit Function ' If we return False here, the string has garbage.
End If
Next
End Function
Public Function SurrogatePairCount(sStr As String) As Long
Dim bb() As Byte: bb = sStr
Dim i As Long
For i = 1& To UBound(bb) Step 2& ' We're looking only at the high-bytes.
If bb(i) >= &HD8 And bb(i) <= &HDB Then
If i + 2& <= UBound(bb) Then
If bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF Then
SurrogatePairCount = SurrogatePairCount + 1&
i = i + 2&
End If
End If
End If
Next
End Function
Enjoy,
Elroy