String.split documentation is misleading #46280

Closed
jamesderlin opened this issue Jun 7, 2021 · 2 comments
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-core

Comments

@jamesderlin
Contributor

jamesderlin commented Jun 7, 2021

https://api.dart.dev/stable/2.13.2/dart-core/String/split.html states:

Empty matches at the beginning and end of the strings are ignored, and so are empty matches right after another match.

var string = "abba";
// Matches:   ^^ ^^
string.split(RegExp(r"b*"));        // ['a', 'a']
                                    // not ['', 'a', 'a', '']
                                    // not ['a', '', 'a']
  1. I don't understand the usage of "matches" in the above statement. The search string being matched is where splits should occur, but "empty matches ... are ignored" and the example seem to imply that it's talking about the resulting tokens, which is inconsistent verbiage.

  2. The Matches: ^^ ^^ doesn't make sense to me. Is it supposed to be showing what's matched by the search pattern? If so, shouldn't it be pointing to just bb? Or if it's supposed to be pointing to the resulting tokens, shouldn't it be pointing to the two a's?

  3. The code behaves as described only because of the regular expression used. It is not generally true that empty matches are ignored. If we instead used string.split('b'), it would result in ['a', '', 'a'], or if we used string.split('a'), we'd get ['', 'bb', ''].
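The three results mentioned in point 3 can be reproduced with a short sketch. The helper name splitAbba is hypothetical and exists only for this illustration; the expected values in the comments are the ones quoted in the issue.

```dart
// Hypothetical helper: splits the fixed input 'abba' on the given pattern.
List<String> splitAbba(Pattern pattern) => 'abba'.split(pattern);

void main() {
  // The pattern from the documentation example: the empty matches are skipped.
  print(splitAbba(RegExp(r'b*'))); // [a, a]

  // Two adjacent non-empty matches leave an empty token between them.
  print(splitAbba('b')); // [a, , a]

  // Non-empty matches at the start and end leave empty tokens at both ends.
  print(splitAbba('a')); // [, bb, ]
}
```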

@devoncarew devoncarew added area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-core labels Jun 7, 2021
@jamesderlin
Contributor Author

jamesderlin commented Jun 7, 2021

Okay, I think I see now what it's trying to say. The RE b* produces an empty match at the beginning and at the end of the string, so this part of the documentation is using "matches" consistently with the rest. I misunderstood what the example was trying to demonstrate. The example should perhaps split something like 'aabbaa' instead (which produces ['a', 'a', 'a', 'a']).

I'm still confused by point 2 and think that the documentation should clearly state that consecutive matches of the search pattern aren't automatically coalesced (and therefore can result in empty tokens in the result).
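A minimal sketch of the point about coalescing, assuming the caller wants no empty tokens. The helper splitNonEmpty is hypothetical and not part of the SDK; using b+ or filtering the result are just two possible workarounds.

```dart
// Hypothetical helper: splits, then drops any empty tokens from the result.
List<String> splitNonEmpty(String input, Pattern pattern) =>
    input.split(pattern).where((token) => token.isNotEmpty).toList();

void main() {
  // Consecutive matches of 'b' are not coalesced, so an empty token appears.
  print('aabbaa'.split('b')); // [aa, , aa]

  // A pattern that consumes the whole run of b's yields no empty token.
  print('aabbaa'.split(RegExp(r'b+'))); // [aa, aa]

  // Filtering after the split gives the same result for this input.
  print(splitNonEmpty('aabbaa', 'b')); // [aa, aa]

  // With b*, the empty matches between the a's also act as split points.
  print('aabbaa'.split(RegExp(r'b*'))); // [a, a, a, a]
}
```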

@lrhn
Member

lrhn commented Jun 8, 2021

More documentation and more examples are probably a good idea. The behavior is complex and slightly inconsistent (leading/trailing empty matches are ignored).

An example could be prose like:

The string "abba" contains four matches of RegExp("b*"): An empty match before the first a, a match of bb,
an empty match after bb and before a, and an empty match after the last a (see [RegExp.allMatches]).
The split method ignores empty matches at the start and end of the input string, as well as right after another match,
so only the match of bb is used for splitting. The result is therefore ["a", "a"].

If a non-empty match immediately follows another match, the two are not combined, and the result will contain the
empty string between the two matches.
Also, an empty match followed by a non-empty match at the same position is treated as two matches.
That cannot occur naturally with a [String] or [RegExp] pattern; it requires a custom-written
[Pattern] implementation which can somehow produce different matches at the same point of the string.
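The four matches described above can be made visible with Pattern.allMatches. The sketch below wraps each match in brackets; markMatches is a hypothetical helper written for this illustration, not an SDK function.

```dart
// Hypothetical helper: returns the input with every match of the pattern
// wrapped in square brackets, so empty matches show up as "[]".
String markMatches(String input, Pattern pattern) {
  final buffer = StringBuffer();
  var previousEnd = 0;
  for (final match in pattern.allMatches(input)) {
    buffer
      ..write(input.substring(previousEnd, match.start)) // unmatched text
      ..write('[')
      ..write(input.substring(match.start, match.end)) // matched text
      ..write(']');
    previousEnd = match.end;
  }
  buffer.write(input.substring(previousEnd)); // trailing unmatched text
  return buffer.toString();
}

void main() {
  // Four matches: empty, bb, empty, empty.
  print(markMatches('abba', RegExp(r'b*'))); // []a[bb][]a[]

  // Only the bb match ends up being used as a split point.
  print('abba'.split(RegExp(r'b*'))); // [a, a]

  // A plain string pattern produces two adjacent non-empty matches.
  print(markMatches('abba', 'b')); // a[b][b]a
}
```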

The ^ notation isn't particularly readable. (Seemed like a great idea at the time!)

Another option is to mark the actual matches as []a[bb][]a[], but it will still need prose to explain.

The behavior is not because it's a RegExp, it's general behavior for empty matches. If you do "abba".split(""), you get ["a", "b", "b", "a"]. We still ignore leading and trailing empty matches.
We don't need to ignore empty matches after another match because our String.allMatches deliberately avoids those. If we write a Pattern implementation which doesn't (like RegExp, but not necessarily being RegExp) the rule still counts.
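As a small sketch of the empty-pattern case, the snippet below splits on an empty string and on an empty RegExp. The helper splitOnEmpty is hypothetical; the point is only that the empty matches at the start and end do not produce empty tokens.

```dart
// Hypothetical helper: splits the input on a pattern that only matches
// the empty string.
List<String> splitOnEmpty(String input, Pattern emptyPattern) =>
    input.split(emptyPattern);

void main() {
  // Empty string pattern: one token per character, no leading or trailing ''.
  print(splitOnEmpty('abba', '')); // [a, b, b, a]

  // An empty RegExp behaves the same way, so this is not RegExp-specific.
  print(splitOnEmpty('abba', RegExp(''))); // [a, b, b, a]
}
```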
