Getting the most out of regular expressions

Doug Hoyte

Regexp::Debugger

What is the regexp engine actually doing? Use rxrx
Which parts of the regexp and input data are most expensive?

Demo: match

/[ab]{5}bbb/

Against

'aaaaaaaaaaaaaaaaQQQQQQQQQaaaaaaaaaaaaaaaQQQQQQQQQQQQ'

Press 'h' in the debugger to generate heat-maps

Regexp::Assemble - Demo

Assembles multiple regexps into a single regexp as a trie data-structure
To search for the following sequences
```
TTGATG  TTGGAC  TTCAAG  TTCAAC
```
The obvious regular expression is
```
(TTGATG|TTGGAC|TTCAAG|TTCAAC)
```
This is what Regexp::Assemble compiles
```
TT(G(ATG|GAC)|CAA[CG])
```
Demo
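
One way to convince yourself that the assembled pattern accepts exactly the same strings as the obvious alternation is to try both against every 6-letter DNA string. The sketch below is plain perl and does not call Regexp::Assemble itself; the names `all_strings`, `$obvious` and `$assembled` are illustrative, and both patterns are anchored so that only whole-string matches count.

```perl
use strict;
use warnings;

# The two patterns from the slides, anchored to the whole string.
my $obvious   = qr/\A(?:TTGATG|TTGGAC|TTCAAG|TTCAAC)\z/;
my $assembled = qr/\A(?:TT(?:G(?:ATG|GAC)|CAA[CG]))\z/;

# Enumerate every string of the given length over the alphabet A, C, G, T.
sub all_strings {
    my ($length) = @_;
    my @strings = ('');
    for (1 .. $length) {
        my @longer;
        for my $prefix (@strings) {
            for my $base (qw(A C G T)) {
                push @longer, $prefix . $base;
            }
        }
        @strings = @longer;
    }
    return @strings;
}

my @candidates = all_strings(6);    # 4**6 = 4096 strings

# Strings the assembled pattern accepts.
my @accepted = grep { $_ =~ $assembled } @candidates;

# Strings on which the two patterns disagree.
my @mismatches = grep {
    my $s = $_;
    ($s =~ $obvious ? 1 : 0) != ($s =~ $assembled ? 1 : 0)
} @candidates;

print "accepted: @accepted\n";
print "mismatches: ", scalar(@mismatches), "\n";
```

An exhaustive check like this is only practical because the alphabet and the sequence length are tiny.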

Regexp::Assemble - Implementation

But how do we know which sub-expression matched?
One of the ways perl's engine is special is that it can run perl code during the match

Regexp::Assemble uses the (?{ ... }) regexp directive:

TT(G(ATG(?{2})|GAC(?{3}))|CAA(C(?{0})|G(?{1})))

When the regexp engine reaches one of these directives, the code in braces is evaluated and the result is stored in $^R (aka $LAST_REGEXP_CODE_RESULT)
Although here we're just returning integers, this could be arbitrary perl code
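
As a minimal sketch of how the tracked pattern can be used from plain perl (without Regexp::Assemble), the snippet below matches against it and then reads `$^R` to look up which sequence was found. The array `@sequences` and the helper `which_sequence` are hypothetical names; the index order simply mirrors the integers in the pattern above. It assumes a reasonably modern perl where `$^R` is still readable right after a successful match.

```perl
use strict;
use warnings;

# Index N corresponds to the (?{N}) directive in the pattern below.
my @sequences = qw(TTCAAC TTCAAG TTGATG TTGGAC);

# Same structure as the tracked pattern above, anchored, with
# non-capturing groups so that nothing is captured needlessly.
my $tracked = qr/\ATT(?:G(?:ATG(?{2})|GAC(?{3}))|CAA(?:C(?{0})|G(?{1})))\z/;

# Return the sequence that matched, or undef if none did.
sub which_sequence {
    my ($input) = @_;
    return unless $input =~ $tracked;
    return $sequences[$^R];    # $^R holds the last code-block result
}

for my $input (qw(TTGGAC TTCAAG TTTTTT)) {
    my $hit = which_sequence($input);
    print "${input}: ", (defined $hit ? "matched $hit" : "no match"), "\n";
}
```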

Regexp::Grammars

Awesome Damian module (so is Regexp::Debugger)
Adds recursive descent parsing to perl's regexp engine

Only affects regexps defined within its lexical scope:

  {
    use Regexp::Grammars;

    ## regexps here will support grammars
  }

Regexp::Grammars - Parse email addresses

You know how you should never use a regexp to match an email address? Well now you can...
Abridged version of Tom Christiansen's RFC5322 parser:

    my $rfc5322 = qr{
      # Match this...
      <address>

      # As defined by these...
      <token: address>         <mailbox> | <group>
      <token: mailbox>         <name_addr> | <addr_spec>
      <token: name_addr>       <display_name>? <angle_addr>
      <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
      <token: display_name>    <phrase>
      <token: mailbox_list>    <[mailbox]> ** (,)
      <token: addr_spec>       <local_part> \@ <domain>
      ...
    }x;

Regexp::Grammars - Demo

Nested parentheses are the classic example of something you can't do with a regexp
Demo: arith2lisp.pl is a program that parses infix arithmetic expressions and converts them into lisp-style prefix expressions
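
For comparison, perl's core engine can already handle balanced parentheses through recursive sub-patterns, without a grammar module. This is a sketch of that core feature (the `(?1)` recursion syntax), not of Regexp::Grammars or of arith2lisp.pl; the name `$balanced` is illustrative.

```perl
use strict;
use warnings;

my $balanced = qr/
    \A
    (                       # group 1: a run of balanced text
        (?:
            [^()]++         # anything that is not a parenthesis
          |
            \( (?1) \)      # or a parenthesised group, recursing into group 1
        )*
    )
    \z
/x;

for my $expr ('(1 + (2 * 3)) - 4', '(1 + (2 * 3) - 4', ')(') {
    printf "%-20s %s\n", $expr, ($expr =~ $balanced ? 'balanced' : 'unbalanced');
}
```

This only checks that the parentheses nest correctly. Building a parse tree, as arith2lisp.pl does, is where a grammar module earns its keep.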

Regexp::Exhaustive

When you want to find all the occurrences of a regular expression in a string you can use m//g in perl (similar to re.findall in Python or matchAll in JavaScript):
```
    @matches = "AAAA" =~ m/AA/g;
```
These are the matches it returns:
```
    AAAA
    --
      --
```
But what if you want all the possible matches?
```
    AAAA
    --
     --
      --
```

Regexp::Exhaustive - Implementation

Regexp::Exhaustive does this by inserting code that records each successful match and then artificially fails the match in order to force back-tracking

The m/AA/g regexp would be transformed into:

    AA(?{ record_match(); })(*FAIL)

Here is how to get all sub-strings from a string:

      say for exhaustive('abc' => qr/.+/);
      # abc
      # ab
      # a
      # bc
      # b
      # c
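
The record-then-fail idea can be sketched in plain perl. This is not the module's actual implementation: `all_matches` is a hypothetical helper, and the sketch assumes perl 5.18 or later, where code blocks in a pattern close over lexical variables reliably.

```perl
use strict;
use warnings;

# Collect every way the pattern can match, at every start position.
sub all_matches {
    my ($string, $pattern) = @_;
    my @found;

    # Each time the pattern succeeds, record the text matched so far ($&),
    # then (*FAIL) forces the engine to backtrack and try the next option.
    $string =~ /(?:$pattern)(?{ push @found, $& })(*FAIL)/;

    return @found;
}

print "$_\n" for all_matches('abc', qr/.+/);
```

The overall match always fails, so the return value of the match operator is ignored; the results are whatever the code block collected along the way.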

Regexp::Exhaustive - Factorisation

Crazy method to determine if an integer is prime, invented by Abigail:
```
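sub is_prime {

  # Write N in unary. N is not prime if it is 0 or 1 (first alternative)
  # or if it splits into 2+ equal chunks of length >= 2 (second alternative).
  !( (1 x shift) =~ /^.?$|^(..+)\1+$/ )

}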
```

With Regexp::Exhaustive you can find all divisors:

sub divisors {

  map length, exhaustive((1 x shift) =>
                         qr/^(.+?)\1*$/)

}
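
The same divisor search can be sketched without the module by combining a code block with `(*FAIL)`. This is an illustration of the technique rather than how Regexp::Exhaustive is implemented, and it assumes perl 5.18 or later; `\A` and `\z` stand in for `^` and `$`.

```perl
use strict;
use warnings;

# Every chunk length that tiles a unary string of length $n is a divisor of $n.
sub divisors {
    my ($n) = @_;
    my @divisors;

    # (.+?) picks a chunk, \1* repeats it; when the repetition reaches the
    # end of the string, record the chunk length and force backtracking.
    ('1' x $n) =~ /\A(.+?)\1*\z(?{ push @divisors, length $1 })(*FAIL)/;

    return @divisors;
}

print join(' ', divisors(12)), "\n";
```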

Bio::Regexp

Specialised regexp language for biological data like DNA, RNA, and protein sequences
Exhaustive search, even for double-stranded and circular molecules

Bio::Regexp - IUPAC codes

Supports IUPAC abbreviations, which are the same idea as regexp character classes
Regexp character class: \w is short for [a-zA-Z0-9_]
IUPAC abbreviation: Y is short for [CT]
If you buy PCR primers and specify Y in the sequence, roughly 50% of the molecules will have a C at that position and 50% a T
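
To make the analogy with character classes concrete, here is a small sketch that expands IUPAC DNA codes into an ordinary perl regexp. It does not use Bio::Regexp; the `%iupac` table and `iupac_to_regexp` helper are illustrative names, and only the DNA alphabet is covered.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes and the bases each one stands for.
my %iupac = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Translate a sequence written with IUPAC codes into a compiled regexp.
sub iupac_to_regexp {
    my ($sequence) = @_;
    my $pattern = '';
    for my $code (split //, uc $sequence) {
        die "Unknown IUPAC code: $code\n" unless exists $iupac{$code};
        $pattern .= $iupac{$code};
    }
    return qr/$pattern/;
}

my $site = iupac_to_regexp('GAYTC');
for my $dna (qw(GACTC GATTC GAATC)) {
    print "${dna}: ", ($dna =~ $site ? 'match' : 'no match'), "\n";
}
```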

Bio::Regexp - Single pass scans

Because DNA is a double-helix and one strand corresponds to the other strand letter-by-letter, we also need to scan the reverse complement strand
Bio::Regexp can scan for your pattern(s) on the main strand and the reverse complement strand in a single pass so you don't have to copy and reverse the strand (also improves memory locality)
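
For a fixed sequence, one naive way to cover both strands is to search the main strand for the site and for its reverse complement. The sketch below shows that idea in plain perl; it is not how Bio::Regexp works internally, and `reverse_complement` and `find_on_both_strands` are hypothetical helpers.

```perl
use strict;
use warnings;

# Reverse the sequence and swap each base for its complement.
sub reverse_complement {
    my ($sequence) = @_;
    my $rc = reverse $sequence;     # scalar context: reverses the string
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Report hits as "+N" (main strand) or "-N" (reverse complement strand),
# where N is the 0-based offset on the main strand.
sub find_on_both_strands {
    my ($dna, $site) = @_;
    my @hits;
    for my $strand (['+', $site], ['-', reverse_complement($site)]) {
        my ($label, $needle) = @$strand;
        my $pos = -1;
        while (($pos = index($dna, $needle, $pos + 1)) >= 0) {
            push @hits, "$label$pos";
        }
    }
    return @hits;
}

print join(' ', find_on_both_strands('AAGATTCAAGAATCAA', 'GATTC')), "\n";
```

This version scans the input twice, once per strand, which is exactly the overhead a single-pass scanner avoids.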

Bio::Regexp - Circular inputs

Normally no input data is copied at all, except for circular molecules
With circular molecules, only a small part of the sequence needs to be copied: enough to see if any matches span the arbitrary location chosen to be the "start"/"end"
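
As an illustration of the wrap-around idea for a fixed-length site, the sketch below appends the first `length(site) - 1` characters of the sequence to its end before searching. How much Bio::Regexp actually copies is not stated here, so treat that amount as an assumption; `find_circular` is a hypothetical helper and assumes the site is no longer than the sequence.

```perl
use strict;
use warnings;

# Find all start offsets of $site in a circular sequence.
sub find_circular {
    my ($dna, $site) = @_;

    # A match that wraps around needs at most length($site) - 1
    # characters from the beginning of the sequence.
    my $overlap  = length($site) - 1;
    my $extended = $dna . substr($dna, 0, $overlap);

    my @starts;
    my $pos = -1;
    while (($pos = index($extended, $site, $pos + 1)) >= 0) {
        push @starts, $pos;
    }
    return @starts;
}

# 'GATT' only appears across the start/end boundary of this molecule.
print join(' ', find_circular('TTCAAGA', 'GATT')), "\n";
```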

Unicode Properties

The Greek alphabet doesn't round-trip through case conversions (the final sigma form is lost):

$ perl -CAS -E 'use utf8; say lc(uc("Θησέας"))'
θησέασ

But we can use a simple regexp to correct this:

sub greekify { $_[0] =~ s/(?<=\p{Greek})σ\b/ς/gr }

This puts back the final sigma form in Greek text:

greekify(lc(uc("Θησέας")))
=> θησέας

Without breaking other uses of sigma:

greekify(lc(uc("The sample is 5σ from the mean")))
=> the sample is 5σ from the mean

Text::Unidecode

Russian: Good-bye

unidecode("До свидания")
  => Do svidaniia

Thai: Tom Yum (the soup)

unidecode("ต้มยำ")
  => tmyam

Arabic: Good morning ("Sabah el kheer")

unidecode("صباح الخير")
  => SbH lkhyr

With PerlIO::via::Unidecode you can transliterate while reading a file:

open(my $fh, "<:encoding(utf8):via(Unidecode)", $file)

Annotate_unidecode

sub annotate_unidecode {
  my $val = shift;

  $val =~ s{ ( \P{ASCII} \P{Latin}* ) }
           {
             my $match = $1;
             "$match (" . unidecode($match) . ")"
           }egx;

  return $val;
}

Useful for form data (and mixed-script text):

annotate_unidecode("Name: 许勤   City: 深圳")

  => Name: 许勤 (Xu Qin)   City: 深圳 (Shen Zhen)

Bi-Directional Writing

Perl has bi-directional language support through modules such as Text::Bidi and Text::WrapI18N
In pre-Hellenistic Greek, it was common to write in alternating directions: Boustrophedon ("as the ox turns")

Text::Boustrophedon

Most useless module ever, of course done with crazy regexp hacks
😁 😁 😁

$ perl -CAS -MText::Boustrophedon \
       -E 'undef $/; say Text::Boustrophedon::greek(<>)' \
       < input.txt

Achilles glared at him and answered, "Fool, prate not 
       ƨƚnɒnɘvoɔ on ɘd nɒɔ ɘɿɘʜT .ƨƚnɒnɘvoɔ ƚuodɒ ɘm oƚ
between lions and men, wolves and lambs can never be of
     .ʜǫuoɿʜƚ bnɒ ʜǫuoɿʜƚ ɿɘʜƚo ʜɔɒɘ ɘƚɒʜ ƚud ,bnim ɘno
Therefore there can be no understanding between you and
ɘno lliƚ ,ƨu nɘɘwƚɘd ƨƚnɒnɘvoɔ ynɒ ɘd ɘɿɘʜƚ yɒm ɿon ,ɘm
or other shall fall and glut grim Ares with his life's 
                                                ."boold

Re-Entrant Regexp Engine

Re-entrancy is when a function is invoked while already inside that function
Not the same as thread-safety (compare threads vs unix signals)

Before perl 5.14, the engine was not fully re-entrant:

$ perl -e 'print "$]\n"'
5.008009
$ perl -e '"X" =~ m{(??{ "X" =~ m{} })}'
Segmentation fault

Running code from inside the regexp engine is quite useful. It's used by many of the modules discussed: Regexp::Grammars, Regexp::Assemble, Regexp::Exhaustive, Bio::Regexp
Real perl code needs to use regexps
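
A small sketch of what re-entrancy buys you: the code block in the outer pattern calls an ordinary function that runs its own regexp while the outer match is still in progress. The names `is_keyword`, `@keywords` and `$source` are illustrative, and the sketch assumes a perl recent enough to have a re-entrant engine with reliable lexical closures in code blocks (5.18 or later).

```perl
use strict;
use warnings;

# An ordinary function that happens to use a regexp.
sub is_keyword {
    my ($word) = @_;
    return $word =~ /\A(?:if|else|while)\z/ ? 1 : 0;   # inner match
}

my @keywords;
my $source = 'if x else y';

# The code block runs inside the outer match and calls is_keyword(),
# which re-enters the regexp engine. $1 is copied first so the inner
# match cannot interfere with it.
while ($source =~ /(\w+)(?{ my $word = $1; push @keywords, $word if is_keyword($word) })/g) {
    # nothing to do here: the code block does the work
}

print "@keywords\n";
```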

The C curse

Historically, perl has always tried to terminate its strings with nul bytes, like C
However, perl also lets you store nul bytes in your strings so length is stored separately
Interacting with unix things like path-names is more efficient if the string is known to be nul-terminated and to not contain nul-bytes
Before perl 5.18, the regexp engine would sometimes read in the extra, optional nul byte which is normally OK except...

Virtual memory page permissions

It is common to use perl strings to point to memory not managed by perl
For example, with memory-mapped files. However, the OS can map the file anywhere in memory and the adjacent pages aren't guaranteed to be readable
If the file mapped is an exact multiple of VM page size, and your OS has chosen a random address for the mapping, and you run certain regexps against the string, then segfault 💣

Gap pages

To demonstrate this, nothing special is needed on OpenBSD: mappings are random and separated by gap pages
If your OS allocates them at decreasing, adjacent addresses (i.e. Linux without PaX patches), you can create a gap page yourself:

use File::Map qw/map_file protect PROT_NONE/;

map_file(my $gap_page, "/dev/zero", "<", 0, 4096);
protect($gap_page, PROT_NONE);

map_file(my $string, "/dev/zero", "<", 0, 4096);

$string =~ /.$/; ## boom

Questions?

This presentation:

http://hoytech.github.io/regexp-presentation/

Regexp::Debugger — Regexp::Grammars
Regexp::Assemble — Regexp::Exhaustive
Bio::Regexp — Text::Boustrophedon
Text::Bidi — Text::WrapI18N
Text::Unidecode — PerlIO::via::Unidecode