Getting the most out of regular expressions


Doug Hoyte

Regexp::Debugger

Regexp::Assemble - Demo

Regexp::Assemble - Implementation

Regexp::Grammars

Regexp::Grammars - Parse email addresses

    my $rfc5322 = qr{
      # Match this...
      <address>

      # As defined by these...
      <token: address>         <mailbox> | <group>
      <token: mailbox>         <name_addr> | <addr_spec>
      <token: name_addr>       <display_name>? <angle_addr>
      <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
      <token: display_name>    <phrase>
      <token: mailbox_list>    <[mailbox]> ** (,)
      <token: addr_spec>       <local_part> \@ <domain>
      ...
    }x;
  

Regexp::Grammars - Demo

Regexp::Exhaustive

Regexp::Exhaustive - Implementation

Regexp::Exhaustive - Factorisation

Bio::Regexp

Bio::Regexp - IUPAC codes

Bio::Regexp - Single pass scans

Bio::Regexp - Circular inputs

Unicode Properties

Greek text doesn't round-trip through case conversions (the final sigma form is lost):
$ perl -CAS -E 'use utf8; say lc(uc("Θησέας"))'
θησέασ
But we can use a simple regexp to correct this:
sub greekify { $_[0] =~ s/(?<=\p{Greek})σ\b/ς/gr }
This puts back the final sigma form in Greek text:
greekify(lc(uc("Θησέας")))
=> θησέας
Without breaking other uses of sigma:
greekify(lc(uc("The sample is 5σ from the mean")))
=> the sample is 5σ from the mean
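To see why the \p{Greek} lookbehind matters, here is a small sketch that compares greekify with a naive variant that has no script check. The name naive_final_sigma is made up for this illustration.

```perl
use strict;
use warnings;
use utf8;                               # this source contains literal Greek
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical naive variant: rewrite any word-final sigma, no script check.
sub naive_final_sigma { $_[0] =~ s/σ\b/ς/gr }

# The version from the text: only rewrite a sigma preceded by a Greek character.
sub greekify { $_[0] =~ s/(?<=\p{Greek})σ\b/ς/gr }

my $stat = lc(uc("The sample is 5σ from the mean"));

print naive_final_sigma($stat), "\n";   # the lone sigma after "5" gets rewritten
print greekify($stat), "\n";            # the lone sigma is left alone
```

The digit before the sigma is not in the Greek script, so the lookbehind fails there and the statistical sigma survives.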

Text::Unidecode

Russian: Good-bye
unidecode("До свидания")
  => Do svidaniia
Thai: Tom Yum (the soup)
unidecode("ต้มยำ")
  => tmyam
Arabic: Good morning ("Sabah el kheer")
unidecode("صباح الخير")
  => SbH lkhyr 

With PerlIO::via::Unidecode you can transliterate while reading a file:
open(my $fh, "<:encoding(utf8):via(Unidecode)", $file)

Annotate_unidecode

sub annotate_unidecode {
  my $val = shift;

  $val =~ s{ ( \P{ASCII} \P{Latin}* ) }
           {
             my $match = $1;
             "$match (" . unidecode($match) . ")"
           }egx;

  return $val;
}

Useful for form data (and mixed-script text):
annotate_unidecode("Name: 许勤   City: 深圳")

  => Name: 许勤 (Xu Qin)   City: 深圳 (Shen Zhen)
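
One detail worth checking is which spans a pattern like \P{ASCII} \P{Latin}* actually captures. Spaces are not in the Latin script, so they are swept into the span along with the Han characters. The sketch below lists the spans using only core perl, without calling unidecode. The second pattern is a hypothetical tighter variant, not something from the original code: it requires a span to end on a character that is neither whitespace nor Latin.

```perl
use strict;
use warnings;
use utf8;                               # this source contains literal Han text
binmode(STDOUT, ':encoding(UTF-8)');

my $input = "Name: 许勤   City: 深圳";

# Spans captured by the pattern used in annotate_unidecode.
my @spans = $input =~ / ( \P{ASCII} \P{Latin}* ) /gx;

# Hypothetical tighter variant: the span must end on a non-space, non-Latin character.
my @tight = $input =~ / ( \P{ASCII} (?: \P{Latin}* [^\s\p{Latin}] )? ) /gx;

print "original: [$_]\n" for @spans;    # brackets make trailing spaces visible
print "tighter:  [$_]\n" for @tight;
```

With the original pattern the first span carries the three spaces that follow it, so the annotated output may differ in spacing from the idealised result shown above.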

Bi-Directional Writing

  • Perl has bi-directional language support through modules such as Text::Bidi and Text::WrapI18N
  • In pre-Hellenistic Greek, it was common to write in alternating directions: Boustrophedon ("as the ox turns")

Text::Boustrophedon

Most useless module ever, of course done with crazy regexp hacks
😁 😁 😁

$ perl -CAS -MText::Boustrophedon \
       -E 'undef $/; say Text::Boustrophedon::greek(<>)' \
       < input.txt

Achilles glared at him and answered, "Fool, prate not 
       ƨƚnɒnɘvoɔ on ɘd nɒɔ ɘɿɘʜT .ƨƚnɒnɘvoɔ ƚuodɒ ɘm oƚ
between lions and men, wolves and lambs can never be of
     .ʜǫuoɿʜƚ bnɒ ʜǫuoɿʜƚ ɿɘʜƚo ʜɔɒɘ ɘƚɒʜ ƚud ,bnim ɘno
Therefore there can be no understanding between you and
ɘno lliƚ ,ƨu nɘɘwƚɘd ƨƚnɒnɘvoɔ ynɒ ɘd ɘɿɘʜƚ yɒm ɿon ,ɘm
or other shall fall and glut grim Ares with his life's 
                                                ."boold
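
The core of the trick can be sketched with a single substitution. This is not the module's code: reverse_alternate_lines is a made-up helper that only reverses every second line, leaving out the glyph mirroring and the right alignment seen in the output above.

```perl
use strict;
use warnings;

# Hypothetical helper, not Text::Boustrophedon itself:
# reverse the characters of every second line.
sub reverse_alternate_lines {
    my ($text) = @_;
    my $line_no = 0;
    # /m makes ^ and $ match at each line, /e evaluates the replacement as code,
    # /r returns the modified copy instead of changing $text in place.
    return $text =~ s{^(.*)$}{ $line_no++ % 2 ? scalar reverse($1) : $1 }gmer;
}

print reverse_alternate_lines("first line\nsecond line\nthird line"), "\n";
```

Note that reverse works character by character, so text with combining marks would need extra care.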

Re-Entrant Regexp Engine

  • Re-entrancy is when a function is invoked while already inside that function
  • Different from thread safety (compare threads vs Unix signals)
  • Before perl 5.14, the engine was not fully re-entrant:
    $ perl -e 'print "$]\n"'
    5.008009
    $ perl -e '"X" =~ m{(??{ "X" =~ m{} })}'
    Segmentation fault
    
  • Running code from inside the regexp engine is quite useful. It's used by many of the modules discussed: Regexp::Grammars, Regexp::Assemble, Regexp::Exhaustive, Bio::Regexp
  • Real perl code needs to use regexps
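
On a recent perl, the same kind of nesting simply works. A minimal sketch: the postponed subexpression (??{ ... }) runs perl code in the middle of a match, that code runs a second match, and its result decides which pattern is spliced in.

```perl
use strict;
use warnings;

# The code inside (??{ ... }) performs its own regexp match while the
# outer match is still in progress, so the engine is entered re-entrantly.
my $matched = "X" =~ m{ \A (??{ "X" =~ m{X} ? "X" : "Y" }) \z }x;

print $matched ? "matched\n" : "no match\n";
```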

The C curse

  • Historically, perl has always tried to terminate its strings with nul bytes, like C
  • However, perl also lets you store nul bytes in your strings, so the length is stored separately
  • Interacting with Unix things like pathnames is more efficient if the string is known to be nul-terminated and to not contain nul bytes
  • Before perl 5.18, the regexp engine would sometimes read in the extra, optional nul byte, which is normally OK except...
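
As a quick check of the second point, a short sketch showing that a nul byte inside a perl string does not end it:

```perl
use strict;
use warnings;

my $s = "foo\0bar";                          # nul byte embedded in the middle

print length($s), "\n";                      # length is stored, not found by scanning for nul
print index($s, "\0"), "\n";                 # position of the embedded nul
print "tail matched\n" if $s =~ /\0bar\z/;   # regexps see the characters after the nul
```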

Virtual memory page permissions

  • It is common to use perl strings to point to memory not managed by perl
  • For example, with memory-mapped files. However, the OS can map the file anywhere in memory, and the adjacent pages aren't guaranteed to be readable
  • If the file mapped is an exact multiple of VM page size, and your OS has chosen a random address for the mapping, and you run certain regexps against the string, then segfault 💣

Gap pages

  • To demonstrate this, nothing special is needed on OpenBSD: mappings are random and separated by gap pages
  • If your OS allocates them in decreasing, adjacent addresses (i.e. Linux without PaX patches), you can create a gap page yourself:

use File::Map qw/map_file protect PROT_NONE/;

map_file(my $gap_page, "/dev/zero", "<", 0, 4096);
protect($gap_page, PROT_NONE);

map_file(my $string, "/dev/zero", "<", 0, 4096);

$string =~ /.$/; ## boom