regex
snippet.juliarepl
julia> pkgchk( [ "julia" => v"1.0.2" ] )

Regular Expressions

The Julia manual explains regexes in its Strings chapter. A good Julia-specific introduction to regexes is in Regular_expressions.

Julia regexes are Perl-compatible: they are implemented with the Perl Compatible Regular Expressions (PCRE) library. Regexes are powerful and can therefore become quite involved, but they are well worth learning if text processing is important. There are many Perl regex tutorials and examples on the web, such as the Perl Regex Tutorial and the Perl regular expressions “cheatsheet”.

This chapter focuses on the use of regular expressions within Julia, rather than on explaining regular expressions themselves. It thus assumes that the reader already understands Perl regexes.

In Julia, regular expressions are of type Regex. A regex literal looks like a string preceded by r (e.g., r"a."M, where M stands for optional modifier flags such as i, m, s, or x); a regex can also be built from an ordinary string with Regex("a."). Regexes are compiled internally to speed up matching. (For search-and-replace, the substitution string is written with a leading s, as in s"\1".)

snippet.juliarepl
julia> typeof( r"x" ), typeof( s"x" )
(Regex, SubstitutionString{String})

By default, julia matches across multiple lines and replaces are global.
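
As a small sketch of the two construction forms and of an optional modifier (the sample string is made up for illustration): the i flag requests case-insensitive matching, and Regex(pattern, flags) builds the same regex from ordinary strings, which may be handy when the pattern is only known at run time.

```julia
# Two ways to build the same case-insensitive regex.
re_literal = r"a."i              # literal form with the i modifier
re_built   = Regex("a.", "i")    # constructor form: pattern string plus flag string

# Both match "Ab" because of the i flag; the unflagged regex does not.
println(occursin(re_literal, "xAby"))   # true
println(occursin(re_built,   "xAby"))   # true
println(occursin(r"a.",      "xAby"))   # false
```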

Is a Regex Valid?

snippet.juliarepl
julia> isregex(r::String)= try (Regex(r)!=nothing); catch; false; end#try#
isregex (generic function with 1 method)

julia> isregex.( ["^x", "((()" ])        ## first expression is valid regex, second is not
2-element BitArray{1}:
  true
 false

Testing Strings For Matches

snippet.juliarepl
julia> occursin( r"a[0-9]" , "a1 b2 a3 b4 a55 az A6 a7" )           ## at least one match
true

julia> occursin.( Ref(r"a[0-9]") , [ "ab", "  a9", "99", "   ac " ] )    ## note the dot after the function name (broadcasting).  Ref makes the regex act as a scalar.
4-element BitArray{1}:
 false
  true
 false
 false

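Warning: occursin also accepts an offset at which the search starts; in Julia 1.0 it is passed as the keyword argument offset rather than as a third positional argument.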
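
Because occursin returns a Bool, it can also serve as the predicate for filter. The following is only a sketch with made-up candidate strings; it keeps the strings that contain a lowercase a followed by a digit.

```julia
# Keep only the strings in which the regex matches somewhere.
candidates = ["ab", "  a9", "99", "   ac ", "a1b"]
hits = filter(s -> occursin(r"a[0-9]", s), candidates)
println(hits)
```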

Number of Matches

snippet.juliarepl
julia> matchall(r::Regex, s::AbstractString; overlap::Bool=false)= collect(( m.match for m=eachmatch(r,s,overlap=overlap) ));

julia> length(matchall( r"a([0-9])", "a1 b2 a3 b4 a55 az A6 a7" ))
4
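
If only the number of matches is needed, the matches do not have to be collected into an array first. A minimal sketch, assuming that counting directly over the eachmatch iterator is acceptable:

```julia
s = "a1 b2 a3 b4 a55 az A6 a7"

# count(f, itr) counts the elements for which f returns true;
# the predicate here accepts every match.
nmatches = count(m -> true, eachmatch(r"a([0-9])", s))
println(nmatches)
```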

Finding Matches in Strings

The First Match (Optionally From a Start Index)

snippet.juliarepl
julia> match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7")          ## search starts at index 1 (the default)
RegexMatch("a1", 1="1")

julia> m= match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" , 2)   ## third argument is the start index, not a match number: searching from 2 skips a1
RegexMatch("a3", 1="3")

julia> match( r"nada" , "a1 b2 a3 b4 a55 az A6 a7") == nothing     ## no match found
true
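
The optional third argument of match is an index at which the search starts, so it does not select a match by number. One way to obtain the n-th match is to walk the eachmatch iterator. The helper name nthmatch below is hypothetical (it is not part of Julia), and this is only a sketch:

```julia
# Hypothetical helper: return the n-th non-overlapping match, or nothing.
function nthmatch(r::Regex, s::AbstractString, n::Integer)
    for (i, m) in enumerate(eachmatch(r, s))
        i == n && return m
    end
    return nothing    # fewer than n matches
end

s = "a1 b2 a3 b4 a55 az A6 a7"
println(nthmatch(r"a([0-9])", s, 3).match)         # third match
println(nthmatch(r"a([0-9])", s, 9) === nothing)   # there is no ninth match
```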

Decoding Match Information

snippet.juliarepl
julia> matchall(r::Regex,s::AbstractString; overlap::Bool=false)= collect((m.match for m=eachmatch(r, s, overlap=overlap)));

julia> m= match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" , 2);  ## search from index 2: finds a3, the second match in this string

julia> m.match
"a3"

julia> m.offset
7

julia> m.captures                                                ## the captured content
1-element Array{Union{Nothing, SubString{String}},1}:
 "3"

julia> m.offsets                                                 ## start index of each capture; here ([0-9]) captured at index 8
1-element Array{Int64,1}:
 8

julia> m.regex
r"a([0-9])"

Finding All Matches

snippet.juliarepl
julia> matchall(r::Regex,s::AbstractString; overlap::Bool=false)= collect((m.match for m=eachmatch(r, s, overlap=overlap)));

julia> match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" , 2)		## first match at or after index 2
RegexMatch("a3", 1="3")

julia> matchall( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" )
4-element Array{SubString{String},1}:
 "a1"
 "a3"
 "a5"
 "a7"

Each element of the result is a SubString, whose fields (e.g., m[2].offset) give more information:

snippet.juliarepl
julia> matchall(r::Regex,s::AbstractString; overlap::Bool=false)= collect((m.match for m=eachmatch(r, s, overlap=overlap)));

julia> m= matchall( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" );

julia> dump(m[2])
SubString{String}
  string: String "a1 b2 a3 b4 a55 az A6 a7"
  offset: Int64 6
  ncodeunits: Int64 2

Index Locations of All Matches

snippet.juliarepl
julia> for i in eachmatch( r"a([0-9])", "a1 b2 a3 b4 a55 az A6 a7"); println(i.offset, ": ", i.match); end#for#
1: a1
7: a3
13: a5
23: a7
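
When only the index range of a match is needed, findfirst and findnext may be more direct than building RegexMatch objects. This is a sketch on the same sample string; both functions return a range of indices, or nothing when there is no match.

```julia
s = "a1 b2 a3 b4 a55 az A6 a7"

r1 = findfirst(r"a[0-9]", s)               # index range of the first match
println(r1, " => ", s[r1])

# findnext continues the search from a given index.
r2 = findnext(r"a[0-9]", s, last(r1) + 1)
println(r2, " => ", s[r2])
```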

Matches With Line or Word Number

snippet.juliarepl
julia> for (n,l) in enumerate(split( "ab\ncd\nas\ncd\nmore\ncd\nend", "\n"))
           occursin( r".d" , l ) && println("L", n, ": ", l)
       end##for##
L2: cd
L4: cd
L6: cd
L7: end
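
The same idea works for word numbers: split on whitespace instead of on newlines. A minimal sketch, using a made-up one-line version of the text above:

```julia
text = "ab cd as cd more cd end"

# split(text) with no delimiter splits on whitespace; enumerate numbers the words.
hits = [(n, w) for (n, w) in enumerate(split(text)) if occursin(r".d", w)]

for (n, w) in hits
    println("W", n, ": ", w)
end
```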

Multiple Captures

snippet.juliarepl
julia> m= match( r"a.*?(\w)(\w)(\w) ", "no abcdef andmore azalias end" )  ## 3 captures requested
RegexMatch("abcdef ", 1="d", 2="e", 3="f")


julia> m.offset  ## start of match
4

julia> m.captures
3-element Array{Union{Nothing, SubString{String}},1}:
 "d"
 "e"
 "f"

julia> m.offsets
3-element Array{Int64,1}:
 7
 8
 9
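
Captures can also be named with (?<name>...), and a RegexMatch can then be indexed by name as well as by number. The pattern and the sample string below are made up for illustration:

```julia
m = match(r"(?<hour>\d+):(?<minute>\d+)", "meet at 12:45 today")

println(m.match)       # the whole match
println(m[:hour])      # capture looked up by Symbol
println(m["minute"])   # capture looked up by String
println(m[1])          # named groups are still numbered
```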

Search-and-Replace in Strings

Unlike Perl's, Julia's replace is global by default; the count keyword argument limits the number of replacements.

Plain Regex Search-and-Replace

If the replacement (the right-hand side of the pair) is an s"..." substitution string, it can refer to the captured text, e.g., as \1:

snippet.juliarepl
julia> replace( "aabbccddaaabbbcccdddaabbccdd", r"(b+)" => s" [\1] ", count=2)
"aa [bb] ccddaaa [bbb] cccdddaabbccdd"

Alternative ways to refer to captures are \g<0> (the whole match) and named captures (r"(?<name>[a-z])", s"\g<name>").
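
As a sketch of named captures in a substitution string (the group names first and last and the sample text are made up), the following swaps two words:

```julia
# \g<name> in the s"..." string refers to the capture group with that name.
swapped = replace("john smith", r"(?<first>\w+) (?<last>\w+)" => s"\g<last>, \g<first>")
println(swapped)
```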

Call Function Based on Match for Replacement

If the replacement is a function that takes a string, it is called with each matching text, and its result replaces that text:

snippet.juliarepl
julia> replace( "aabbccddaaabbbcccdddaabbccdd", r"(b+)" => uppercase )
"aaBBccddaaaBBBcccdddaaBBccdd"

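The replacement function can also be an anonymous function. The following sketch, with a made-up input string, doubles every run of digits; the function receives the matched text and must return something that can be written as text.

```julia
# Each run of digits is parsed, doubled, and written back as a string.
doubled = replace("a1 b22 c333", r"[0-9]+" => (d -> string(2 * parse(Int, d))))
println(doubled)
```
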
Example: Hex escaping some URL Characters

snippet.juliarepl
julia> hex(a::Char; kwargs...)= string( UInt32(a); base=16, kwargs... );

julia> replace(  "a & b & c" , r"[^a-z0-9\.]" => (s -> "%"*hex(s[1])))
"a%20%26%20b%20%26%20c"

Various Useful Patterns (Regexes)

Recursive Matching Parens

snippet.juliarepl
julia> matchall(r::Regex,s::AbstractString; overlap::Bool=false)= collect((m.match for m= eachmatch(r, s, overlap=overlap)));

julia> matchall( r"(\((?>[^()]|(?R))*\))",  "ab ( cd (ef) gh ) ( blah )")
2-element Array{SubString{String},1}:
 "( cd (ef) gh )"
 "( blah )"

Others

Purpose                                  Regex                            Succeeds                 Fails
isalpha (only word characters)           r"^\w*$"                         “ab”                     “ab cd”
isnonalpha (only non-word characters)    r"^\W*$"                         “ ”                      “ab cd”
isalnum (only alnum characters)          r"^[\w0-9]*$"                    “ab1”                    “ab1 cd”
integer                                  r"^\s*[+-]?[0-9]+\s*$"           “ +12”                   “12.2”
URL matching                             here
email matching                           here
leading/trailing spaces                  r"^\s*(.*?)\s*$", s"\1"
matching parens                          r"(\((?>[^()]|(?R))*\))"         “(ab (cd) ef) ( gh )”    “(ab ”
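
Patterns like these are typically used with occursin (for tests) or replace (for clean-up). The sketch below wraps the integer pattern in a hypothetical helper named isintegerstring and strips surrounding whitespace with a lazy-capture variant of the whitespace pattern; both names and the sample strings are made up for illustration.

```julia
# Hypothetical helper: does the string hold an optionally signed integer?
isintegerstring(s) = occursin(r"^\s*[+-]?[0-9]+\s*$", s)

println(isintegerstring(" +12"))   # true
println(isintegerstring("12.2"))   # false

# The lazy (.*?) leaves the trailing whitespace to the final \s*.
trimmed = replace("   padded text  ", r"^\s*(.*?)\s*$" => s"\1")
println(repr(trimmed))
```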

Backmatter

Useful Packages on Julia Repository

Notes

References

regex.txt · Last modified: 2018/11/22 20:48 (external edit)