regex

regex [2018/12/27 13:27] (current)
```juliarepl
julia> pkgchk.( [ "julia" => v"1.0.3" ] );
```

# Regular Expressions

The Julia manual explains regexes in [Strings](https://docs.julialang.org/en/stable/manual/strings/#Regular-Expressions-1). A good Julia-specific introduction to regexes is in [Regular_expressions](https://en.wikibooks.org/wiki/Introducing_Julia/Strings_and_characters).

Julia regexes follow Perl syntax and are provided by the [Perl Compatible Regular Expressions](http://www.pcre.org/) (PCRE) library. Regexes are powerful, and can therefore become quite involved. They are worth learning if text processing is important to you. There are many Perl regex tutorials and examples on the web, such as the [Perl Regex Tutorial](https://perldoc.perl.org/perlretut.html) and the [Regular expressions](http://jkorpela.fi/perl/regexp.html) section of a Perl "cheatsheet".

This chapter focuses on the use of regular expressions within Julia, rather than on explaining regular expressions themselves. It thus assumes that the reader already understands Perl regexes.

In Julia, regular expressions are of type `Regex`. A regex literal looks like a string preceded by `r` (e.g., `r"a."M`, where `M` stands for optional modifier flags such as `i`, `m`, `s`, or `x`). A regex can also be constructed from a string with `Regex("a.")`. Regexes are compiled internally to speed up matching. (For search-and-replace, the substitution string is written with a leading `s`, as in `s"\1"`.)

```juliarepl
julia> typeof( r"x" ), typeof( s"x" )
(Regex, SubstitutionString{String})
```

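By default, a Julia regex searches the entire string, including text after embedded newlines. However, `^` and `$` anchor to the whole string rather than to each line unless the `m` flag is given, and `.` does not match a newline unless the `s` flag is given. `replace` is global: it substitutes every match unless the `count` keyword limits it.
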
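The defaults can be checked with a short script. This is only a sketch; the sample string `text` is made up for illustration, while the `m` and `s` flags and the `count` keyword are standard Julia.

```julia
text = "first line\nsecond line"

# The search itself covers the whole string, including the second line.
println(occursin(r"second", text))     # true

# Without flags, ^ anchors only at the start of the whole string.
println(occursin(r"^second", text))    # false

# The m flag lets ^ and $ also match at the start and end of each line.
println(occursin(r"^second"m, text))   # true

# Without the s flag, . does not match a newline.
println(occursin(r"line.second", text))    # false
println(occursin(r"line.second"s, text))   # true

# replace substitutes every match unless count limits it.
all_replaced   = replace(text, r"line" => "row")
first_replaced = replace(text, r"line" => "row", count=1)
println(all_replaced)
println(first_replaced)
```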

## Is a Regex Valid?

```juliarepl
julia> isregex(r::String)= try (Regex(r)!=nothing); catch; false; end#try#
isregex (generic function with 1 method)

julia> isregex.( ["^x", "((()" ])        ## first expression is valid regex, second is not
2-element BitArray{1}:
  true
 false
```

## Testing Strings For Matches

```juliarepl
julia> occursin( r"a[0-9]" , "a1 b2 a3 b4 a55 az A6 a7" )           ## at least one match
true

julia> occursin.( Ref(r"a[0-9]") , [ "ab", "  a9", "99", "   ac " ] )    ## the dot broadcasts over the strings; Ref makes the regex act as a scalar
4-element BitArray{1}:
 false
  true
 false
 false
```

Warning: `occursin` also accepts an `offset` keyword argument, which makes the search start that many code units into the string.

## Number of Matches

```juliarepl
julia> matchall(r::Regex, s::AbstractString; overlap::Bool=false)= collect(( m.match for m=eachmatch(r,s,overlap=overlap) ));

julia> length(matchall( r"a([0-9])", "a1 b2 a3 b4 a55 az A6 a7" ))
4
```

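If only the count is needed, the matches do not have to be stored in an array first. The following is a minimal sketch; `countmatches` is a hypothetical helper name, not a Base function. It also shows the effect of the `overlap` keyword of `eachmatch`.

```julia
# Hypothetical helper: count matches by iterating over them, without collecting.
countmatches(r::Regex, s::AbstractString; overlap::Bool=false) =
    count(m -> true, eachmatch(r, s, overlap=overlap))

n_plain   = countmatches(r"a([0-9])", "a1 b2 a3 b4 a55 az A6 a7")
n_disjoin = countmatches(r"aba", "ababa")                 # matches may not share characters
n_overlap = countmatches(r"aba", "ababa", overlap=true)   # matches may share characters

println(n_plain, " ", n_disjoin, " ", n_overlap)
```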

## Finding Matches in Strings

### The First Match (Optionally From a Start Index)

The optional third argument of `match` is the index at which the search starts, not a match number. Starting at index 2 skips `a1`, so the first match found is `a3`.

```juliarepl
julia> match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7")          ## the search starts at index 1 by default
RegexMatch("a1", 1="1")

julia> m= match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" , 2)   ## start searching at index 2
RegexMatch("a3", 1="3")

julia> match( r"nada" , "a1 b2 a3 b4 a55 az A6 a7") === nothing     ## no match found
true
```

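Because the third argument of `match` is a start index, asking for the n-th match takes a little extra work. One possible sketch iterates over `eachmatch` and stops at the requested position; `nthmatch` is a hypothetical helper name, not part of Base.

```julia
# Hypothetical helper: return the n-th match of r in s,
# or nothing if s contains fewer than n matches.
function nthmatch(r::Regex, s::AbstractString, n::Integer)
    n >= 1 || throw(ArgumentError("n must be at least 1"))
    for (i, m) in enumerate(eachmatch(r, s))
        i == n && return m
    end
    return nothing
end

s = "a1 b2 a3 b4 a55 az A6 a7"

third   = nthmatch(r"a([0-9])", s, 3)   # the third match
fromidx = match(r"a([0-9])", s, 3)      # the first match at or after index 3

println(third.match, " at index ", third.offset)
println(fromidx.match, " at index ", fromidx.offset)
```

The two calls give different results here, which illustrates that a start index and a match number are not interchangeable.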

#### Decoding Match Information

```juliarepl
julia> m= match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" , 2);  ## first match at or after index 2

julia> m.match
"a3"

julia> m.offset                                                  ## start index of the match
7

julia> m.captures                                                ## the captured content
1-element Array{Union{Nothing, SubString{String}},1}:
 "3"

julia> m.offsets                                                 ## start index of each capture; ([0-9]) matched at index 8
1-element Array{Int64,1}:
 8

julia> m.regex
r"a([0-9])"
```

### Finding All Matches

```juliarepl
julia> matchall(r::Regex,s::AbstractString; overlap::Bool=false)= collect((m.match for m=eachmatch(r, s, overlap=overlap)));

julia> match( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" , 2) ## a single match, searching from index 2
RegexMatch("a3", 1="3")

julia> matchall( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" )
4-element Array{SubString{String},1}:
 "a1"
 "a3"
 "a5"
 "a7"
```

`matchall` returns `SubString`s rather than `RegexMatch` objects. Each `SubString` still records where it sits in the original string, e.g., via `m[2].offset`. This field is zero-based: it counts the code units that precede the substring, so the match that starts at index 7 has offset 6:

```juliarepl
julia> matchall(r::Regex,s::AbstractString; overlap::Bool=false)= collect((m.match for m=eachmatch(r, s, overlap=overlap)));

julia> m= matchall( r"a([0-9])" , "a1 b2 a3 b4 a55 az A6 a7" );

julia> dump(m[2])
SubString{String}
  string: String "a1 b2 a3 b4 a55 az A6 a7"
  offset: Int64 6
  ncodeunits: Int64 2
```

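When the captures or the one-based start index of every match are needed, it may be simpler to collect the `RegexMatch` objects themselves instead of only their matched text. A minimal sketch (the variable names `ms` and `rows` are chosen only for illustration):

```julia
s  = "a1 b2 a3 b4 a55 az A6 a7"
ms = collect(eachmatch(r"a([0-9])", s))   # a vector of RegexMatch objects

# One-based start index, matched text, and first capture of every match.
rows = [(m.offset, m.match, m.captures[1]) for m in ms]

for (offset, text, digit) in rows
    println(offset, ": ", text, " (captured ", digit, ")")
end
```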

### Index Locations of All Matches

```juliarepl
julia> for i in eachmatch( r"a([0-9])", "a1 b2 a3 b4 a55 az A6 a7"); println(i.offset, ": ", i.match); end#for#
1: a1
7: a3
13: a5
23: a7
```

### Matches With Line or Word Number

```juliarepl
julia> for (n,l) in enumerate(split( "ab\ncd\nas\ncd\nmore\ncd\nend", "\n"))
           occursin( r".d" , l ) && println("L", n, ": ", l)
       end##for##
L2: cd
L4: cd
L6: cd
L7: end
```

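The same idea works for word numbers: split on whitespace instead of on newlines. This is a sketch with a made-up sample string; `hits` is just an illustrative variable name.

```julia
text = "ab cd as cd more cd end"

# split(text) with no delimiter splits on runs of whitespace.
hits = [(n, w) for (n, w) in enumerate(split(text)) if occursin(r".d", w)]

for (n, w) in hits
    println("W", n, ": ", w)
end
```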

### Multiple Captures

```juliarepl
julia> m= match( r"a.*?(\w)(\w)(\w) ", "no abcdef andmore azalias end" )  ## 3 captures requested
RegexMatch("abcdef ", 1="d", 2="e", 3="f")

julia> m.offset  ## start of match
4

julia> m.captures
3-element Array{Union{Nothing, SubString{String}},1}:
 "d"
 "e"
 "f"

julia> m.offsets
3-element Array{Int64,1}:
 7
 8
 9
```

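With several captures, positional indices become hard to read. Named groups are one alternative; a `RegexMatch` can be indexed by group number, by name as a string, or by name as a symbol. The pattern and the input `"width=80"` below are made up for illustration.

```julia
m = match(r"(?<key>\w+)=(?<value>\d+)", "width=80")

key   = m["key"]      # index by group name as a string
value = m[:value]     # index by group name as a symbol
first_group = m[1]    # index by group number still works

println(key, " -> ", value)
```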

## Search-and-Replace in Strings

Unlike Perl's substitution, Julia's `replace` is *global* by default. The trailing `count` keyword argument sets the maximum number of replacements.

### Plain Regex Search-and-Replace

If the replacement (the right-hand side of the `pattern => replacement` pair) is an `s"..."` substitution string, it can refer to captured text, e.g., as `\1`:

```juliarepl
julia> replace( "aabbccddaaabbbcccdddaabbccdd", r"(b+)" => s" [\1] ", count=2)
"aa [bb] ccddaaa [bbb] cccdddaabbccdd"
```

The whole match is available as `\0` or `\g<0>`, and named captures can be referenced by name (`r"(?<name>[a-z])" => s"\g<name>"`).

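As a small illustration of named references in a substitution string, the sketch below reorders the parts of a date. The group names `y`, `m`, and `d` and the sample inputs are assumptions made for this example.

```julia
# Reorder year-month-day into day/month/year using named groups.
reordered = replace("2018-12-27",
                    r"(?<y>\d+)-(?<m>\d+)-(?<d>\d+)" => s"\g<d>/\g<m>/\g<y>")

# \0 stands for the whole match.
bracketed = replace("abc", r"b" => s"[\0]")

println(reordered)
println(bracketed)
```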

### Call Function Based on Match for Replacement

If the replacement is a function, it is called with each matching piece of text, and its (string) result replaces that text:

```juliarepl
julia> replace( "aabbccddaaabbbcccdddaabbccdd", r"(b+)" => uppercase )
"aaBBccddaaaBBBcccdddaaBBccdd"
```

#### Example: Hex Escaping Some URL Characters

```juliarepl
julia> hex(a::Char; kwargs...)= string( UInt32(a); base=16, kwargs... );

julia> replace(  "a & b & c" , r"[^a-z0-9\.]" => (s -> "%"*hex(s[1])))
"a%20%26%20b%20%26%20c"
```

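The `hex` example above writes the code point of each character, so it produces a single hex digit for code points below 16 and does not split non-ASCII characters into bytes. A somewhat more careful sketch encodes each UTF-8 byte with two uppercase hex digits. The name `pctencode` and the choice to leave only letters, digits, `.`, `_`, `~`, and `-` unescaped are assumptions of this sketch, not something the page prescribes.

```julia
# Hypothetical helper: percent-encode everything except a small safe set,
# one %XX triplet per UTF-8 byte of each matched character.
function pctencode(s::AbstractString)
    encode(m) = join("%" * uppercase(string(b, base=16, pad=2)) for b in codeunits(m))
    return replace(s, r"[^A-Za-z0-9._~-]" => encode)
end

println(pctencode("a & b & c"))
println(pctencode("caf\u00e9"))   # the last character occupies two bytes in UTF-8
```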

# Various Useful Patterns (Regexes)

## Recursive Matching Parens

```juliarepl
julia> matchall(r::Regex,s::AbstractString; overlap::Bool=false)= collect((m.match for m= eachmatch(r, s, overlap=overlap)));

julia> matchall( r"(\((?>[^()]|(?R))*\))",  "ab ( cd (ef) gh ) ( blah )")
2-element Array{SubString{String},1}:
 "( cd (ef) gh )"
 "( blah )"
```

## Others

^ Purpose  ^ Regex  ^ Succeeds  ^ Fails  ^
| isalpha (only word characters: letters, digits, underscore) | `r"^\w*$"`  | "ab" | "ab cd" |
| isnonalpha (only non-word characters) | `r"^\W*$"`  | "  " | "ab cd" |
| isalnum (only alnum characters) | `r"^[[:alnum:]]*$"`  | "ab1" | "ab1 cd" |
| integer | `r"^\s*[+-]?[0-9]+\s*$"` | " +12" | "12.2" |
| URL matching | [here](http://www.perlmonks.org/?node_id=533586) | | |
| email matching | [here](http://emailregex.com/) | | |
| leading/trailing spaces | `r"^\s*(.*?)\s*$"`, `s"\1"` | | |
| matching parens | `r"(\((?>[^()]|(?R))*\))"` | "(ab (cd) ef) ( gh )" | "(ab " |

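Patterns like these are easy to get subtly wrong (for example, a whole-string test needs a trailing `$`, and trimming needs a non-greedy capture so that trailing spaces are not swallowed), so it can help to check each one against a string that should succeed and one that should fail. The following sketch does this with end-anchored versions of the patterns; the tuple layout and variable names are chosen only for illustration.

```julia
# (description, regex, string that should match, string that should not)
checks = [
    ("word characters only",     r"^\w*$",                 "ab",                  "ab cd"),
    ("non-word characters only", r"^\W*$",                 "  ",                  "ab cd"),
    ("alnum characters only",    r"^[[:alnum:]]*$",        "ab1",                 "ab1 cd"),
    ("integer",                  r"^\s*[+-]?[0-9]+\s*$",   " +12",                "12.2"),
    ("matching parens",          r"(\((?>[^()]|(?R))*\))", "(ab (cd) ef) ( gh )", "(ab "),
]

results = [(name, occursin(re, good), occursin(re, bad)) for (name, re, good, bad) in checks]

for (name, ok, notok) in results
    println(rpad(name, 26), " succeeds: ", ok, "   fails: ", !notok)
end

# Trimming: the non-greedy (.*?) leaves the trailing spaces to the final \s*.
trimmed = replace("  ab cd  ", r"^\s*(.*?)\s*$" => s"\1")
println("[", trimmed, "]")
```

For plain whitespace trimming, Base's `strip` does the same job without a regex.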
  
regex.txt · Last modified: 2018/12/27 13:27 (external edit)