Advanced Text Search and Replace using Python

08 Jan 2013

The last post was just a short introduction to using Python aimed at bioinformaticians. In this post we are going to cover one of the most useful topics in any kind of text processing: regular expressions. Perl is the “go-to” scripting language for this, but frankly, I hate Perl because of its very unreadable syntax. Python supports all of the nifty features of extended regular expressions too, so let’s give it a go.

Module Contents

Alright, first off let’s head over to the Python docs for the regular expression module. The module name is re.

What you are interested in is the first bit about all the special characters. These are your friends. They are how you create matching groups, use predefined character groups and do other great things. Doesn’t make sense yet, right? No problem, let’s scroll down a bit more and view the Module Contents. Here you can read through the main methods that you can use to do matching and such.

I’ll narrow it down for you by saying: only use re.match and re.search for now. Later on, when you get more cozy with regular expressions, you can try some of the other stuff.

The next thing you need to know about re.match and re.search is the difference between them. Don’t forget this!! re.match only matches at the very beginning of the text, whereas re.search looks for a match anywhere in the text.

Read a better description for this here.
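Here is a small sketch of that difference, using the same pony sentence that shows up later in this post. The variable names are just ones I picked for illustration.

```python
import re

text = 'pretty pink ponies are cute'

# re.match only succeeds if the pattern matches at position 0 of the text.
at_start = re.match('ponies', text)    # None, because the text starts with 'pretty'

# re.search scans forward and succeeds wherever the pattern first appears.
anywhere = re.search('ponies', text)   # a match object

print(at_start)
print(anywhere.group())

# Both functions succeed when the pattern really is at the beginning.
print(re.match('pretty', text).group())
print(re.search('pretty', text).group())
```

So if you are not sure where in the text your pattern will show up, re.search is the safer pick.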

Alright, so moving forward with the post we know that we can either match starting with the beginning of the text or anywhere in the text. Why have two methods that seem to do the same? Because they can! Don’t worry about it.

I’ll probably just stick to search only for this post.

Match Objects

Before we can get to the good stuff of searching, we need to know that any time you do a search or match it will return either a match object or None. If it returns None, it means that your regular expression pattern did not match anything. That is frustrating, and I’ll give you some tips on how to get past that later.

If it doesn’t return None, then it returns a match object that represents the stuff that was matched.
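A minimal sketch of handling both outcomes could look like this. The helper name first_match is hypothetical, something made up for this example and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern did not match.

    first_match is a made-up helper for illustration only.
    """
    m = re.search(pattern, text)
    if m is None:
        # No match: there is no match object to ask for groups.
        return None
    return m.group()

found = first_match('ponies are', 'pretty pink ponies are cute')
missing = first_match('unicorns', 'pretty pink ponies are cute')

print(found)
print(missing)
```

The point is simply that the None case gets checked before any method is called on the result.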

http://docs.python.org/2/library/re.html#match-objects

The things we are interested in from match objects are group, groups and groupdict. We will detail how to use these once we are on the road of matching in the next section.

Just know this about each.

  • group returns all of the text that was matched
  • groups returns a tuple of the text captured by each group in the pattern
  • groupdict returns a dictionary of the text captured by each named group, keyed by group name

Alright, first things first. Before you can use the regular expression module you need to import it. In order to import it you just simply use

>>> import re

Ok, now we are set to start the fun. We will start small and build big.

>>> m = re.search( 'ponies are', 'pretty pink ponies are cute' )
>>> m.group()
'ponies are'
>>> m.groups()
()
>>> m.groupdict()
{}

Notice that it searched through the text and found ‘ponies are’, which is what we were searching for. groups() and groupdict() were both empty because we didn’t group any text. Now, let’s just try a group for fun.

>>> m = re.search( 'pretty (pink) ponies (are) cute', 'pretty pink ponies are cute' )
>>> m.group()
'pretty pink ponies are cute'
>>> m.groups()
('pink', 'are')
>>> m.groupdict()
{}

Ok, now we are having some fun, right? Notice that group() again returned all the text that was matched. groups() now returned the text captured by each pair of parentheses, though. groupdict() was blank again because we have not defined a named group. That is next.

>>> m = re.search( 'pretty (?P<matchone>pink) ponies (?P<whatever>are) cute', 'pretty pink ponies are cute' )
>>> m.group()
'pretty pink ponies are cute'
>>> m.groups()
('pink', 'are')
>>> m.groupdict()
{'matchone': 'pink', 'whatever': 'are'}

Alright, there you go, you have the basics of regular expressions in Python.
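One more thing that is handy to know: group() also accepts a group number or a group name, so you can pull out a single captured piece instead of the whole tuple or dictionary. A short sketch with the same pattern as above (the variable names are mine):

```python
import re

m = re.search('pretty (?P<matchone>pink) ponies (?P<whatever>are) cute',
              'pretty pink ponies are cute')

whole = m.group(0)            # same as m.group(): the entire match
first = m.group(1)            # first group, counted by its opening parenthesis
named = m.group('whatever')   # a named group, looked up by name

print(whole)
print(first)
print(named)
```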

Little more advanced. Search a file

Sticking with the bioinformatics theme, let’s pretend we have a FASTA file (a text file representing some DNA sequences). Copy and paste this text into a file called myfasta.fna

>sequencename1 length=10 isitbacon=False
ATGCAAGGCA
>sequencename2 lengthyturkey=12 isitbacon1234=True
ATGCCCCCAAGG

Ok, so we have our fake file to play with. More than likely a real file would use plain length= and isitbacon= on both header lines. Haha, well, welcome to Science!!! You never know for sure what you are going to get. REGULAR EXPRESSIONS TO THE RESCUE!!

Really, who cares what the text in the file is; we just want to do some useful searching and maybe replacing. While in the same directory as the file you just created, fire up the Python interpreter

>>> # Start by importing the module
>>> import re
>>> # Let's read in our fasta file so we can play with it
>>> for line in open( 'myfasta.fna' ):
...     m = re.search( r'length\w*=(\w+)', line )
...     # Need to ensure there is a match object returned
...     if m:
...         m.groups()
('10',)
('12',)

Let’s digest this.

So we are looping through each line in the file, assigning the variable line the value of each line. Then we are performing a regular expression search on that line.

Breaking down the regular expression

Match the word length, followed by zero or more letters, digits or underscores (\w*), followed by an equal sign, followed by a group of one or more letters, digits or underscores ((\w+)).
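If it helps, the same breakdown can be written right into the pattern using the re.VERBOSE flag, which lets you spread a pattern over several lines and comment each piece. This is just a sketch; the name length_pattern is my own.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and everything
# after a # on a line is treated as a comment.
length_pattern = re.compile(r'''
    length      # the literal word length
    \w*         # zero or more letters, digits or underscores
    =           # a literal equal sign
    (\w+)       # group 1: one or more letters, digits or underscores
''', re.VERBOSE)

header = '>sequencename2 lengthyturkey=12 isitbacon1234=True'
value = length_pattern.search(header).group(1)
print(value)
```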

The if statement is needed because re.search returns None for any line the pattern does not match (that is, the sequence lines, which do not begin with a >).
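Since the title promises replacing too, here is a rough sketch of how the same idea might be extended with named groups and re.sub, which replaces every match of a pattern with some other text. To keep it runnable without the file on disk, the four lines of myfasta.fna are held in a list. The names fasta_lines, lengths and cleaned, and the decision to rename the odd keys to plain length= and isitbacon=, are assumptions made for this example.

```python
import re

# Same contents as myfasta.fna, kept in a list so no file is needed.
fasta_lines = [
    '>sequencename1 length=10 isitbacon=False',
    'ATGCAAGGCA',
    '>sequencename2 lengthyturkey=12 isitbacon1234=True',
    'ATGCCCCCAAGG',
]

lengths = {}   # sequence name -> length taken from the header
cleaned = []   # the lines with the header keys tidied up

for line in fasta_lines:
    m = re.search(r'^>(?P<name>\w+)\s+length\w*=(?P<length>\w+)', line)
    # Only header lines match; sequence lines give None and are kept as-is.
    if m:
        lengths[m.group('name')] = int(m.group('length'))
        # Replace whatever follows "length" / "isitbacon" before the "=".
        line = re.sub(r'length\w*=', 'length=', line)
        line = re.sub(r'isitbacon\w*=', 'isitbacon=', line)
    cleaned.append(line)

print(lengths)
for line in cleaned:
    print(line)
```

The sequence lines pass through untouched because the if statement skips them, exactly as in the interpreter session above.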

Where to go from here

  • This Site – Here you can try regular expressions using any of the main scripting languages. Very useful to test your regular expressions.

  • Build your regular expressions from left to right, a little at a time. If you build a huge one and then try it out, get ready for a painful experience.
  • Be prepared to be confused! It happens, don’t worry. Save your current expression somewhere else and then start reducing it in size until it matches. Then build it back up from there.
  • Don’t forget to check that your expression matched. If it did not, you get None back, and calling a method on None raises an AttributeError. That is, don’t do this unless you are so sure it will match that you are willing to put it into space shuttle code.
>>> import re
>>> mymatches = re.match( 'some pattern', 'some text' ).groups()
  • Named groups are great. I love dictionaries.
  • Get a good grasp of the items under this link
  • Learn what re.MULTILINE and re.DOTALL do
  • Be very careful with .* because it will bite you. It is greedy and grabs as much text as it can. Just know that the non-greedy .*? is out there to help you
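To make the last two tips a little more concrete, here is a small sketch using the FASTA text from earlier. It uses re.findall, which returns a list of every match, so it goes one step beyond the re.search and re.match calls used in the rest of this post. The variable names are mine.

```python
import re

header = '>sequencename1 length=10 isitbacon=False'

# .* is greedy: it runs to the LAST '=' on the line.
greedy = re.search(r'>(.*)=', header).group(1)
# .*? is non-greedy: it stops at the FIRST '='.
lazy = re.search(r'>(.*?)=', header).group(1)
print(greedy)
print(lazy)

fasta_text = ('>sequencename1 length=10 isitbacon=False\n'
              'ATGCAAGGCA\n'
              '>sequencename2 lengthyturkey=12 isitbacon1234=True\n'
              'ATGCCCCCAAGG\n')

# Without re.MULTILINE, ^ only matches at the very start of the text.
names_default = re.findall(r'^>(\w+)', fasta_text)
# With re.MULTILINE, ^ also matches right after every newline.
names_multiline = re.findall(r'^>(\w+)', fasta_text, re.MULTILINE)
print(names_default)
print(names_multiline)

# Without re.DOTALL, . does not match a newline, so .* stops at the line end.
one_line = re.search(r'>.*', fasta_text).group()
# With re.DOTALL, . matches newlines too, so .* runs to the end of the text.
all_lines = re.search(r'>.*', fasta_text, re.DOTALL).group()
print(one_line)
print(len(all_lines) == len(fasta_text))
```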