Separating sentences

Saturday 19 April 2008

This is almost 17 years old. Be careful.

One of the things I needed for my new home page design was a way to split a chunk of HTML to get just the text of the first sentence, which I use for the blog posts on the front page.

The preliminaries: these are Django filters, but mostly they’re just string functions, wrapped with Django decorators to make them useful in Django templates.

Here are two helpers:

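# Module setup assumed by the filters in this post
# (standard Django template-library boilerplate).
import re

from django import template
from django.template.defaultfilters import stringfilter

register = template.Library()

@register.filter()
@stringfilter
def inner_html(value):
    """ Strip off the outer tag of the HTML passed in.
    """
    if value.startswith('<'):
        # Drop everything up to the first '>' and from the last '<' onward.
        value = value.split('>', 1)[1].rsplit('<', 1)[0]
    return value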

@register.filter()
@stringfilter
def first_par(value):
    """ Take just the first paragraph of the HTML passed in.
    """
    return value.split("</p>")[0] + "</p>"

These functions are pretty simple, meant to operate on simple inputs. For example, first_par assumes that the opening tag of the HTML is <p>.
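To see what the helpers do outside a template, here is a minimal sketch that copies them as plain functions. Dropping the Django decorators is an assumption made only so the example runs on its own; the sample HTML strings are made up.

```python
def inner_html(value):
    """Strip off the outer tag of the HTML passed in."""
    if value.startswith('<'):
        value = value.split('>', 1)[1].rsplit('<', 1)[0]
    return value


def first_par(value):
    """Take just the first paragraph of the HTML passed in."""
    return value.split("</p>")[0] + "</p>"


# The simple case the helpers are written for: input that opens with <p>.
html = "<p>First one. Still first.</p><p>Second paragraph.</p>"
first = first_par(html)
text = inner_html(first)
print(first)
print(text)

# Input that does not open with <p>: first_par keeps everything before the
# first </p>, so the heading rides along and inner_html leaves stray tags.
heading_first = first_par("<h1>Title</h1><p>Body text.</p>")
print(inner_html(heading_first))
```

The last call shows why the inputs need to be simple: nothing here parses HTML, it only splits strings.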

Splitting sentences is fairly tricky. I searched for a Python snippet but didn’t find one. I tried thinking about regexes that could do it, but the rules are too complicated. In the end, the code structure I could understand was to break the text into words, then add words one at a time to a potential sentence, checking it for sentence-hood.

Here are the rules I came up with for something being a sentence:

  • The end of the sentence must be punctuation (.!?), possibly with closing parens and/or double-quote after it.
  • The next chunk of text has to start with an upper-case letter or number, possibly with an opening paren and/or double-quote preceding it.
  • The sentence can’t end with “Mr.” or titles like it, or an initial. This is to keep the previous two rules from splitting sentences like “Hello Mr. John Q. Public!” incorrectly in the middle.
  • The sentence needs to have balanced parens and double-quotes. This ensures that sentence breaks won’t be identified in quoted material (or parenthetical asides).

These rules seem to work well for picking out the first sentence from each of my 1800-odd blog posts. Here’s the code:

@register.filter()
@stringfilter
def first_sentence(value):
    """ Take just the first sentence of the HTML passed in.
    """
    value = inner_html(first_par(value))
    words = value.split()
    # Collect words until the result is a sentence.
    sentence = ""
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        if not re.search(r'[.?!][)"]*$', sentence):
            # End of sentence doesn't end with punctuation.
            continue
        if words and not re.search(r'^[("]*[A-Z0-9]', words[0]):
            # Next sentence has to start with an upper-case letter or a digit.
            continue
        if re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', sentence):
            # If the "sentence" ends with a title or initial, then it probably
            # isn't the end of the sentence.
            continue
        if sentence.count('(') != sentence.count(')'):
            # A sentence has to have balanced parens.
            continue
        if sentence.count('"') % 2:
            # A sentence has to have an even number of quotes.
            continue
        break
    
    return sentence

This is coded not for speed but for being able to see what it does and add new clauses as I find broken sentences. The candidate sentence starts out empty. Words are appended to it one at a time, and the sentence is checked against the rules after each word. If any rule is violated, we continue to the next word. If all the rules pass, we break out of the loop and return the found sentence. If the words run out first, the whole paragraph is returned.
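The loop can be watched in action with an instrumented copy. This is only a sketch: `trace_first_sentence` is a hypothetical name, it takes plain text rather than HTML (so the `inner_html` and `first_par` helpers are skipped), and the reason strings are mine. The checks themselves are copied from the filter above.

```python
import re


def trace_first_sentence(text):
    """Return (sentence, rejections) for plain text.

    rejections lists each candidate that ended in punctuation
    but failed one of the later rules, with the reason.
    """
    words = text.split()
    sentence = ""
    rejections = []
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        if not re.search(r'[.?!][)"]*$', sentence):
            continue  # no closing punctuation yet; not logged
        if words and not re.search(r'^[("]*[A-Z0-9]', words[0]):
            rejections.append((sentence, "next word is not capitalized"))
            continue
        if re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', sentence):
            rejections.append((sentence, "ends with a title or initial"))
            continue
        if sentence.count('(') != sentence.count(')'):
            rejections.append((sentence, "unbalanced parens"))
            continue
        if sentence.count('"') % 2:
            rejections.append((sentence, "odd number of double quotes"))
            continue
        break
    return sentence, rejections


sentence, rejections = trace_first_sentence(
    "Hello Mr. John Q. Public! This is the second sentence."
)
for candidate, reason in rejections:
    print(f"rejected {candidate!r}: {reason}")
print(f"result: {sentence!r}")
```

On the example from the rules list, the candidates "Hello Mr." and "Hello Mr. John Q." are both turned down by the title-or-initial rule, and the loop stops at "Hello Mr. John Q. Public!".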

I know this code isn’t perfect. Here are some things it doesn’t do well:

  • Sentences quoted with single quotes, because just counting them isn’t sufficient. Apostrophes are the same character as single quotes, so the count isn’t always even.
  • Text with curly quotes.
  • Sesame Street sentences: “This blog brought to you by the letter B.”
  • Sentences about punctuation, or with code in them.
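A few of these weak spots can be reproduced with the individual checks, without running the whole filter. The snippet below is a rough sketch; the regexes are copied from the code above, while the variable names and sample strings are made up for illustration.

```python
import re

# Patterns copied from the first_sentence filter.
END_PUNCTUATION = r'[.?!][)"]*$'
TITLE_OR_INITIAL = r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$'

# Sesame Street sentence: the trailing " B." looks like an initial,
# so the real end of the sentence is rejected.
sesame = "This blog brought to you by the letter B."
sesame_rejected = bool(re.search(TITLE_OR_INITIAL, sesame))

# Single quotes: an apostrophe is the same character as a quote mark,
# so a parity check like the one used for double quotes would misfire.
single = "It's called 'parsing'."
single_quote_count = single.count("'")  # odd, although the quotes pair up

# Curly quotes: the balance check only counts the straight '"' character,
# so an open curly quote does not block a break inside quoted material.
curly = "He said \u201cStop."
straight_count = curly.count('"')
looks_finished = bool(re.search(END_PUNCTUATION, curly))

print(sesame_rejected, single_quote_count, straight_count, looks_finished)
```

Each case fails for a different reason, which is why the fixes would have to be separate clauses rather than one tweak to an existing rule.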

Actually, there are lots of cases that will not be handled well. I’m sure word-play enthusiasts will enjoy coming up with examples.

Comments

[gravatar]
The NLTK has the "Punkt" sentence tokenizer which seems pretty good. Though it may be a bit over the top.

http://nltk.org/doc/guides/tokenize.html#punkt-tokenizer
[gravatar]
With all due respect, Ned, you've just built the Segway of sentence parsers - a $5,000 solution to a $5 problem.

Your solution addresses a known subset of cases, but is going to need constant futzing with each time a new syntax is encountered. Worse, it's most often going to be used where the primary requirement is providing summaries up to a maximum length, but it doesn't limit the sentence length if 'value' fails to match one of the anticipated grammars. Of course, it's trivial to add that logic, but then you end up with essentially the same code you probably should have used to begin with:

sentence = value.substr(0, maxLength) + " ...";

Signed,

Xavier "Molehill" McGillicutty ;)
[gravatar]
Thanks. A great tool. It seems to work on this page but does not seem to work on some others including the wikipedia reference page on Readability. Am I doing something wrong?
[gravatar]
@fredrik: It is interesting to see the serious tools that do this kind of work. But you are right: it is overkill for my task, and doesn't even seem to be self-contained: it needs to be trained on a corpus before it will work.

@Robert: opinions may differ. I prefer to have one complete sentence in my summaries. Your site displays the first 50 (or so) words of the post. I don't like the way it cuts off a sentence. And I'm not sure 25 lines of code could be accurately described as a $5000 solution. But I'm glad you've found something that works for you...

@Dorai: this code wasn't meant to operate on complete pages. For example, the first_par function assumes that the input is in the form "<p>....</p>....", so won't be suitable for entire Wikipedia pages. Also, for added fun, the first instance of "<p>" in the page is an empty paragraph!
[gravatar]
I've had to deal with sentence splitting in a work project, and the choice was either to use some elaborate software which is trained on some kind of corpus (and has a not-quite-Free Software licence) or to do the regular expression thing (with extras). I found a good starting point in a message on the NLTK forum/list at SourceForge, although they now don't seem to want to serve up the content:

http://sourceforge.net/mailarchive/message.php?msg_id=14030243

The approach is similar: permit quoting or bracketing around the sentence, and use known punctuation marks to detect the end of the sentence. In addition, I added some postprocessing for well-known abbreviations and initials which, if detected, disallow any splitting suggested by the regex.
[gravatar]
@Ned: Okay, the Segway reference was a bit over the top. My apologies. I appreciate the aesthetic that whole-sentence summaries lend to your site ("very spiffy!"), and there are certainly problem spaces where sentence parsing is useful (so thank you for your contribution there). But I'm not sure blog article summaries is one of them. It's not too difficult to play Devil's advocate and point out multiple shortcomings in your 25-line solution that the more traditional 1-line solution doesn't suffer from...

- Ellipses are great visual cues to users, meaning "more content to follow". But you've removed them. Your summaries actually look more self-contained than they really are and, thus, don't really invite readers to explore the article in quite the same way.
- Your approach results in a non-deterministic amount of content. This makes managing the aesthetic layout of that portion of your home page just a little bit harder.
- What if you want to start with a short, exclamatory sentence? E.g. "Damn! You wouldn't believe what I saw today ...". This approach forces authors to use more expository opening sentences than they might otherwise want.
- If you do need to enforce a maximum length, you end up with two possible truncation schemes ("..." or whatever punctuation the sentence ends with).

Anyhow, I do like the new site look, but I will miss all the opening exclamations I know would otherwise pepper your site. ;)
[gravatar]
Damn! Robert, thanks for your concern! Since your interest in this topic doesn't seem to have waned, I'll answer:

- I might add ellipses. I figured the explicit ">> more.." would be sufficient, but I'm still playing with details like that, so it could still change.
- I'm not worried about the non-deterministic length of the content. The layout is meant to flow, and it will accommodate it. A fixed number of words or even characters isn't a guarantee of fit, either, so I'd rather design the page to allow for the uncertainty.
- I'm not worried about artificial back-pressure on blog posts. First, I think "Damn!" might be a fine first sentence for the home page. Second, by looking at the actual first sentences of my blog posts over the years, I concluded that generally my style lent itself to first-sentence summarization.

Thanks about the home page. Hopefully there won't be fewer opening exclamations in blog posts because of it. I'll add some to comment responses to make up for it just in case.. ;-)
[gravatar]
On my blog, I have a special field to display the first excerpt, which gives more flexibility. In the next revision of the software it will search for an END OF BLURB marker, and I simply write this HTML comment marker manually when I write a blog post.

It isn't automatic but will be more robust.
[gravatar]
hi..
i m working on a project named STATISTICAL SUMMARIZER that requires sentence separator.
So can you please help me out and provide me some solution or any kind of help..
i m working in C environment..
[gravatar]
hi.....
i m working on a project named STATISTICAL SUMMARIZER that requires sentence separator.
So can you please help me out and provide me some solution or any kind of help..
i m working in JAVA environment..
