Saturday 19 April 2008 — This is almost 17 years old. Be careful.
One of the things I needed for my new home page design was a way to split a chunk of HTML to get just the text of the first sentence, which I use for the blog posts on the front page.
The preliminaries: these are Django filters, but mostly they’re just string functions, wrapped with Django decorators to make them useful in Django templates.
Here are two helpers:
import re  # used by first_sentence below
from django import template
from django.template.defaultfilters import stringfilter

register = template.Library()

@register.filter()
@stringfilter
def inner_html(value):
    """ Strip off the outer tag of the HTML passed in.
    """
    if value.startswith('<'):
        value = value.split('>', 1)[1].rsplit('<', 1)[0]
    return value

@register.filter()
@stringfilter
def first_par(value):
    """ Take just the first paragraph of the HTML passed in.
    """
    return value.split("</p>")[0] + "</p>"
These functions are pretty simple, meant to operate on simple inputs. For example, first_par assumes that the opening tag of the HTML is <p>.
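To see what "simple inputs" means in practice, here is a minimal sketch that runs the same two string operations as plain functions. The `_plain` names are made up for this example, and the Django decorators are dropped so it runs without Django; the bodies are otherwise the ones above.

```python
def inner_html_plain(value):
    # Same logic as inner_html, minus the Django decorators.
    if value.startswith('<'):
        value = value.split('>', 1)[1].rsplit('<', 1)[0]
    return value

def first_par_plain(value):
    # Same logic as first_par, minus the Django decorators.
    return value.split("</p>")[0] + "</p>"

html = "<p>First <b>one</b>.</p><p>Second.</p>"
par = first_par_plain(html)
print(par)                    # <p>First <b>one</b>.</p>
print(inner_html_plain(par))  # First <b>one</b>.

# Input that isn't wrapped in <p> still gets a stray "</p>" appended.
print(first_par_plain("<div>No paragraph</div>"))  # <div>No paragraph</div></p>
```

The last call shows why the input needs to start with a paragraph: nothing checks for one, so the closing tag is added regardless.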
Splitting sentences is fairly tricky. I tried searching for a Python snippet, which I didn’t find. I tried thinking about regexes that could do it, but the rules are too complicated. In the end, the code structure I could understand was to break the text into words, and then add words one at a time to a potential sentence, checking it for sentence-hood.
Here are the rules I came up with for something being a sentence:
- The end of the sentence must be punctuation (.!?), possibly with closing parens and/or double-quote after it.
- The next chunk of text has to start with an upper-case letter or number, possibly with an opening paren and/or double-quote preceding it.
- The sentence can’t end with “Mr.” or titles like it, or an initial. This is to keep the previous two rules from splitting sentences like “Hello Mr. John Q. Public!” incorrectly in the middle.
- The sentence needs to have balanced parens and double-quotes. This ensures that sentence breaks won’t be identified in quoted material (or parenthetical asides).
These rules seem to work well for picking out the first sentence from each of my 1800-odd blog posts. Here’s the code:
@register.filter()
@stringfilter
def first_sentence(value):
    """ Take just the first sentence of the HTML passed in.
    """
    value = inner_html(first_par(value))
    words = value.split()
    # Collect words until the result is a sentence.
    sentence = ""
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        if not re.search(r'[.?!][)"]*$', sentence):
            # End of sentence doesn't end with punctuation.
            continue
        if words and not re.search(r'^[("]*[A-Z0-9]', words[0]):
            # Next sentence has to start with upper case.
            continue
        if re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', sentence):
            # If the "sentence" ends with a title or initial, then it probably
            # isn't the end of the sentence.
            continue
        if sentence.count('(') != sentence.count(')'):
            # A sentence has to have balanced parens.
            continue
        if sentence.count('"') % 2:
            # A sentence has to have an even number of quotes.
            continue
        break
    return sentence
This is coded not for speed but so that I can see what it does and add new clauses as I find broken sentences. The candidate sentence starts out empty. Words are appended to it one at a time, and the sentence is checked against the rules. If any rule is violated, we continue to the next word. If all the rules pass, we break out of the loop and return the found sentence.
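To watch the loop work on a couple of inputs, the following is a Django-free copy of the same checks. The name `first_sentence_plain` is made up for this sketch, and it skips the `first_par`/`inner_html` step, so it expects plain text rather than HTML.

```python
import re

def first_sentence_plain(text):
    """Same rule checks as first_sentence, without Django or HTML handling."""
    words = text.split()
    sentence = ""
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        if not re.search(r'[.?!][)"]*$', sentence):
            continue  # no closing punctuation yet
        if words and not re.search(r'^[("]*[A-Z0-9]', words[0]):
            continue  # next word doesn't look like a sentence start
        if re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', sentence):
            continue  # ends with a title or an initial
        if sentence.count('(') != sentence.count(')'):
            continue  # unbalanced parens
        if sentence.count('"') % 2:
            continue  # odd number of double-quotes
        break
    return sentence

print(first_sentence_plain("Hello Mr. John Q. Public! How are you?"))
# Hello Mr. John Q. Public!
print(first_sentence_plain("(This is an aside. Really.) Next sentence."))
# (This is an aside. Really.)
```

In the first call, "Mr." and "Q." both pass the punctuation and upper-case checks but are rejected by the title/initial rule. In the second, the period after "aside" is rejected because the parens are not yet balanced.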
I know this code isn’t perfect. Here are some things it doesn’t do well:
- Sentences with single-quote quotes, because just counting them isn’t sufficient. Apostrophes and single-quotes make it so that the number isn’t always even.
- Text with curly quotes.
- Sesame Street sentences: “This blog brought to you by the letter B.”
- Sentences about punctuation, or with code in them.
Actually, there are lots of cases that will not be handled well. Word-play enthusiasts I’m sure will enjoy coming up with examples.
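For the curly-quote case specifically, one possible workaround (an assumption sketched here, not something the filter above does) is to map curly double-quotes to straight ones before the rules run, so that counting `"` works again. `normalize_double_quotes` is a hypothetical helper name.

```python
# Hypothetical helper, not part of the original filters.
# Maps left and right curly double-quotes to the straight double-quote.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})

def normalize_double_quotes(text):
    """Replace curly double-quotes so the '"' count reflects all quotes."""
    return text.translate(CURLY_TO_STRAIGHT)

print(normalize_double_quotes("\u201cStop. Wait.\u201d Then he left."))
# "Stop. Wait." Then he left.
```

The trade-off is that the returned sentence would then contain straight quotes too, unless the original text is kept alongside for display. Single curly quotes are left alone on purpose, since apostrophes make counting them unreliable.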
Comments
http://nltk.org/doc/guides/tokenize.html#punkt-tokenizer
Your solution addresses a known subset of cases, but is going to need constant futzing with each time a new syntax is encountered. Worse, it's most often going to be used where the primary requirement is providing summaries up to a maximum length, but it doesn't limit the sentence length if 'value' fails to match one of the anticipated grammars. Of course, it's trivial to add that logic, but then you end up with essentially the same code you probably should have used to begin with:
sentence = value.substr(0, maxLength) + " ...";
Signed,
Xavier "Molehill" McGillicutty ;)
@Robert: opinions may differ. I prefer to have one complete sentence in my summaries. Your site displays the first 50 (or so) words of the post. I don't like the way it cuts off a sentence. And I'm not sure 25 lines of code could be accurately described as a $5000 solution. But I'm glad you've found something that works for you...
@Dorai: this code wasn't meant to operate on complete pages. For example, the first_par function assumes that the input is in the form "<p>....</p>....", so won't be suitable for entire Wikipedia pages. Also, for added fun, the first instance of "<p>" in the page is an empty paragraph!
http://sourceforge.net/mailarchive/message.php?msg_id=14030243
The approach is similar: permit quoting or bracketing around the sentence, and use known punctuation marks to detect the end of the sentence. In addition, I added some postprocessing for well-known abbreviations and initials which, if detected, disallow any splitting suggested by the regex.
- Ellipses are great visual cues to users, meaning "more content to follow". But you've removed them. Your summaries actually look more self-contained than they really are and, thus, don't really invite readers to explore the article in quite the same way.
- Your approach results in a non-deterministic amount of content. This makes managing the aesthetic layout of that portion of your home page just a little bit harder.
- What if you want to start with a short, exclamatory sentence? E.g. "Damn! You wouldn't believe what I saw today ...". This approach forces authors to use more expository opening sentences than they might otherwise want.
- If you do need to enforce a maximum length, you end up with two possible truncation schemes ("..." or whatever punctuation the sentence ends with).
Anyhow, I do like the new site look, but I will miss all the opening exclamations I know would otherwise pepper your site. ;)
- I might add ellipses. I figured the explicit ">> more.." would be sufficient, but I'm still playing with details like that, so it could still change.
- I'm not worried about the non-deterministic length of the content. The layout is meant to flow, and it will accommodate it. A fixed number of words or even characters isn't a guarantee of fit, either, so I'd rather design the page to allow for the uncertainty.
- I'm not worried about artificial back-pressure on blog posts. First, I think "Damn!" might be a fine first sentence for the home page. Second, by looking at the actual first sentences of my blog posts over the years, I concluded that generally my style lent itself to first-sentence summarization.
Thanks about the home page. Hopefully there won't be fewer opening exclamations in blog posts because of it. I'll add some to comment responses to make up for it just in case.. ;-)
It isn't automatic but will be more robust.
I'm working on a project named STATISTICAL SUMMARIZER that requires a sentence separator.
So can you please help me out and provide me some solution or any kind of help..
I'm working in a C environment..
I'm working on a project named STATISTICAL SUMMARIZER that requires a sentence separator.
So can you please help me out and provide me some solution or any kind of help..
I'm working in a JAVA environment..