One of the things I needed for my new home page design was a way to split a chunk of HTML to get just the text of the first sentence, which I use for the blog posts on the front page.
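Here are two helpers:

def inner_html(value):
    """ Strip off the outer tag of the HTML passed in. """
    # split/rsplit return lists, so pick the piece between the outer tags.
    value = value.split('>', 1)[1].rsplit('<', 1)[0]
    return value

def first_par(value):
    """ Take just the first paragraph of the HTML passed in. """
    return value.split("</p>")[0] + "</p>"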
These functions are pretty simple, meant to operate on simple inputs. For example, first_par assumes that the opening tag of the HTML is <p>.
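To make that assumption concrete, here is a small sketch of the two helpers at work. The definitions are repeated with the list indexing written out so the snippet runs on its own, and the sample HTML strings are made up for illustration.

```python
def inner_html(value):
    """Strip off the outer tag of the HTML passed in."""
    return value.split('>', 1)[1].rsplit('<', 1)[0]

def first_par(value):
    """Take just the first paragraph of the HTML passed in."""
    return value.split("</p>")[0] + "</p>"

html = "<p>First <b>bold</b> bit.</p><p>Second.</p>"
first = first_par(html)    # "<p>First <b>bold</b> bit.</p>"
text = inner_html(first)   # "First <b>bold</b> bit."

# When the HTML doesn't open with <p>, the outer tags no longer pair up,
# and a stray tag is left behind.
wrapped = inner_html(first_par("<div><p>Hi.</p></div>"))   # "<p>Hi."
```

Only the outermost tag is removed, so inline markup such as <b> stays in the text.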
Splitting sentences is fairly tricky. I searched for a Python snippet, but didn't find one. I tried thinking about regexes that could do it, but the rules are too complicated. In the end, the code structure I could understand was to break the text into words, and then add words one at a time to a potential sentence, checking it for sentence-hood.
Here are the rules I came up with for something being a sentence:
- The end of the sentence must be punctuation (.!?), possibly with closing parens and/or double-quote after it.
- The next chunk of text has to start with an upper-case letter or number, possibly with an opening paren and/or double-quote preceding it.
- The sentence can't end with "Mr." or titles like it, or an initial. This is to keep the previous two rules from splitting sentences like "Hello Mr. John Q. Public!" incorrectly in the middle.
- The sentence needs to have balanced parens and double-quotes. This ensures that sentence breaks won't be identified in quoted material (or parenthetical asides).
These rules seem to work well for picking out the first sentence from each of my 1800-odd blog posts. Here's the code:

import re

def first_sentence(value):
    """ Take just the first sentence of the HTML passed in. """
    value = inner_html(first_par(value))
    words = value.split()
    # Collect words until the result is a sentence.
    sentence = ""
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        if not re.search(r'[.?!][)"]*$', sentence):
            # End of sentence doesn't end with punctuation.
            continue
        if words and not re.search(r'^[("]*[A-Z0-9]', words[0]):
            # Next sentence has to start with upper case.
            continue
        if re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', sentence):
            # If the "sentence" ends with a title or initial, then it probably
            # isn't the end of the sentence.
            continue
        if sentence.count('(') != sentence.count(')'):
            # A sentence has to have balanced parens.
            continue
        if sentence.count('"') % 2:
            # A sentence has to have an even number of quotes.
            continue
        break
    return sentence
This is coded not for speed but for being able to see what it does and add new clauses as I find broken sentences. The candidate sentence starts out empty. Words are appended to it one at a time, and the sentence is checked against the rules. If any rule is violated, we continue to the next word. If all the rules pass, we break out of the loop and return the found sentence.
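If the list of clauses keeps growing, one possible variation is to keep the rules in a list and loop over them. This is only a sketch of that idea, not the code running on the site: RULES and first_sentence_text are names made up here, and the function takes plain text rather than HTML so it can stand alone.

```python
import re

# Each rule gets the candidate sentence and the next word (or None at the
# end of the text) and returns True if the candidate can't end here.
RULES = [
    # Must end with punctuation, maybe followed by closing parens/quotes.
    lambda s, nxt: not re.search(r'[.?!][)"]*$', s),
    # The next word has to start with an upper-case letter or number.
    lambda s, nxt: nxt is not None and not re.search(r'^[("]*[A-Z0-9]', nxt),
    # Can't end with a title or an initial.
    lambda s, nxt: bool(re.search(r'(Mr\.|Mrs\.|Ms\.|Dr\.| [A-Z]\.)$', s)),
    # Parens have to balance.
    lambda s, nxt: s.count('(') != s.count(')'),
    # Double-quotes have to come in pairs.
    lambda s, nxt: s.count('"') % 2 == 1,
]

def first_sentence_text(text):
    """Return the first sentence of plain text, using RULES."""
    words = text.split()
    sentence = ""
    while words:
        if sentence:
            sentence += " "
        sentence += words.pop(0)
        nxt = words[0] if words else None
        if not any(rule(sentence, nxt) for rule in RULES):
            break
    return sentence

print(first_sentence_text('Hello Mr. John Q. Public! How are you?'))
print(first_sentence_text('She asked "Why? Tell me." and left. Next.'))
```

The first call stops after "Public!" and the second after "left.", because the title, initial, and quote rules veto the earlier candidate endings. Adding a clause then means adding one more entry to RULES.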
I know this code isn't perfect. Here are some things it doesn't do well:
- Sentences quoted with single quotes, because just counting them isn't sufficient. Apostrophes use the same character, so the count isn't always even.
- Text with curly quotes.
- Sesame Street sentences: "This blog brought to you by the letter B."
- Sentences about punctuation, or with code in them.
Actually, there are lots of cases that will not be handled well. I'm sure word-play enthusiasts will enjoy coming up with examples.
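Of the gaps listed above, curly quotes look like the easiest to close. One possible approach, which I haven't tried against the real posts, is to map curly double-quotes to straight ones before counting. The helper name normalize_quotes and the sample text are invented for this sketch.

```python
# Translate typographic double-quotes to the straight quote that the
# balance rule counts. Single curly quotes are left alone, since they
# double as apostrophes.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})

def normalize_quotes(text):
    """Return text with curly double-quotes replaced by straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)

sample = "He said \u201cStop. Go on.\u201d Then he left."
print(sample.count('"'))                     # 0: the curly quotes are invisible to the rule
print(normalize_quotes(sample).count('"'))   # 2: now the quote rule can see them
```

The first sentence would then be found in the normalized text, so the result comes back with straight quotes rather than the original curly ones.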