Multi-threading?

Thursday 28 July 2005

Almost two weeks ago, Roushan Ali asked a simple question on the sqlite-users mailing list:

Can we use single sqlite_open handle(global) across threads (if all database operations are serialized by using semaphore)? Please help.

D. Richard Hipp, the author of SQLite (an awesome embedded relational database, by the way) responded:

Actually, this seems like a good opportunity to repeat my oft-ignored advice to not use more than one thread in a single address space. If you need multiple threads, create multiple processes. This has nothing to do with SQLite; it is just good programming advice. I have worked on countless multi-threaded programs over the years, and I have yet to see a single one that didn’t contain subtle, hard to reproduce, and very hard to troubleshoot bugs related to threading issues.

And from there, the deluge began. The discussion is going on 40 responses and has turned into the computer-science equivalent of a pig-pile smack-down, with competing claims, snotty asides, arrogant put-downs, references to papers, snarly analysis of same, Linux-vs-Windows sniping, and so on.

Sifting through the noise, there’s a lot of interesting material about how best to develop software which has to “do many things at once”. Someone pointed to The C10K Problem, which looks like an interesting summary of the options for building high-throughput network servers.

I’ve done lots of multi-threaded coding, and it is difficult. If you can keep the cardinal rules in mind (protect shared data and impose a strong ordering on locks) and be extremely diligent, you can make it work. But when it doesn’t, you have to do some very difficult debugging. I’ve never used the other styles, and the arguments for them sound very compelling.
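To make that second rule concrete, here's a minimal sketch (mine, not from the thread): if every piece of code that needs two locks takes them in one fixed global order, no two threads can each hold one lock while waiting for the other, which is the classic deadlock.

#include <pthread.h>

/* Hypothetical example: two accounts, each protected by its own mutex. */
struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Take both locks in one fixed global order (here: by address), so that
   concurrent transfer(a, b) and transfer(b, a) calls can never deadlock. */
void transfer(struct account *from, struct account *to, long amount)
{
    struct account *first  = (from < to) ? from : to;
    struct account *second = (from < to) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;            /* the shared data both locks protect */
    to->balance   += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}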

Comments

As some in the thread say, there are times when threading is appropriate and there are times when it's not. As I put it in my server-design article (referenced from C10K):

One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor's worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.

If you want to take advantage of having multiple processors, and we're now in a world where the latest version of the processor most people use on their desktop has two physical and four logical processors on a single die with multi-die systems being common, you have to use multiple threads. Yes, it's harder to do right. Yes, there might be more bugs that are harder to find, and not only when the people writing the code are clueless. That's also true of writing network code, but they're not excuses why either shouldn't be done. By all means, if you're writing a type of application that does not need to harness multiple processors and your programming model supports it, use one thread. Don't assume everyone else lives in such a convenient and simple world, though. The advice to use multiple processes instead is just horrible from a performance standpoint because a process switch does everything a thread switch does and then some.

I'd also like to address some specific comments in the other thread about single-threaded versions of programs touching fewer pages and running faster than their multithreaded counterparts. The first part is just bunk, confusing correlation with causation. I've seen many multithreaded programs that have good locality of reference, and single-threaded ones that are terrible. The second part is often true under light load but that's just not interesting in many cases. A web server, for example, typically exhibits a particular level of throughput for a single connection, then increasing throughput as connections increase until saturation is reached, then either flat or declining performance thereafter. Who cares about the single-connection performance? Just about nobody, unless it's really awful. What really matters in that and many similar environments is the height of that peak and the number of connections necessary to reach it. It's common for multi-threaded servers to start at a slightly lower point (nobody cares) but have a much higher peak at a much higher number of connections (people care a lot). If this is the class of program you work with, which is CPU rather than network- or I/O-bound, you simply can't make excuses about the difficulty of multithreading. You have to deal with it instead if you want to be competitive with other products using similar platforms.

Addendum: it's amazing to hear anyone associated with SQLite make the multiple-process suggestion considering that much of the reason people might use SQLite instead of (for example) MySQL is precisely because it lives in the same process and thus does not require a full context switch per operation. One might think that indicates some awareness of the tradeoff between context-switch cost vs. separate-address-space safety, but apparently not.

To clarify: The multi-process suggestion for SQLite is not that SQLite should be in one process with the application code in another, but that the application itself should be multi-process, with each process running SQLite embedded inside it, accessing the same database file.
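In code, the suggestion looks roughly like this (a hedged sketch against the SQLite 3 C API; the file name and table are invented for illustration):

#include <sqlite3.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Each process embeds SQLite and opens the same database file; SQLite's
   own file locking serializes the writers. */
static void work(const char *who)
{
    sqlite3 *db;
    char *err = NULL;
    char sql[128];

    if (sqlite3_open("shared.db", &db) != SQLITE_OK)
        return;
    sqlite3_busy_timeout(db, 5000);     /* wait out the other process's lock */
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS log (who TEXT)",
                 NULL, NULL, NULL);
    snprintf(sql, sizeof(sql), "INSERT INTO log VALUES ('%s')", who);
    if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "%s: %s\n", who, err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
}

int main(void)
{
    if (fork() == 0) {                  /* two processes, one database file */
        work("child");
        _exit(0);
    }
    work("parent");
    wait(NULL);
    return 0;
}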

A bit off-topic - I find gmane gives me a better big-picture view of a thread:
http://thread.gmane.org/gmane.comp.db.sqlite.general/13124

Thanks for the gmane link. I looked at gmane and somehow missed the way to get a link like that.

Most of the time this misperception seems to be because of one of two things: 1) universities seem happy to tell students to "just create threads"; 2) as I understand it, win32 pretty much requires threads. The latter is also why Firefox, OpenOffice, apache-httpd-2.0, Java, etc. all do it; it's pretty much the only portable option. Of course, the fact that so many high-profile applications do it just reinforces it for everyone else who hasn't thought about it :(.

> If you want to take advantage of having multiple processors, [...], you have to use multiple threads.

s/threads/tasks/ ... Those who do not understand fork() are doomed to reimplement it poorly.

> context-switch cost vs. separate-address-space safety

I can pretty much guarantee you've never benchmarked a context switch between two tasks sharing an address space vs. two tasks not sharing one, esp. given the comment above. Here's a free clue: a TLB flush is _all_ you can possibly save, and to trade for that you introduce locking everywhere, bounce data between cache lines, and take a big dump on the legibility of your source code.
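For anyone who wants numbers rather than assertions, here is a minimal sketch of such a microbenchmark (my construction, not from the thread): ping-pong one byte between two tasks over a pair of pipes, once with fork() and once with pthread_create(), and compare the cost per round trip.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define ROUNDS 100000

static int ping[2], pong[2];            /* one pipe per direction */

static void bounce(void)                /* echo side: read a byte, send it back */
{
    char c;
    int i;
    for (i = 0; i < ROUNDS; i++)
        if (read(ping[0], &c, 1) != 1 || write(pong[1], &c, 1) != 1)
            _exit(1);
}

static void *bounce_thread(void *unused)
{
    (void)unused;
    bounce();
    return NULL;
}

static void time_rounds(const char *label)  /* driver side: time the round trips */
{
    struct timeval t0, t1;
    char c = 'x';
    double us;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < ROUNDS; i++)
        if (write(ping[1], &c, 1) != 1 || read(pong[0], &c, 1) != 1)
            exit(1);
    gettimeofday(&t1, NULL);
    us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("%s: %.2f usec per round trip\n", label, us / ROUNDS);
}

int main(int argc, char **argv)
{
    if (pipe(ping) || pipe(pong))
        return 1;

    if (argc > 1 && strcmp(argv[1], "threads") == 0) {
        pthread_t tid;                  /* shared address space */
        pthread_create(&tid, NULL, bounce_thread, NULL);
        time_rounds("threads");
        pthread_join(tid, NULL);
    } else {
        pid_t pid = fork();             /* separate address spaces */
        if (pid == 0) {
            bounce();
            _exit(0);
        }
        time_rounds("processes");
        waitpid(pid, NULL, 0);
    }
    return 0;
}

Compile with -pthread and run it in both modes; on Linux, running it under taskset -c 0 pins both tasks to one core, so the numbers reflect switching cost rather than cross-CPU wakeups.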

Here's another free clue: if you only have 4 cores your application doesn't magically go faster if you have 40 threads ... each core can only physically run one at a time. Sure, if you only have 4 tasks one might halt for 10 msec when it takes a page fault ... but for the other 99.99% of the time you aren't switching back and forth competing for the same CPU resources (and if the tasks aren't competing for CPU, then why the hell are they running?).

> If you can keep the cardinal rules in mind ([...]) and be extremely diligent, you can make it work

Given how often I see deadlock fixes, or fixes for races, or fixes for just plain screwing up shared data go into the Linux kernel ... and I can see how many non-stupid people are working on it, I find it hard to believe when people say "you can make it work, just be diligent". The same can be said for pretty much every application I use that links against pthread (well, in some cases minus the many non-stupid people :).
It seems a lot like the C-style strings hallucination of "you can make it work, you just really need to make sure you use str*() and mem*() correctly this time". In my experience this is just not true: http://www.and.org/vstr/security

On the bright side, I've started seeing more people publicly speak out against the madness. The last memorable one was Andrew Tridgell:

http://sourcefrog.net/weblog/software/languages/java/java-threads.html

> I can pretty much guarantee you've never benchmarked a context switch between two tasks sharing an address space vs. two tasks not sharing one.

Don't be an asshole, James. I've been doing such microbenchmarks for fifteen years and have even on occasion published results on my website. Your assumption was unwarranted and obnoxious and only shows your own immaturity.


> Here's a free clue: a TLB flush is _all_ you can possibly save,

Here's a free clue for you, James: TLB flushes are far from cheap. I've actually implemented the code to do them on MP systems, so I know. I also worked on nUnix at Encore, which was the predecessor of the Linux process/thread model (different from most others), so I know that TLB flushes are not necessarily the only thing you save, even if that happens to be the case in Linux. Have you ever done one of those microbenchmarks that you demand others do, to measure the cost? If you didn't see that it was significant then you did the benchmark wrong.


> to trade for that you introduce locking everywhere, bounce data between cache lines, and take a big dump on the legibility of your source code

MP code is only illegible to novices and incompetents. Yes, if you're writing concurrent code you have to add locking and think about locality/contention issues, but that's the price of living in the real world instead of some toy-program paradise.


> if you only have 4 cores your application doesn't magically go faster if you have 40 threads ... each core can only physically run one at a time

I already addressed precisely this issue of the relationship between processors and tasks/threads/whatever, and much else that you say, in the server-design article I already cited. You should have read it before making quite such a fool of yourself.


> Given how often I see deadlock fixes, or fixes for races, or fixes for just plain screwing up shared data go into the Linux kernel ... and I can see how many non-stupid people are working on it

I've seen the same things, and reached a different conclusion: the people doing these things in the Linux kernel simply aren't as smart as they think they are. I've worked on MP kernels where these issues were much less severe, because the people I was working with knew what they were doing. There's a learning curve associated with programming concurrent systems, and the sad fact is that many of the Linux kernel hackers started making their changes before they had climbed it. Remember, Linus said for many years that MP was an exotic technology that wasn't worth supporting...until he was proven wrong. Then the same with NUMA. Some people just want excuses for not doing what they have to do, but that doesn't mean others should accept those excuses.


> I find it hard to believe when people say "you can make it work, just be diligent"

Nobody said that, James. There's more to it than just being diligent, but that doesn't mean the need can be avoided. If you want to do certain things you have to accept a certain level of additional complexity. Many things are easy to do in simple ways that don't scale or perform well, and I already said that if that's sufficient people should go for it, but if you really want to get top-tier performance for a particular kind of task you almost always have to accept some additional complexity.


> On the bright side, I've started seeing more people publicly speak out against the madness. The last memorable one was Andrew Tridgell

First, appeals to authority are fallacies. Second, Tridgell is as wrong as you are when he assumes TLB flushes are cheap. Third, the problems with Java threads and how people use them are an entirely different topic. Nobody in their right mind would claim that all thread implementations are reasonable, or that there aren't a lot of idiots (mis)using threads. However, the anti-thread meme is just as overly simplistic as the pro-thread one.

Tridge also gets a few other things wrong in his response that you so revere. For example, part of his reasoning is based on the cost of a copy that any good network programmer should have eliminated anyway. He seems only dimly aware of well known ways to reduce the contention cost for mapping file descriptors, or the fact that even if the descriptor tables themselves are separate the files themselves have to be locked so the separate-table advantage is negligible. Locking carries only a negligible cost (locked bus cycles) except when there's contention, and that can matter in aggregate if your overall locking paradigm is stupid, but with intelligent locking and scheduling it's not a significant performance factor.

Lastly, Tridge screws up when he assumes that people use the same IPC between threads that they would between processes. Even where that appears to be the case, any truly thread-aware IPC package should resolve the same calls to inline equivalents instead of syscalls when the IPC is process-local. The cost of putting things in separate processes is not just the time taken per context switch; it's also the number of context switches that are necessary. Separate processes means separate address spaces (mostly) which means whenever there's any state (explicit or otherwise) that needs to be exchanged between tasks it might force a context switch that would have been unnecessary with a shared address space. Explicit shared memory doesn't really solve the problem because it reintroduces all of the problems with memory shared via threading plus additional setup/cleanup/pointer-conversion complexity. Message batching can help to process multiple state changes with a single context switch, but no amount of batching is as good as not having to switch at all (just as no amount of interrupt aggregation is as good as not needing to interrupt). In a threaded environment you have the option of direct function calls and memory references instead of explicit IPC, avoiding context switches altogether in many cases. A lot of novices abuse that capability, get in trouble, and end up blaming threads, but it's possible to use it judiciously to get good performance. Ultimately you have to make an intelligent decision about what kinds of coupling are necessary and tolerable in your program, instead of relying on simplistic "threads are always good/evil" formulae.

Anybody who only discovered network programming two years ago and thinks that Tridge's email (in the last paragraph of which he even concedes that there's no single right answer) is a good argument against ever using threads really needs to recalibrate their idea of what constitutes a good argument.

> Don't be an asshole, James. I've been doing such microbenchmarks for fifteen years and have even on occasion published results on my website

I apologise; both the browsers I use have JavaScript turned off by default ... and Ned's site doesn't show names or websites without JS. So all I had to go on was the comment itself, and I concluded that you were repeating things you'd heard without having thought about or examined them.
So, again, I am sorry I didn't check better and assumed too much.

> If you didn't see that it was significant then you did the benchmark wrong.

Generally I try to avoid confrontations with people who I am pretty sure have a good idea what they are talking about (like David Schwartz http://groups-beta.google.com/group/comp.unix.programmer/browse_thread/thread/5dabf4e6e448e1c6/dc1b0445707feafb?hl=en#dc1b0445707feafb), because they will almost always have a very different view of the world and how to design for it, and are about as likely to change to not using threads as I am to using them. That makes heated arguments annoying and inconclusive.

So OK, I'll concede your point that TLB flushes can be measured ... but my view of the world suggests that if that matters you've done something wrong. The main part of your application shouldn't be spent context-switching from task X to task Y; you have basically one task per CPU core and consume all resources within it.
But again, David has a world view that you want no more than 100 network connections per task, giving you 100 tasks to handle 10,000 network connections ... I can see how TLB flushes might be measurable in that case, although I've not done that measurement, but in my world view you'd have two processes (on a dual-core box) handling 5,000 connections each, and context switching would be noise (and, yes, I've attempted to measure that).
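For concreteness, the shape of that world view is something like the sketch below (mine, not an actual server from the thread; a real one would multiplex its thousands of connections with select()/poll(), where this one handles one connection at a time per worker): a fixed number of worker processes, roughly one per CPU core, sharing one listening socket and sharing no memory.

#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 2                      /* roughly one per CPU core */

static void serve(int lfd)              /* worker: accept and echo, forever */
{
    char buf[4096];
    ssize_t n;
    int fd;

    for (;;) {
        if ((fd = accept(lfd, NULL, NULL)) < 0)
            continue;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            if (write(fd, buf, n) != n)
                break;
        close(fd);
    }
}

int main(void)
{
    struct sockaddr_in sa;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int i, one = 1;

    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(8080);          /* sin_addr stays 0.0.0.0 */
    bind(lfd, (struct sockaddr *)&sa, sizeof(sa));
    listen(lfd, 128);

    for (i = 0; i < NWORKERS; i++)      /* workers share the socket, not memory */
        if (fork() == 0)
            serve(lfd);
    while (wait(NULL) > 0)              /* parent just reaps */
        ;
    return 0;
}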

> MP code is only illegible to novices and incompetents

I didn't say it was illegible, merely that it is less legible. For instance, the following simplistic code:

if (foo(...) == -1)
    fprintf(stderr, "foo: %s\n", strerror(errno));

...now turns into:

if (foo(...) == -1)
{
    char buf[1024];

    /* strerror_r() is the thread-safe replacement for strerror(); the
       XSI variant used here returns 0 on success and fills buf. */
    if (strerror_r(errno, buf, sizeof(buf)))
        buf[0] = 0;                     /* fall back to an empty message */
    fprintf(stderr, "foo: %s\n", buf);
}

...and sure, you can wrap certain cases to contain the ugliness, and you can argue that sometimes the *_r() version is "not that ugly", but the fact that it's something that has to permeate the source is just inherently ugly to me.
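For example, one way to wrap it (a hypothetical helper, assuming the XSI strerror_r()):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: thread-safe strerror() with a caller-supplied buffer. */
static const char *err_str(int err, char *buf, size_t len)
{
    if (strerror_r(err, buf, len))      /* XSI variant: 0 on success */
        snprintf(buf, len, "error %d", err);
    return buf;
}

/* ...so the call site shrinks back to roughly one line:
       char buf[1024];
       fprintf(stderr, "foo: %s\n", err_str(errno, buf, sizeof(buf)));
*/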

> I already addressed precisely this issue ... in the server-design article I already cited

I'd already read it; I just didn't agree with everything. For instance, the advice in the Data Copies section is largely superseded by Vstr, which I wrote (but I wrote this after you wrote the paper). Also, in your "Context Switches" section you again equate task == thread; my webserver can have one _process_ per CPU with no threads. I.e. full advantage of multiple CPUs, close to zero context switching. I admit there aren't a lot of applications with this model, but then there aren't many threaded apps doing as well as SEDA etc. (mostly it seems, to me, to be more of "not sure how to remove this blocking code ... so we'll just add a thread").

> the people doing these things in the Linux kernel simply aren't as smart as they think they are

While I'll admit not all kernel contributors are the gods they think they are, there are more than a few of whom I'd be happy to say "I'm not significantly better than X" ... and those people still make these mistakes (although probably less often).
And it's not like Solaris, Netscape LDAP server, IIS or whatever are free of the same problems.
Compare this with me: I've had almost zero problems with any kind of deadlock or race condition.
Also, as a comparison, the "large" pieces of very stable and scalable code I have seen in the wild (NetApp, Bloomberg) use designs that are more like event-driven or co-operative "threading".

As with some of the links I provided before, my opinion is that this is much like the reason people are moving towards GC, away from using manually allocated C strings and have moved away from DOS/Mac OS 7 ... yes, in theory, it can be done as well but the cost far outweighs the benefits.

> If you want to do certain things you have to accept a certain
> level of additional complexity. Many things are easy to do in
> simple ways that don't scale or perform well, and I already
> said that if that's sufficient people should go for it, but if you
> really want to get top-tier performance for a particular kind of
> task you almost always have to accept some additional
> complexity.

I pretty much agree with all of that, maybe even said that exact thing (although maybe not as well) ... and I've said it as a reason why you don't want to use threads.
You can create processes, and create well-defined means for them to communicate about the things they need to communicate ... and this is initially much more complex than just adding pthread_create() and some locks. The advantage comes later when, to find bugs, you only have to look at the interfaces used to communicate and not the entire code path of all threads, as in the sketch below.
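A minimal sketch of that style (my illustration; the message struct is made up): the only shared surface between the two processes is one socketpair and one explicit message format, so that interface is all you have to audit when something goes wrong.

#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

struct msg { int op; int arg; };        /* the entire shared "interface" */

int main(void)
{
    struct msg m = { 1, 21 };
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
        return 1;

    if (fork() == 0) {                  /* child: serve requests until EOF */
        struct msg req;
        close(sv[0]);
        while (read(sv[1], &req, sizeof(req)) == sizeof(req)) {
            req.arg *= 2;               /* whatever the worker actually does */
            write(sv[1], &req, sizeof(req));
        }
        _exit(0);
    }

    close(sv[1]);                       /* parent: one request, one reply */
    write(sv[0], &m, sizeof(m));
    read(sv[0], &m, sizeof(m));
    printf("result: %d\n", m.arg);
    close(sv[0]);                       /* EOF tells the child to exit */
    wait(NULL);
    return 0;
}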

> However, the anti-thread meme is just as overly simplistic as the pro-thread one.

I admit that there is a certain amount of absolutism here. I decided a long time ago that C-style strings were so hard to get completely correct, and any speed advantage over a well-designed interface so small, that they were just "bad", with no room for debate. And certainly that seems to have worked for me, and the only applications that don't have buffer overflows seem to be ones written by people with a similar sense of absolutism ... and I see a declining number of people arguing against that now (although often they advocate using an entire managed language, which seems a little too much absolutism to me).
While I can certainly imagine cases where you really need to go multi-threaded and not just multi-tasked, I think of it much like arguing against C-style strings 10 years ago ... there are very few cases where you can even realistically argue that you need it, and so many advantages of going a different way by default.

> For instance, the advice in the Data Copies section is largely superseded by Vstr, which I wrote (but I wrote this after you wrote the paper).

That's OK if your universe consists only of strings, but that's generally true only for one sector of the industry. Serious networking (e.g. router code) and storage both deal more in blocks of opaque binary data. In some cases the data even have to be read into memory within a specific address range or aligned a specific way, and might not even be directly accessible to the protocol code. Vstr won't help with that, and I was trying to address a more general set of scenarios.

>in your "Context Switches" section you again equate task==thread, my webserver can have one _process_ per with no threads.

We've both been a bit sloppy with terminology. I try to use "task" to mean an abstract control-flow concept, and "thread" or "process" for the OS-managed entities onto which those are mapped, though I do slip sometimes especially with tasks vs. threads. The important point to which I was referring is that, regardless of threads vs. processes, the number of active tasks should not exceed the number of processors available to execute them. Only having a single thread is one way to do that, but IMO a degenerate way.

> mostly it seems, to me, to be more of "not sure how to remove this blocking code ... so we'll just add a thread"

You're thinking in easy-problem-domain terms again, but in the broader world it's often not just a matter of convenience. Sometimes the asynchronous interface you want just doesn't exist; many drivers, for example, provide certain functions only in synchronous form. Even when dealing with open source, it might not be practical to hack a driver for a complex device (Fibre Channel HBAs and InfiniBand HCAs come to mind) to provide asynchronous equivalents, and then have to re-merge those changes with every code drop you get from then on. Such scenarios occur often enough that any paradigm/framework that cannot accommodate the odd synchronous blocking call has to be considered immature or incomplete.

> in theory, it can be done as well but the cost far outweighs the benefits.

Really? Are you aware that it's necessary for just about any kind of enterprise-class equipment - e.g. a high-end router or disk array - to use such approaches in order to scale/perform as it does? Your computing experience, and even your experience as a consumer at any store owned by a company that uses such devices in their data center, would be quite different if they were implemented the way you seem to think they should be. The computing world does not consist only of PCs that are practically always I/O-bound except for games. Writing programs that only scale up to the limits of such machines and environments is a trivial exercise and it hardly matters what programming model you use. When you get into the big leagues where it does matter, though, you'll find that concurrency is just a universal fact of life that people must deal with.

> there are very few cases where you can even realistically argue that you need it, and so many advantages of going a different way by default.

Maybe you're not familiar with such cases, but you shouldn't therefore assume they don't exist or are of no consequence. They're out there, and they matter, and you owe it to yourself and everyone else to become familiar with them before you start telling other people how things should be done.

I like to eat my toast butter side down.

I never eat hard-boiled eggs with the little end pointing up. All little-endians are morons.

See? That's what I love about you guys: there's just as much heat and intelligence here as in the original thread, but here we've got two allusions to literature about pointless debates. And one's a kid's book, and one's a classic!
