Even among parallelizable problems, some are harder than others. The easy ones are the ones that don't involve threads sharing state with one another. For example, if you have five independent work queues, and the goal is simply to process all the tasks in all the queues, then there's nothing especially tough about this. You start five threads, run them against the five queues, and then move on once they're done.
The trickier problems are the ones that involve threads sharing state with one another. Let's see why.
Shared state presents a challenge in concurrent programming because threads can interfere with one another if their access to that shared state isn't properly coordinated. If you've ever tried to collaborate with other authors on a single document, you might have some experience with this problem. (Though Google Docs does a nice job of supporting multiple simultaneous authors.) You make some changes to the document, hoping that your coauthor didn't make his own changes to the part that you worked on. If he did, then there may be a mess to clean up.
The essence of the problem is that at least one guy is updating the document while somebody else is either reading it or writing it. We talked about the case where you have two authors writing to the document, but it can be one guy writing and someone else reading. If one person is in the middle of updating emergency information for hurricane evacuees and another person reads a partial update, the result could be dangerous.
Note that there's no problem if you have a bunch of people simply reading a document at the same time. The problem only comes up if at least one of them is writing to it.
The problem we've just described is called a race condition. The idea is that if threads aren't properly coordinated, then their concurrent work is vulnerable to problems associated with "unlucky timing". There can be situations in which one thread is trying to update shared state and one or more other threads are trying to access it (either read or write). In effect the threads "race" against one another and the outcome of their work is dependent on the nondeterministic outcome of their race.
Let's look at a code example.
Originally I was going to present the standard (and probably a bit tired) web page hit counter example. But then today I saw somebody's purportedly threadsafe class that is in fact susceptible to race conditions. The code was written by an experienced engineer. So let's use that example instead, shown in Listing 1 below, as it's instructive.
public class InMemoryUserDao implements UserDao {
private Map<String, User> users;
public InMemoryUserDao() {
users = Collections.synchronizedMap(new HashMap<String, User>());
}
public boolean userExists(String username) {
return users.containsKey(username);
}
public void createUser(User user) {
String username = user.getUsername();
if (userExists(username)) {
throw new DuplicateUserException();
}
users.put(username, user);
}
public User findUser(String username) {
User user = users.get(username);
if (user == null) {
throw new NoSuchUserException();
}
return user;
}
public void updateUser(User user) {
String username = user.getUsername();
if (!userExists(username)) {
throw new NoSuchUserException();
}
users.put(username, user);
}
public void removeUser(String username) {
if (!userExists(username)) {
throw new NoSuchUserException();
}
users.remove(username);
}
}
What do you think? See anything? The engineer claims that the class is threadsafe because we've wrapped our hashmap with a synchronized instance.
It turns out that that's not correct. It's a common misconception that as long as you're dealing with a threadsafe collection (like a synchronized map), you've taken care of any threadsafety issues you might have otherwise had. We'll use this example to show why this belief is just plain wrong.
Before I expose the race conditions, let me state some plausible assumptions about the intended semantics for the class:
createUser() with an user whose username
already exists in the map should generate
a DuplicateUserException.updateUser() with an user whose username does
not already exist should generate
a NoSuchUserException.removeUser() with an user whose username does
not already exist should generate
a NoSuchUserException.Now let's see how the implementation fails to support those semantics, and in particular how the implementation permits race conditions.
createUser() is broken. Here's a race that
shows that createUser() does not have the intended
semantics:
createUser(), with argument user U1
having username N. N is not already in the map.if test, but
not yet to the line that puts N in the map.createUser(), this time with argument user U2
having the same username N that U1 has. (You can imagine, for example,
two users registering with the same username at the same time.) N is
still not in the map, because thread T1 hasn't put it there yet.createUser() and
exits. User U2 and username N are now in the map.Poof! User U2 simply disappears from the map, having been
overwritten by user U1. T1 never encounters the
DuplicateUserException it's supposed to encounter.
updateUser() is broken. This one is a little more subtle than the
createUser() case. Here's the race:
updateUser() with user U having name
N. The user already exists in the map, so T1 passes
the if test successfully. It hasn't reached
the put() part yet.removeUser() with username N (the same one that T1
was using). The user U is removed from the map.The problem here is that the removeUser() method
completed successfully, there was no createUser() call,
and yet there's still a user sitting in the map with no error having
been thrown. Given the semantics described above, that should not be
possible. The only two possible outcomes ought to be: (1) the update
occurs before the remove, which would mean that when the two threads
complete, U is no longer in the map; or (2) the remove occurs before
the update, which would mean that U should be gone, and the update
should have generated an exception. In either case, U should not be in
the map when the two threads complete, and yet there it is. So if T2
belongs to an admin who thought he was removing a troublesome user,
guess again.
removeUser() is broken. We've already seen in the race above that
there are situations under which removeUser() would be expected to
remove a user but does not actually do that. Here's another race for you,
probably less significant, but still a bug:
removeUser() with existing username
N, and makes it past the if test.removeUser() with the same username N.In this case we have two methods that called
the removeUser() method with the same username, but
neither thread encountered a NoSuchUserException. That is
not compatible with the semantics of the class.
Now that we've seen some problems, let's try to understand what's happening here.
One thing that all three cases have in common is that they follow a
"check-then-act" pattern, where both the "check" and "act" parts
reference shared data (in this case, the
hashmap). The createUser() method checks whether the
username already exists, and if not, adds the user to the
map. The updateUser() method checks whether the username
already exists, and if so, updates the
user. removeUser() checks whether the username exists, and if so,
deletes the user.
Note that the findUser() method does not
follow the same pattern. It does a single get() against
the map, and then does a check that checks the returned user rather
than checking something against the map. Though it looks like almost
the same thing, it is really a completely different situation because
we're performing only one action against the shared data instead of
multiple actions.
It turns out that the difference between createUser(),
updateUser() and removeUser(), on the one hand, and
the findUser() method, on the other, is important from a threadsafety
point of view. Essentially what's happening is that the findUser()
method is atomic (with a certain qualification, which we'll discuss in
the next section), meaning that it executes as a single action. The
createUser(), updateUser() and removeUser()
methods, on the other hand, are not atomic. The various check and act operations
can be interleaved in arbitrary ways across different threads. Intuitively what
that means is that checks, which are supposed to provide safety, can become
invalid because other actions can be scheduled in between any given check-act pair.
The safety checks are invalidated and the class behaves in strange ways (especially
as load increases, which is usually the most inopportune time for such issues to
arise).
Now let's look at how we can solve race conditions such as the ones we've been considering.