Positive and negative reinforcement in psychology with examples

B.F. Skinner, one of the main theorists of behaviorism, defined reinforcement as a form of learning in which a behavior becomes associated with the consequences that follow it, thereby increasing the likelihood of its repetition. When a consequence makes the behavior less likely, we speak of punishment; when it makes the behavior more likely, we speak of reinforcement, for example support or praise. Within reinforcement, experts distinguish two types of consequences: positive and negative.

While positive reinforcement occurs when a behavior is followed by something pleasant, negative reinforcement consists in avoiding or withdrawing an aversive stimulus. Let's look at the main features of both procedures and talk about how you can use reinforcement in everyday life.

In this article:

  • What is positive reinforcement
  • Examples of positive reinforcement in the family
  • What is negative reinforcement
  • Primary reinforcers – satisfaction of basic needs
  • Secondary reinforcers – the reward is not immediate
  • Mixing different reinforcers
  • Unwanted positive reinforcers

What is positive reinforcement


In learning through positive reinforcement, the desired behavior is followed by pleasant consequences. The reinforcer does not have to be an object, let alone a material one.

Food, a caress, a smile, a kind word, or anything that produces pleasant emotions can act as a positive reinforcer in many contexts.

A mother who congratulates her young daughter every time she uses the toilet correctly promotes learning through positive reinforcement.

The same thing happens when a company gives a financial bonus to its most productive employees, and even gambling winnings can be viewed this way. In psychology, however, the term "positive reinforcer" refers to the pleasant consequence that follows the behavior, while positive reinforcement is the process by which the learner forms that association.

In technical terms, we can say that with positive reinforcement, there is a positive relationship between a particular response and a pleasant stimulus. Awareness of this situation motivates the subject to perform actions in order to receive a reward (or reinforcement).

Method 6: Reinforce behavior change

You reinforce every behavior other than the unwanted one. For example, a child asks you for an expensive gift that you are not going to buy, and you have already told him so. He whines and whines. You do not react to the whining (the extinction method), and it is important not to react in any way at all, not even by demonstrating your reluctance to discuss the topic. If, every time the whining starts, you keep repeating "I don't even want to discuss this with you!" or "How long can you whine? You can see I'm not responding to your requests!", the child sees perfectly well that you do react, and how! But as soon as the child starts talking about something else, respond to it quickly. It is important to reinforce the change of topic: notice it immediately and support it. Don't miss this moment.

Examples of positive reinforcement in the family

Positive reinforcement should be used in doses.

For example, there are many different situations in which parents praise their children. However, for the effects of reinforcement to remain meaningful, the child should not come to expect a reward for every little thing.

In the long run, clearing your own desk or throwing away your own trash should go without saying. At this early stage, however, that does not mean such actions should go unpraised.

See how positive reinforcement works in the family and how it can be implemented in different ways:

  • In the evening, the child clears the table, even if he is not asked to do so. As a direct consequence, he is allowed to stay awake 10 minutes longer.
  • Your child is cleaning his room. Then praise him and show him your joy.
  • If a school report from a teacher is positive, many parents reward their child with money or a toy.

If you want to use positive reinforcement to your advantage, make sure that the appropriate reward comes as soon as possible.

If too much time passes between the action and the reward, the connection is lost and the desired effect (repetition of the behavior) does not materialize.


Positive reinforcement method

Getting rid of unwanted behavior

Karen Pryor also writes about the process of unlearning, that is, getting rid of behavior you do not want. She gives 8 methods of unlearning. The first four of them are negative, and the last four are positive. As you might guess, the second half of the methods work better and produce more lasting results.

  1. Kill, delete, get rid of. Simply remove the source of the behavior or restrict it so that it physically cannot perform the unwanted action.
  2. Punishment. Put the child in a corner, hit the dog with a stick, deprive the programmer of a bonus.
  3. Negative reinforcement.
  4. Extinction. You pay no attention to the unwanted behavior and do not reinforce it in any way, neither negatively nor positively.
  5. Development of incompatible behavior. Develop a new behavior that is incompatible with the unwanted one.
  6. Put the behavior on cue: ensure that it occurs on a signal, and then gradually stop giving that signal.
  7. Shaping the absence: anything except the unwanted behavior is reinforced.
  8. Change of motivation. Determine why the unwanted behavior occurs and what need it serves, and try to replace its goal with a more appropriate one.

PS:

Karen Pryor writes mostly about animal training, but the same principles can be applied just as successfully in our everyday lives. While reading the book I noticed how well positive reinforcement worked for me personally. I can say that by mastering the science outlined in the book, you really can get +1 to communication, as promised on its cover.

What is negative reinforcement

Unlike what happens with positive reinforcement, with negative reinforcement, an instrumental response involves the disappearance of an aversive stimulus, that is, an object or situation that prompts the subject to run away or try not to contact it.

From a behavioral point of view, the reinforcement of this procedure is the disappearance or absence of the aversive stimulation. The concept "negative" refers to the fact that the reward is not the receipt of the stimulus, but the absence of it.

With negative reinforcement, a behavior is strengthened because it prevents an aversive stimulus from occurring. For example, a person suffering from agoraphobia deliberately avoids public transport so as not to experience an attack of fear.

Another form of such learning involves the disappearance of an aversive stimulus that remains present until the subject performs the required behavior.

It is like an annoying alarm clock that stops at the touch of a button, a mother who buys her child something so that he stops crying, or a painkiller taken to make pain go away.

Now let's talk about some nuances.

Developing (shaping) behavior

When the subject is already doing what is needed and you simply need to reinforce that behavior, everything is more or less clear. But what do you do when the desired behavior does not exist yet and there is seemingly nothing to reinforce?

Development consists of taking the slightest tendency toward the desired behavior and moving it, step by step, toward the goal. Break the final goal into a series of smaller, sequential goals. Find some behavior that is already happening now to serve as a first step. It often happens that the subject performs the desired task (or part of it) by accident; in that case, be sure to notice this behavior and reinforce it.

Below are the 10 rules of development, which the author examines in detail. A full description will not fit within the scope of this article, but you can get acquainted with them briefly.

  1. Raise the criterion little by little, so that the subject always has a realistic chance of doing what is required and receiving reinforcement.
  2. Practice one thing at a time. Don't try to work on several criteria at once.
  3. Before raising the criterion, make sure the current level is reliably reinforced.
  4. When introducing a new criterion, temporarily relax the old ones.
  5. Plan your training program so that you are always ready for a sudden leap of progress.
  6. Do not change trainers while developing a specific skill.
  7. If one way of working does not bring success, find another; there are many.
  8. Don't end a session without providing positive reinforcement. That is tantamount to punishment.
  9. If a skill deteriorates, quickly run through the entire previous learning process again, giving reinforcement at each step.
  10. End your session on a high note. The end of training should be joyful, not sad.

Primary reinforcers – satisfying basic needs

However, in practice, reinforcement is not so simple, because many issues are judged subjectively. A striking example is the belief that a baby will become "trained to being held" if a parent picks him up and hugs him at the first cry.

But it is important to remember: in the context of psychology, the main reinforcers are those that are directly focused on the needs of the person.

Hunger and thirst, as well as love and closeness, are the most important needs for babies and toddlers. Satisfying them should never be made conditional, so that children can develop the basic trust they need.

Positive and negative reinforcements can only be used as additional aspects beyond the usual degree of need satisfaction.

There's nothing wrong with after-dinner dessert, sweets, or a hug from your parents.

Method 5. Ensure that unwanted behavior occurs on cue.

And then, in the future, you stop giving that signal.

There is a parable about a wise old man who valued peace and quiet. A noisy group of children got into the habit of playing near his house. One day the old man came out to the children and gave them a coin, saying that he really liked listening to their cheerful shouting. The next day he gave them a coin again, and this went on for some time. Then the old man came out to the children and said that he had no more money for them. The children replied: "Do you think we are fools, to shout for you for free?" and left.

The child is noisy and rowdy? Invite him to make as much noise as possible with you on command, and do this a couple of times. First, it's fun and unusual. Second, such an activity takes a lot of energy and tires him out quite quickly. Then stop giving the command. Or the child makes a mess in the room and throws his things around: agree to make as much of a mess as possible in the room within 5 minutes. Perhaps the child never noticed his scattered things before; now he will. After he (perhaps with your help) restores order, do not give such commands anymore.

Yes, this requires a certain courage and imagination. Of course, raising children is a challenge and requires creativity.

Secondary reinforcers – the reward is not immediate

Unlike primary reinforcers, which satisfy a need directly, secondary reinforcers satisfy it only indirectly.

For example, the simplest means at this stage is money. If a person receives a certain amount of money for certain activities, he can later buy something for himself. Again, these could be basic needs: food or clothing.

In families, some parents also use a kind of token system. Positive behavior is marked with a star. If a certain number of stars are collected, the child can choose something from the store.

For example, these could be simple things like eating ice cream after getting five stars or going to the zoo after getting 25 stars.

Method 7: Changing Motivation

This is the best method, but also the most difficult. A change in motivation means that the child no longer wants to do what you consider bad, or wants to do what you consider good. How it works: the child's behavior is related to his needs.

Imagine that your child is irritated and talks to you rudely and boorishly.

And this happens, for example, because he is tired and has not had enough sleep. Help him organize a proper routine, and the irritation will disappear. If his rudeness is due to a lack of self-confidence and an attempt to take it out on you, find ways to strengthen his belief in himself. Or maybe he is being rude because he is upset about a fight with his friends. Support him, show him that you understand his feelings, but don't pester him with advice. This way you will help him cope with his grief much better.

Mixing different reinforcers

Many different types of reinforcement are used to facilitate operant learning. They cannot always be classified into a clear category: they are neither negative nor positive.

In general, however, there are three different types of reinforcers:

  • Material reinforcers.
  • Social reinforcers: words of praise and recognition. Sometimes a reassuring smile or a friendly nod is enough.



  • Activity reinforcers: for example, a visit to the zoo, a movie evening together, or a concert.

It is better to avoid material incentives as much as possible.

Reinforcement learning for beginners

This section examines how the machine learning method known as reinforcement learning works, using a physical system as an example. The algorithm for finding the optimal strategy is implemented in Python using the Q-learning method.

Reinforcement learning is a machine learning method in which a model knows nothing about the system but is able to perform actions in it. An action moves the system into a new state, and the model receives a reward from the system. Let's look at how the method works using the example shown in the video. The video description contains Arduino code, which we reimplement here in Python.
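Before looking at the cart itself, it may help to see this interaction loop in code. The sketch below only illustrates the general idea under my own assumptions: the Environment class, its step() method, and the toy transition and reward rules are hypothetical placeholders, not the code from the video.

import random

class Environment:
    """A toy environment: states and rewards here are invented for illustration only."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # apply an action, move to a new state, and return (new_state, reward)
        self.state = (self.state + action) % 4      # toy transition rule
        reward = 1 if self.state == 3 else 0        # toy reward rule
        return self.state, reward

env = Environment()
state = env.state
for t in range(10):
    action = random.choice([0, 1, 2, 3])   # the model knows nothing, so at first it acts at random
    state, reward = env.step(action)
    print(t, action, state, reward)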

Task

Using the reinforcement learning method, we need to teach the cart to move as far away from the wall as possible. The reward is the change in the distance from the wall to the cart during a move. The distance D from the wall is measured with a range finder. Movement in this example is possible only through a specific displacement of the "drive", which consists of two booms, S1 and S2. The booms are two servos with guides connected in the form of an "elbow". Each servo in this example can rotate through 6 fixed angles. The model can perform 4 actions, which control the two servos: actions 0 and 1 rotate the first servo by one step clockwise or counterclockwise, and actions 2 and 3 rotate the second servo by one step clockwise or counterclockwise. Figure 1 shows a working prototype of the cart.


Fig. 1. Cart prototype for machine learning experiments

In Figure 2, boom S2 is highlighted in red, boom S1 is highlighted in blue, and 2 servos are highlighted in black.


Fig. 2. The drive system

The system diagram is shown in Figure 3. The distance to the wall is indicated by D, the range finder is shown in yellow, and the system drive is highlighted in red and black.


Fig. 3. System diagram

The range of possible positions for S1 and S2 is shown in Figure 4:


Fig. 4.a. Range of positions of boom S1


Fig. 4.b. Range of positions of boom S2

The limit positions of the actuator are shown in Figure 5:

When S1 = S2 = 5, the distance from the ground is at its maximum; when S1 = S2 = 0, it is at its minimum.


Fig. 5. Limit positions of booms S1 and S2

The "drive" has 4 degrees of freedom. An action changes the position of booms S1 and S2 in space according to a fixed rule. The types of actions are shown in Figure 6.


Fig. 6. Types of actions in the system

Action 0 increases the value of S1. Action 1 decreases the value of S1. Action 2 increases the value of S2. Action 3 decreases the value of S2.
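To make the state and action space concrete, here is a small sketch of how the system described above can be represented in Python. The names (N_POSITIONS, ACTIONS, apply_action) are my own illustrative choices, not identifiers from the project's actual code.

N_POSITIONS = 6   # each servo can take 6 discrete positions: 0..5
ACTIONS = {
    0: ("S1", +1),   # action 0 increases S1
    1: ("S1", -1),   # action 1 decreases S1
    2: ("S2", +1),   # action 2 increases S2
    3: ("S2", -1),   # action 3 decreases S2
}

def apply_action(state, action):
    """Return the new (S1, S2) pair after an action; stay in place if the move is out of range."""
    s1, s2 = state
    boom, delta = ACTIONS[action]
    if boom == "S1" and 0 <= s1 + delta < N_POSITIONS:
        return (s1 + delta, s2)
    if boom == "S2" and 0 <= s2 + delta < N_POSITIONS:
        return (s1, s2 + delta)
    return (s1, s2)   # physically impossible move

print(apply_action((0, 1), 3))   # (0, 0): action 3 lowers boom S2 by one step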

Movement

In our problem, the cart is set in motion in only two cases. In the position S1 = 0, S2 = 1, action 3 pushes the cart away from the wall, and the system receives a positive reward equal to the change in the distance to the wall; in our example, the reward is 1.


Fig. 7. System movement with positive reward

In the position S1 = 0, S2 = 0, action 2 moves the cart towards the wall, and the system receives a negative reward equal to the change in the distance to the wall; in our example, the reward is -1.


Fig. 8. System movement with negative reward

In all other states, any action of the "drive" leaves the cart standing still and the reward is 0. Note that the stable dynamic behavior of the system is the sequence of actions 0-2-1-3 starting from the state S1 = S2 = 0, in which the cart moves in the positive direction with the minimum number of actions: elbow raised, elbow straightened, elbow lowered, elbow bent, and the cart has moved forward; then repeat. Thus, the machine learning method has to find a specific sequence of actions whose reward is not received immediately: actions 0-2-1 bring a reward of 0, but they are necessary to receive the reward of 1 for the subsequent action 3.
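The two movement cases above can be captured in a small reward function. This is a hypothetical helper written only for illustration; the project itself emulates the physics with other functions (setPhysicalState and getDeltaDistanceRolled, described below).

def reward(state, action):
    """Return +1, -1 or 0 for a transition, following the two movement cases described above."""
    if state == (0, 1) and action == 3:
        return 1    # the elbow pushes off and the cart rolls away from the wall
    if state == (0, 0) and action == 2:
        return -1   # the elbow drags the cart back towards the wall
    return 0        # in every other case the cart does not move

assert reward((0, 1), 3) == 1
assert reward((0, 0), 2) == -1
assert reward((3, 4), 0) == 0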

Q-Learning method

The basis of the Q-learning method is the weight matrix of system states. Matrix Q is the collection of all possible states of the system together with the weights of the system's response to the various actions. In this problem, the number of possible combinations of system parameters is 6^2 = 36. In each of the 36 states, 4 different actions can be performed (Action = 0, 1, 2, 3). Figure 9 shows the initial state of matrix Q. Column zero contains the row index, the next column the value of S1, the following one the value of S2, and the last 4 columns hold the weights for actions 0, 1, 2 and 3. Each row represents a unique state of the system. On initialization, all weight values are set to 10.


Fig. 9. Initialization of matrix Q
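As a sketch, such a table can be built with pandas, which the training code below also appears to use (judging by the Q.iloc[...] calls); the exact layout and the state_index helper, with one row per state and one column per action, are my assumption rather than the project's actual structure.

import pandas as pd

n_positions = 6    # discrete angles per servo
n_actions = 4
initial_weight = 10.0

# one row per state, one column per action, every weight starts at 10
Q = pd.DataFrame(initial_weight,
                 index=range(n_positions ** 2),
                 columns=range(n_actions))

def state_index(s1, s2):
    """Map the pair (S1, S2) to the row index of the Q table (S1*6 + S2, as in the walkthrough below)."""
    return s1 * n_positions + s2

print(Q.shape)                        # (36, 4)
print(Q.iloc[state_index(0, 1), 3])   # weight of action 3 in state S1=0, S2=1 -> 10.0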

After training the model (~15000 iterations), the matrix Q has the form shown in Figure 10.


Fig. 10. Matrix Q after 15000 training iterations

Note that actions whose weights are still equal to 10 are impossible in the system, which is why those weights never changed. For example, in the extreme position S1 = S2 = 0, actions 1 and 3 cannot be performed; this is a limitation of the physical environment. These boundary actions are forbidden in our model, so the algorithm never uses them.

Let's consider the result of the algorithm (the last training iterations):

Iteration: 14991, was: S1=0 S2=0, action=0, now: S1=1 S2=0, prize: 0
Iteration: 14992, was: S1=1 S2=0, action=2, now: S1=1 S2=1, prize: 0
Iteration: 14993, was: S1=1 S2=1, action=1, now: S1=0 S2=1, prize: 0
Iteration: 14994, was: S1=0 S2=1, action=3, now: S1=0 S2=0, prize: 1
Iteration: 14995, was: S1=0 S2=0, action=0, now: S1=1 S2=0, prize: 0
Iteration: 14996, was: S1=1 S2=0, action=2, now: S1=1 S2=1, prize: 0
Iteration: 14997, was: S1=1 S2=1, action=1, now: S1=0 S2=1, prize: 0
Iteration: 14998, was: S1=0 S2=1, action=3, now: S1=0 S2=0, prize: 1
Iteration: 14999, was: S1=0 S2=0, action=0, now: S1=1 S2=0, prize: 0

Let's take a closer look, using iteration 14991 as the current state.

  1. The current state of the system is S1=S2=0, which corresponds to the row with index 0. The largest weight is 0.617 (we ignore the values equal to 10, as described above), and it corresponds to Action = 0. So, according to matrix Q, in the state S1=S2=0 we perform action 0, which increases the rotation angle of servo S1 (S1 = 1).
  2. The next state, S1=1, S2=0, corresponds to the row with index 6. The maximum weight corresponds to Action = 2. We perform action 2 and increase S2 (S2 = 1).
  3. The next state, S1=1, S2=1, corresponds to the row with index 7. The maximum weight corresponds to Action = 1. We perform action 1 and decrease S1 (S1 = 0).
  4. The next state, S1=0, S2=1, corresponds to the row with index 1. The maximum weight corresponds to Action = 3. We perform action 3 and decrease S2 (S2 = 0).
  5. As a result, we have returned to the state S1=S2=0 and earned 1 reward point.
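The same walk can be read out of the trained table programmatically. The helper below is a hypothetical sketch that assumes the Q layout from the earlier sketch: one row per state, one column per action, with impossible actions left at the initial weight of 10.

def greedy_cycle(Q, start=(0, 0), steps=4, n_positions=6, initial_weight=10.0):
    """Follow the best-weighted action from each state and return the visited states."""
    moves = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}   # effect of each action on (S1, S2)
    state = start
    path = [state]
    for _ in range(steps):
        s = state[0] * n_positions + state[1]                # row index of the current state
        weights = Q.iloc[s]
        # skip actions whose weight is still the initial value: they are physically impossible
        best = max((a for a in range(4) if weights[a] != initial_weight),
                   key=lambda a: weights[a])
        d1, d2 = moves[best]
        state = (state[0] + d1, state[1] + d2)
        path.append(state)
    return path

# After training this should reproduce the cycle (0,0) -> (1,0) -> (1,1) -> (0,1) -> (0,0).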

Figure 11 shows the principle of choosing the optimal action.


Fig. 11.a. Matrix Q


Fig. 11.b. Matrix Q

Let's take a closer look at the learning process.
Q-learning algorithm
minus = 0
plus = 0
initializeQ()
for t in range(1, 15000):
    epsilon = math.exp(-float(t) / explorationConst)
    s01 = s1
    s02 = s2
    current_action = getAction()
    setSPrime(current_action)
    setPhysicalState(current_action)
    r = getDeltaDistanceRolled()
    lookAheadValue = getLookAhead()
    sample = r + gamma * lookAheadValue
    if t > 14900:
        # print the last iterations of training
        print('Time: %d, was: %d %d, action: %d, now: %d %d, prize: %d'
              % (t, s01, s02, current_action, s1, s2, r))
    # Q-learning update of the weight of the chosen action in the current state
    Q.iloc[s, current_action] = Q.iloc[s, current_action] + alpha * (sample - Q.iloc[s, current_action])
    s = sPrime
    if deltaDistance == 1:
        plus += 1
    if deltaDistance == -1:
        minus += 1
print(minus, plus)

Full code on GitHub.

We set the initial position of the elbow to the highest point: s1 = s2 = 5.

We initialize the matrix Q by filling it with the initial value: initializeQ().

The epsilon parameter is the weight of "randomness" in the algorithm's choice of action. The more training iterations have passed, the less often a random action is chosen: epsilon = math.exp(-float(t)/explorationConst). For the first iteration, epsilon = 0.996672.

We save the current state: s01 = s1; s02 = s2.

We get the "best" action: current_action = getAction(). Let's take a closer look at this function.

The getAction() function returns the action with the maximum weight for the current state of the system: the row of matrix Q corresponding to the current state is taken, and the action with the largest weight is selected. Note that the function also implements a random action selection mechanism, and the probability of a random choice decreases as the number of iterations grows. This is done so that the algorithm does not get stuck on the first option it finds and can try a different path, which may turn out to be better.
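The actual implementation is in the project's GitHub code; purely as an illustration, an epsilon-greedy version of such a function could look like the sketch below. The name get_action, the legality checks, and the Q layout are my assumptions, carried over from the earlier sketches.

import random

def get_action(Q, state, epsilon, n_positions=6):
    """With probability epsilon pick a random legal action, otherwise the best-weighted one."""
    s1, s2 = state
    s = s1 * n_positions + s2            # row index of the current state in the Q table
    legal = []
    if s1 < n_positions - 1:
        legal.append(0)                  # action 0: increase S1
    if s1 > 0:
        legal.append(1)                  # action 1: decrease S1
    if s2 < n_positions - 1:
        legal.append(2)                  # action 2: increase S2
    if s2 > 0:
        legal.append(3)                  # action 3: decrease S2
    if random.random() < epsilon:
        return random.choice(legal)      # explore: random legal action
    return max(legal, key=lambda a: Q.iloc[s, a])   # exploit: best-weighted legal action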

In the initial position of the booms, only two actions, 1 and 3, are possible; the algorithm chose action 1. Next, we determine the row number in matrix Q for the next state of the system, i.e. the state the system will move to after performing the action obtained in the previous step.

This is done by setSPrime(current_action). In a real physical environment we would receive the reward right after performing the action, if movement followed; but since the movement of the cart is simulated here, auxiliary functions are needed to emulate the reaction of the physical environment to an action (setPhysicalState and getDeltaDistanceRolled()). We execute them:

setPhysicalState(current_action) simulates the reaction of the environment to the chosen action: the servo positions change and the cart moves (or does not).

r = getDeltaDistanceRolled() computes the reward: the distance traveled by the cart.

After executing an action, we need to update the weight of that action in the Q matrix for the corresponding system state. It is logical that if the action led to a positive reward, the weight should decrease less than it would after a negative reward. Now the most interesting part: to calculate the weight of the current step, we look into the future. When choosing the optimal action in the current state, we pick the largest weight in the Q matrix; and since we know the new state the system has moved into, we can find the maximum weight in the Q table for that state:

lookAheadValue = getLookAhead() returns this maximum; at the very beginning it is equal to 10. So we use the weight of an action that has not yet been performed to calculate the weight of the current one:

sample = r + gamma*lookAheadValue, which here gives sample = 7.5.

Q.iloc[s, current_action] = Q.iloc[s, current_action] + alpha*(sample - Q.iloc[s, current_action]), which gives Q.iloc[s, current_action] = 9.75.

In other words, we used the weight of the next step to compute the weight of the current step. The greater the weight of the next step, the less we reduce the weight of the current one (according to the formula), and the more preferable the current step will be next time. This simple trick gives the algorithm good convergence.
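The numbers above are consistent with, for example, gamma = 0.75 and alpha = 0.1; these hyperparameter values are my inference from the arithmetic, not something stated in the text. A short check:

# reproducing the arithmetic of the update above with assumed hyperparameters
gamma = 0.75
alpha = 0.1

r = 0                   # the first actions bring no reward
lookAheadValue = 10.0   # every weight still equals the initial value
q_old = 10.0

sample = r + gamma * lookAheadValue          # 0 + 0.75 * 10 = 7.5
q_new = q_old + alpha * (sample - q_old)     # 10 + 0.1 * (7.5 - 10) = 9.75
print(sample, q_new)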

Scaling the algorithm

This algorithm can be extended to a larger number of degrees of freedom of the system (s_features) and a larger number of values that each degree of freedom can take (s_states), but only within fairly narrow limits: the Q matrix quickly eats up all the available RAM. Below is an example of code that builds the summary matrix of model states and weights. With s_features = 5 "booms" and s_states = 10 different boom positions, the matrix Q has dimensions (100000, 9).

Increasing the degrees of freedom of the system

import numpy as np

s_features = 5
s_states = 10
numActions = 4

data = np.empty((s_states**s_features, s_features + numActions), dtype='int')

# enumerate every combination of boom positions, one column per boom
for h in range(0, s_features):
    k = 0
    N = s_states**(s_features - 1 - 1*h)
    for q in range(0, s_states**h):
        for i in range(0, s_states):
            for j in range(0, N):
                data[k, h] = i
                k += 1

# the last numActions columns hold the initial action weights
for i in range(s_states**s_features):
    for j in range(numActions):
        data[i, j + s_features] = 10.0

data.shape  # (100000, 9)

Conclusion

This simple method shows the "miracles" of machine learning: a model that knows nothing about its environment learns to find the optimal behavior, the one in which the reward for its actions is maximal, even though the reward is given not for any single action but for a whole sequence of actions.
Thank you for your attention!

Pros of Operant Training for Dogs

As you can see, within the operant method, the central and active part of learning is the dog itself. In the process of learning this method, the dog has the opportunity to draw conclusions, control the situation and manage it.

A very important "bonus" of the operant training method is its "side effect": dogs that are used to being active participants in the training process become more proactive and self-confident (they know that, in the end, everything works out for them; they rule the world, can move mountains and turn back rivers), and their self-control and ability to work under frustrating conditions grow. They know: even if it doesn't work out right now, that's okay; stay calm and keep trying, and a reward awaits.

A skill that is mastered by the operant method tends to be consolidated faster than a skill that is practiced mechanically. That's what the statistics say.

Now I work only with soft methods, but my previous dog was trained using contrast (the carrot-and-stick method) and mechanics. And I'll be honest: it seems to me that positive reinforcement, when we actively encourage the right behavior and ignore (and try to prevent) the wrong one, gives a stable result a little later than the mechanical approach. But I am wholeheartedly for working with soft methods, because the operant method is not just training; it is an integral system of interaction, the philosophy of our relationship with a dog that is our friend and, often, a full member of the family.

I prefer to work with the dog a little longer, but in the end get a pet that is gushing with energy, ideas and a sense of humor, and has retained its charisma. A pet with whom the relationship was built on love, respect, desire and interest in working with me. A pet who trusts me unconditionally and who is eager to work with me. Because it is interesting and fun for him to work, it is interesting and fun for him to obey.

Read more: Shaping as a method of training dogs.
