by Thomas Sowell

Economics is the study of scarce resources with alternative uses.

Scarcity

- quantityOf(x) < desireFor(x)
- more people want it than there is available
- examples: time, diamonds, beachfront property, labor, anything sold at a price

There are many ways to allocate scarce resources (price, appointment, random, timeshare, ...). This book advocates allocating scarce resources through *market prices*. The book is not against taxes or helping the poor, but *is* against manipulating prices of products, services, or wages to "help" anything.

Market Prices - where prices are free to fluctuate with supply and demand. As supply increases, prices tend to drop. As demand increases, prices tend to rise. Market prices communicate a complex system of supply and demand.
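The mechanics here can be sketched numerically. This is just a toy model to make the idea concrete; the linear curves and every coefficient in them are made up for illustration:

```python
# Toy linear model: quantity demanded falls as price rises,
# quantity supplied rises as price rises. (Hypothetical curves.)
def demanded(price):
    return 100 - 10 * price

def supplied(price):
    return 20 * price

# A free market price moves toward the point where the curves meet:
# 100 - 10p = 20p  ->  p = 100/30
equilibrium_price = 100 / 30

# At that price, the amount buyers want equals the amount producers make.
print(round(equilibrium_price, 2))   # ~3.33
print(demanded(equilibrium_price))   # ~66.7 units demanded
print(supplied(equilibrium_price))   # ~66.7 units supplied
```

The point of the sketch: the price itself is the signal that balances what buyers want against what producers make, without anyone computing either curve explicitly.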

Using market prices to allocate scarce resources has the following primary benefits:

- Higher standards of living
- Producers want to make money, so they compete, making products cheaper to produce and more attractive (better) to buyers. The end result is that we have more, higher quality things on the market for less money. We become more efficient at production and create better products and services.
- An alternative method for setting prices, price control, results in surplus (waste) and shortage. When the government sets prices higher than market prices would be, there is a surplus because fewer people buy at the higher price and, if the government buys the excess, more people produce. When the government sets prices lower than market prices would be, there is a shortage because more people buy at the lower price and there is less incentive to produce.

- Resources tend to go to those with the highest need--those with higher needs are willing to pay more and work harder to afford what they need.

This book rests on the premise that people act on personal incentives--that people are by nature at least somewhat concerned for their own well-being. When you manipulate people's incentives, you manipulate their actions. If you manipulate incentives in the economy, you manipulate the economy.

- economics
- study of the use of scarce resources with alternative uses
- independent of values (social, moral, material)... there are always scarce resources (eg time) that we must make tradeoffs with
- There will always be tradeoffs. If you don't make one, circumstances will force one for you.

- price gouging
- people believe that companies charge too high a price for something whose "actual value" is much less
- example: store raises prices of bottled water significantly during local emergency
- there is no "actual value" of things. There is an actual cost of production/labor, but that is something different
- people conduct transactions/trades because to each party, what they receive is worth more than what they relinquish
- value is subjective
- a benefit of price gouging: by increasing prices to match demand, people will buy only what they need and not more, so there is more left for others in need
- artificially low prices (prices that communicate demand is much lower than it actually is) cause people to buy more
- aka: people buy more at lower prices than higher prices
- example: it costs more to buy groceries in a lower income neighborhood than in a higher income neighborhood
- people may think it's "price gouging" and may demand price control
- in reality there are lots of factors at play
- example factor 1: people in higher income neighborhoods tend to buy more per visit. Since they buy in higher quantity, the store has lower costs per product and per cashier. Stores compete, one store lowers prices to provide more incentives to shoppers, and the rest follow or fail.
- example factor 2: in lower income neighborhoods, there tends to be more crime--more stolen goods. The store must raise prices to be able to cover those stolen goods.
- if you price control and force stores to sell at the same prices in lower income markets as in higher income markets, the net effect is that stores leave the lower income markets--they are no longer profitable, and margins are thin--and operate only in higher income markets

- subjective value
- there is no objective value
- if productA was worth $5 to both the seller and the buyer, there'd be no reason to make a transaction.
- value is subjective. productA can be worth more to the buyer than the seller, and so they can agree on a price and make a transaction that is mutually beneficial--where both parties make a profit.

- Lingering historical economic thoughts
- mercantilism
- export (outside of country) more than you import
- not actually indicative of economic wealth
- you can be at a deficit and still have a higher standard of living and growth

- incremental value
- water is not absolutely more valuable than diamonds, diamonds are not absolutely more valuable than water
- If you have no water, a little water may be worth more than a diamond, but if you have plenty of water and no diamonds, a diamond may be worth more than a lot of water.
- having 20 years' worth of band-aids for your family is not necessarily better than 1 movie
- health is not categorically better than entertainment... incremental value
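One way to picture incremental value is diminishing marginal utility: each additional unit of a good is worth less to you than the one before it. A tiny sketch with made-up utility numbers (the formula is purely illustrative):

```python
# Hypothetical diminishing marginal value: the more of a good you
# already have, the less the next unit is worth to you.
def marginal_value(units_already_owned):
    return 100 / (1 + units_already_owned)

# The first liter of water is precious; the 1000th is nearly worthless.
print(marginal_value(0))     # 100.0
print(marginal_value(999))   # 0.1
```

This is why water vs. diamonds has no absolute answer: the value of the *next* unit depends entirely on how much you already have.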

- economies of scale
- financial intermediaries "middle men"
- incentives
- property rights create incentives

- what gov't / community law (eg HOA) is good at:
- mud flaps
- mandating that all trucks have mudflaps to protect cars behind them
- without them there would be damage to other vehicles
- there's no incentive for trucks to add mud flaps without gov't intervention, since putting mud flaps on protects other vehicles, not the trucks that have them

- clean air, water
- keeping air and water clean is costly, with little direct benefit to those who produce the most pollution
- the clean air and water has benefits to other people, to the community
- creating gov't mandates/incentives benefits the community

- standardized train tracks
- having standardized train tracks keeps costs lower when selling across companies or producing vehicles
- there's a direct incentive to railroad owners to standardize, no need for gov't intervention

- military defense
- The benefit of military defense is for everyone. Physical safety as well as safety which makes it possible to make investments for long term growth.

- clean mall
- The clean mall benefits all store owners for attracting and keeping more customers.

- HOA
- Well maintained and nice looking houses, + safe community raises value of entire community

- in general, gov't law / community rules are more useful than individual incentives alone when:
- external benefits (to others)
- there are benefits to people other than the person who is capable of creating the benefits (eg mud flaps)

- universal benefits
- there is something which is beneficial to everyone

- cost to govt not equal to cost to economy
- it may cost govt lots to put and keep criminals behind bars, but it can save the economy lots more than it costs

- absolute advantage vs comparative advantage
- "If it saves just one life, then it's worth it"
- Sounds noble, but it's not true.
- Lives do have a monetary value.
- Consider this: spending 1 billion dollars to save one life...
- ....versus spending all that money on feeding the thousands who are starving and dying of disease every day.
- Should that billion dollars be spent to save just that one life, when alternatively it could save tens to hundreds of thousands of lives?
- Should we extend the life of a terminally ill person by a month instead of increasing the quality of life of thousands? Incremental value...
- economics is about scarce resources with alternative uses. There is no blanket statement like "saving one life is always worth it" that is true or "clean water is always worth it". There's only incremental value in spending a bit more or less resources, not absolute value. There's always tradeoffs.

- price control
- allows lower priority users to preempt higher priority users
- from a political standpoint, price control makes sense: the words sound good and it sounds like "good intentions"
- from an economical standpoint, price control does not have the intended consequence. It hurts both the producers and consumers in the economy.
- Free prices communicate supply and demand. There are consequences if a price does not reflect supply and demand....:
- price ceilings
- don't allow prices to rise above a certain amount
- desired net effect: allow more people with lower incomes to have the product
- actual net effect: fewer people of lower and medium incomes can have the product because there is less of the product
- effect: more people buy at lower price, even if they don't really need
- effect: less incentive for suppliers to produce more, less of the product
- example: rent control--prices may not rise above $X. Some people with higher incomes may decide to keep renting apartments that are bigger than they need, because rent control caps the cost at something they can afford. Some people with lower incomes who need a bigger apartment, and are willing to sacrifice a larger percentage of their income for one, can't find the larger apartments they need because they are all taken by people who would make do with less if prices were higher. The problem is exacerbated by the fact that with rent ceilings there is less incentive to build new apartments (less profit to be made), so fewer apartments get built. And landlords cut corners in caring for their buildings to keep the apartments economically feasible.

- price floors
- don't allow prices to fall below a certain amount
- desired net effect: protect producers: allow producers to stay in business by selling at a profitable amount
- actual net effect:
- if the product is freely bought in the market: less demand, less of the product used
- if the product is forcibly bought at the artificially high price: waste
- eg product is bought through taxes
- example: food subsidies--milk. Gov't pays subsidies for milk so farmers can stay in business; milk is bought from farmers at a higher price than it sells for in the market. Now there are incentives to continue producing milk, even though milk is not profitable at market prices. Producers produce more milk, and more producers enter the business to make money. We throw away X million gallons of milk a year, and pay for it through our taxes.
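The ceiling/floor effects can be sketched with a toy linear supply-and-demand model (the curves, the cap, and the floor are all made-up numbers for illustration):

```python
# Hypothetical linear curves: demand falls and supply rises with price.
def demanded(price):
    return 100 - 10 * price

def supplied(price):
    return 20 * price

# The curves meet at p = 100/30 (~3.33), the market equilibrium.
ceiling = 2.0   # price held below equilibrium
floor = 5.0     # price held above equilibrium

# Price ceiling: buyers want more than producers will make -> shortage.
print(demanded(ceiling) - supplied(ceiling))   # 40.0 units short

# Price floor: producers make more than buyers will take -> surplus.
print(supplied(floor) - demanded(floor))       # 50.0 units of surplus
```

Either way the controlled price stops communicating real supply and demand, and the gap between the two curves is the shortage or the waste.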

- "unmet needs"
- we're talking about economics
- there will always be "unmet needs" when there are scarce resources with alternative uses. "unmet needs" is the result of the resource being scarce
- how do we want to allocate those scarce resources? By lottery? By first come first serve? By price? Price ends up doing a pretty good job of matching resource with highest need, gives incentives to produce more, and increases standard of living.
- the question of what to do with scarce resources is one of *tradeoffs*, not *solutions*. Scarce resources means that everyone can't have the resource; we must make a tradeoff somewhere.

- systemic causation
- lots of things interacting to cause end result

Within the last few months, I've changed my perspective about where I want to take my career.

**Previously** my goal was to work as a data scientist. I think data, graphs, predictions, and understanding things through data is really powerful and interesting. I enjoy reading about inference, using tools to make inferences, and communicating conclusions. The whole field is practically useful and technologically intriguing to me.

I was driven to work as a data scientist. I read several books and papers, I practiced using different data sets, and I received advice from senior scientists in the field. I worked hard and put in a lot of time learning and practicing so I could work as a data scientist.

But I had a few major problems with my path toward data science.

- It was time consuming and stressful. I spent a lot of time outside of work studying. If I couldn't spend an hour a day learning data science, I felt like I wasn't making progress toward my future. It made me feel bad if I didn't make progress. This became an issue--cutting into the time I spent with my fiance, friends, and family.
- I felt like my work at Amazon wasn't contributing toward my future. I spent all day working as a software engineer, but the skills I was building were only somewhat helpful to my future as a data scientist.
- I was studying toward a future as an analyst instead of a builder, and I had mixed feelings about leaving software for analysis. I enjoy both, and I enjoy building things a lot. In contrast to point #2, I was spending several hours a week learning and stressing myself out so I could leave writing software to become an analyst, when I wasn't completely sure I wanted to leave building for analyzing.

**Where software engineering shines**: I get mostly practical and purpose-related value from writing software. When I automate, I feel like I am contributing to the world by reducing mundane work. I feel like I am giving people time to do what they really find important. When I build tools that increase human abilities (for instance increasing my memory with my journaling software) I empower people to lead richer lives and be more effective at what they are passionate about.

**Where data science shines**: For one, I feel the joy like a child as I learn about cool statistics tools^{1}. I am curious and find joy in learning and reading about useful inference methods. More importantly, data science helps us understand and derive direction in ambiguous situations. It helps us answer difficult questions, and make good decisions. Data science is powerful because it helps us understand what we should do--it helps us understand what is important and where we should spend our resources^{2}.

**Ultimately,** I think I would be happy doing either--both have their perks. Both data science and software engineering work together to make positive impacts. What really influenced my plan above all else was my value for my relationships and time. I don't want to spend the next several years working hard just to start over at the bottom of the data science ladder if I will be equally happy building software. I especially don't want to do this if it means I'm sacrificing time with my fiance. Since I'm further ahead in Software Engineering, I'll just embrace it and make the best of it.

Robotics is an appealing future career that I've started working towards. It's not the only possibility, but it's the best one I've found yet. In particular, I have been thinking about what kind of robots could save people time. I'm passionate about this since time is something I always feel short on. Saving people an hour a day could make a huge positive impact in their lives.

To get there, my plan is to:

- Stay in my current role at Amazon for several years. I have a nurturing team and manager, and I'm working on an awesome project. I'm learning a ton: from gathering requirements and communicating with customers to designing, writing, and testing maintainable software.
- Practice robotics a few hours a week. Unlike studying for data science, robotics feels much more like a hobby than night school. If I don't make progress for a week, that's okay. I don't feel too bad since my work in my current job is building me for my future. I am learning about motion and depth perception currently, and aim to have a working demonstration to put on my resume in a year or two.

^{1} However, I suspect this joy wouldn't last forever. A firework isn't as marvelous the 100th time you see one explode. Powerful statistical methods will also lose their novelty and luster after I've used them so many times. In fact, I used to get the same child-like joy when writing software. Now, as I write more and more of it, I see it more as a useful tool than a hobby.

^{2} Data science doesn't have a monopoly on influencing decisions. Data science is a way of turning lots of data into context for making decisions. Software engineering also involves gathering relevant information that impacts decision making. You must understand your customers and their needs so you can deliver the product that is most valuable to them. Part of this is good communication, and part of this can be big statistical number crunching with models and inference (data science).

The purpose of this book is to help people and organizations achieve those goals they sincerely desire but have not been able to achieve.

*Note: see the "Summary of Summaries" section below. Unlike most books I blog about, I gathered this information from reading others' notes rather than reading the book myself.*

The ITC method in a nutshell: Immunity to change is caused by internal conflict--when you have beliefs that oppose your goals. Reflect to find your hidden beliefs that oppose your goal. Resolve the internal conflict by (1) understanding your beliefs fully and (2) picking a side once you have all the data: change the beliefs or change the goal.

Kegan and Lahey say that there are three different levels of complexity of the mind:

- The Socialized Mind - People's behaviors are almost entirely results of direct external pressures. People are loyal to those who they identify with, and they understand the world through these group belongings.
- The Self-Authoring Mind - People have their own personal framework or agenda that guides how they understand the world and how they behave. They act and communicate to advance their own agenda.
- The Self-Transforming Mind - People realize and reflect on the limits of their own framework or agenda. They too act and communicate to advance their own agenda. However, they also seek to understand the limits of their own framework so that they can improve it.

*Note: In communication and software engineering, I see complexity as something to avoid. I seek to simplify my software so it is easy to understand and maintain. When I write simply and concisely, I allow a broader audience to understand me. I don't know why the authors chose the word "complexity" for their "levels" of the mind, but I'm not a fan.*

The highest maturity of the mind, then, is when we reflect on the limits of our own belief systems and agendas. This reflection and understanding approach is exactly what the authors advocate to eliminate immunities to change.

The ITC method:

- Identify goals and commitments you sincerely desire but have not achieved
- Identify the obstructive behaviors that work against your goal
- Identify the beliefs/commitments that lay the foundation for the obstructive behaviors

There's also a four-column exercise:

Column 1 - Write your commitment

Column 2 - List everything you are doing/not doing that works against your commitment

Column 3 - Write down what you think your competing commitment(s) might be

Column 4 - Write the underlying assumption you are making about why the competing commitment is important

Now that you've identified your inner conflict, you can determine how best to move forward. *At this point I'd dive deeper into each side of the conflict, examine the foundations, beliefs, and data, and probably be able to pick a side and change my perspective after the investigation*.

Goal | Counterproductive behaviors | Underlying competing commitments | Underlying beliefs |
---|---|---|---|

I want to get stronger--I want to be able to lift more weight and be thicker and more muscular. | I fast, do a lot of cardio, and keep my calories pretty low much of the time. I don't eat enough calories to allow myself to build the muscle I want. | I strive to stay lean. | I have a fear of letting myself get fat. I respect myself more, have higher confidence, and feel better when I'm lean and cardiovascularly healthy. However I also feel tired when I'm constantly consuming too few calories. I have discovered multiple times that my body just doesn't want to stay "cut". I can get cut from time to time, but from my experience, remaining cut means keeping calories low and feeling tired all day every day. When I lack energy because I'm eating too little, my work suffers, my relationships suffer, and other parts of my life suffer except my self-image relating to my lean physique. |

A different method which I believe may be more optimal: rotate between bulking and cutting. Spend time eating slightly over maintenance calories to grow, then spend time eating under maintenance (I enjoy keto and PSMF) to shred the fat off. Repeat. As long as I'm not cutting for too long, my energy stays high. It's when I'm riding below maintenance for weeks and months at a time that I begin getting sluggish. With this method I'll (1) gain muscle, (2) keep my body fat in the healthy-to-low range (but I won't stay very lean all the time), (3) have high energy, and (4) enjoy fasting and small periods of lower calories from time to time. Win-win: all I have to give up is being lean 24/7, and I can have all these benefits!

I attended a gender diversity conference at Amazon last week, and the speaker of my favorite talk recommended this book (and a few others, which I'll likely read soon). The speaker spoke intelligently, clearly, and persuasively about how to be persuasive. She gave clear reasoning for her beliefs and amazed me with her ability to take differing standpoints on issues depending on the situation. She gave one member of the audience two very different pieces of advice: first, here's something you can do to improve the situation; and second, that the questioner essentially had the wrong perspective, enlightening the questioner about the other facts surrounding the issue. I was impressed by the speaker's fidelity to the data and her lack of interest in pleasing other people. She earned my respect pretty quickly, so I wanted to read a few books from her list in the hope that I can learn to be wiser and more effective like her.

This post is different from my other "book" posts because I didn't actually read this book. I started to read it but had a very difficult time focusing on what the authors were saying. I picked up next to nothing by the time the second chapter was over, so I returned the book and read some notes instead. This post is my notes from reading others' notes. I read William Harryman's notes and an Immunity to Change case study pdf that appears to be from mindsatwork.com.

In today's world, we have many decisions to make and so much data flying at us. We can't make careful decisions about everything. We must employ patterns or shortcuts to reduce the cognitive load of making decisions so we can do more of what is important in our lives.

For instance, rather than making a careful decision about what to eat today, we use some shortcuts/patterns to guide our decisions:

- consistency: I'll eat what I've typically eaten because I'm comfortable with that and know it is good
- social-proof: I'll eat somewhere that people are eating at because the food must be good there
- reciprocation: I'll ask if my friend will let me take her out because she paid for my lunch yesterday
- contrast: I'll eat here because the price is low (really it's average, but we're comparing it to those $15 plates at the other place we just stopped at)
- liking: I'll eat at this place because my friend works here and I like him
- authority: I'll eat here because they've got an award and were praised recently in the paper
- scarcity: I'll eat this food because it is only offered today; I can't have it tomorrow.

Notice that none of these shortcuts involves analyzing and comparing the inherent details of the products; they use proxies to determine value. These shortcuts/behavior patterns are both useful and dangerous. They're useful because they really do cut down on some of the work we have to do when making a decision. We can make quicker decisions, and thus make more decisions and have more time for what's important, if we can cut down on the time needed for each decision. They're dangerous because sometimes they result in sub-optimal or even harmful decisions.

Ex: The bystander effect shows how we use social proof to determine what to do when we are uncertain. There are numerous examples of strangers in life-or-death emergencies with lots of bystanders looking on and not helping. The problem is that the bystanders are uncertain whether the stranger needs help or is okay. At this point the bystanders look to each other to determine whether or not it's an emergency. Since no one is helping, the stranger they are uncertain about (who is in crisis) must not need help. Usually social proof works well, but sometimes it malfunctions.

Ex: As seen all too often, compliance professionals will craft artificial scarcity through "limited time offers" and "limited supplies" and "exclusive information." If we believe something is scarce, we will value it more.

We don't want to totally stop using these shortcuts, as we would lose all the benefits. But we don't want to use them all the time without thinking, because we will get manipulated. The way to avoid being manipulated through these shortcuts is to:

- be observant
- observe manipulations of these shortcuts. E.g. a compliance professional showing you a really high price before showing you a medium price may make you think that the medium price is much smaller than it actually is.

- concentrate on the utilitarian value
- how much value does this thing give me? It doesn't matter how scarce it is or how much I like the compliance professional.

- call them out
- Did you observe a professional abusing consistency? E.g. getting you to say that you are doing well, have a good job, and feel bad about people being hungry on the streets? Then they ask you to donate, and now you feel you must be consistent. Call them out on it. You saw what they did. You saw how they started small and worked their way up to this final request, which induces lots of consistency pressure. You don't appreciate being manipulated, and you won't be persuaded to give your money to someone who attempts to manipulate you like that.

I'm interested in learning more about how to communicate my ideas in a clearer, more persuasive manner. I picked up this book because it was related, but it took me in a pretty different direction. I learned much more about what I wanted to in a 50 minute talk about persuasion at an Amazon conference this last week. The thing I was missing that I learned from that talk is that *there will always be pushback*. People will always resist what you try to get them to do. They are comfortable, and people resist change. That's expected and always happens. You gotta keep pushing.

Anyway, this book was still good; I learned some important things from it--mostly about how to avoid being manipulated (which I despise).

- Contrast
- Reciprocation/Concessions
- Consistency/Commitment
- "A foolish consistency is the hobgoblin of little minds" - Ralph Waldo Emerson
- People want to be consistent about their self image. People use what others do to understand them. People also use what they themselves do to understand themselves. We look back at our actions to determine how we think of ourselves. Experiment: people who had signed a petition a week or so earlier changed their perceptions of themselves into people who were passionate about doing good things for their community, so they were much more likely to put giant obnoxious billboards in front of their houses when an experimenter came by some weeks later. Due to their past small action, they think "I'm a person who does things for what I believe"
- Others use our actions to judge our character. We too look at our own actions to judge our character.
- Chinese prison camps: get small action like copying someone else's writing down which says the US isn't perfect.
- This is generally called the "foot in the door technique". Get a small thing done in the direction you want and keep building, people will want to be consistent.
- Low ball technique: offer a price that is too good to be true. Let customer think about this price, think about your product and start building their own reasons why buying your product is a good idea. Then remove the initial price savings (maybe manager tells you they simply can't offer that low of a price). Now you are faced with whether to buy the product at standard price. You likely will, because you've built all your extra reasons around buying the product, even though you never would have bought the product in the first place if it wasn't for the savings.
- If people are to believe in and value what they've done, they must take responsibility. There can't be any large external factors pushing their decision that make them feel like they aren't ultimately responsible for it. There must be no large rewards and no strong threats. Large external rewards/pressure rob an action-taker of responsibility: they will attribute the action to the external pressure.
- An experiment where people were told they'd get a chance to be in a newspaper if they saved on energy. This initially motivated people to start saving on energy. Then the experimenters took away the initial motivation: they took away the chance to be in the newspaper. When they took this reason away, the people saved even more energy. Why? Because now they had full responsibility. They initially had a small motivation and built on it with reasons of their own on why saving energy was a good thing. Then, when the external reward was taken, they had their own reasons still standing. Now in no way were they doing this for external reward, they were saving energy because they were the energy-saving type of people.
- the difference between short term compliance and long-term compliance: people building their own reasons and taking responsibilities for their actions. If you give them too much reward or too much punishment, they won't take responsibility. Need to let them take responsibility and build their actions into their character.

- Social Proof
- we look to others to determine how to eat, how to act
- this is why canned laughter is effective, even though we know it is fake
- 95% of people are imitators, 5% are initiators. "people are persuaded by the actions of others more than any proof we can offer" - didn't catch the name
- experiment: show video of socially isolated kids that join the group of other kids and start participating in the activities to everyone's enjoyment. It works. The socially isolated kids who watched the video began hanging out with others.
- experiment: cured dog phobias by watching lots of other children play happily with dogs
- removed swimming without floaty fear by watching other kids their age swim happily in pool without one
- bystander effect
- aid is likely if we're convinced there is an emergency
- many times we're not sure, we're uncertain if there is an emergency
- if we're uncertain, we look to others, and they look to us
- everyone sees everyone else not acting like it's an emergency, so no one helps, but it is one!
- uncertainty is the culprit, tackle it:
- hey you in the blue shirt, something is wrong with me, call for an ambulance
- people will help if you give them certainty and responsibility

- liking
- physical attractiveness = intelligent, trustworthy, ...
- similarity
- salespeople see you have golf clubs or hiking gear in trunk, use that to make themselves relatable

- compliments: we're suckers for praise, even if not accurate
- familiarity
- Tupperware parties
- buy from people we like

- Cooperative learning vs competitive classrooms
- common goals that require cooperation
- jigsaw learning

- association
- associations are strong
- why sports ball is popular: if hometown sports team wins, I win
- "don't shoot the messenger"
- weatherman is sent hate mail and death threats if weather is bad
- obviously weatherman doesn't cause bad weather, but associated with it, and that association is strong in people's minds
- messengers would participate in lavish feasts if they delivered good news and would be killed if they delivered bad news
- messenger didn't cause the outcome, just reported it

- how to cope
- mentally remove the compliance professional from the deal: you won't be driving *them* off the lot, you'll be driving the car.

- authority
- Milgram's experiments (press lever administering pain if got wrong answer)
- protesters lay on railroad tracks; the train crews had been told not to stop, and the trains kept going, cutting off legs
- an actor well known for playing a doctor did an advertisement for caffeine-free coffee. The ad was extremely successful, even though he wasn't a real doctor.
- how to cope
- is this person an actual authority? What are their credentials? How did they get there?
- is this person the *relevant* authority? How do they stand to gain from us?
- Vincent the waiter: that dish is not very good tonight, recommends another cheaper dish. Ends up getting bigger tip and more volume of food ordered.

- Scarcity
- more valuable because potentially unavailable in future
- deadline
- limited quantity

- losses are more motivating than equivalently sized gains
- interference/barrier creates more value
- two year olds wouldn't play with toy unless there was an obstruction between it and them
- romeo and juliet

- experiment: when jurors are told evidence is inadmissible after it has been presented, it has the opposite effect. Juries used the banned evidence and weighted it more heavily.
- exclusive info is more persuasive
- drop from abundance to scarcity is more compelling than scarcity alone
- compliance tactic:
- toss out bait (super low prices for a few deals)
- some people get it
- many others rush to the scene and get competitive with each other
- they bite at anything, just like fish will bite at unbaited hooks.

- how to cope
- do you want this thing in order to own it, or because it offers some utilitarian value?
- scarcity doesn't increase utilitarian value

- overall
- we need shortcuts
- too much data, environment too complex
- people abuse shortcuts

For there is nothing either good or bad, but thinking makes it so.

— William Shakespeare

There are two primary takeaways that Dr. Burns reiterated several times throughout this book:

- Feelings are caused by our thoughts
- What happens doesn't determine how we feel. It's our explanations, interpretations, or thoughts about what happens that makes us feel the way we do.

- A lot of our thoughts are distorted or irrational thoughts. The 10 cognitive distortions:
- All or nothing thinking - aka black or white thinking. Things are either good or bad. "If your performance falls short of perfect, you see yourself as a total failure."
- Overgeneralization - "You see a single negative event as a never-ending pattern of defeat."
- Mental Filter - You pick out the bads, ignoring or not seeing the good in things.
- Disqualifying the Positive - When you see a good, you make up a reason for why it doesn't count.
- Jumping to Conclusions - You make negative interpretations without convincing evidence
- Mind Reading - You believe people have negative thoughts about you and don't bother to assess your belief.
- The Fortune Teller Error - "You anticipate things will turn out badly, and you feel convinced that your prediction is an already-established fact."

- Magnification (Catastrophising) or Minimization - You exaggerate your negatives and shrink your positives. "This is also called the 'binocular trick.'"
- Emotional Reasoning - You take your emotions as facts about reality. If you feel something is bad, you conclude it is bad. This is wrong since emotions stem from thoughts. Emotions don't reflect reality, they reflect what you think.
- Should Statements - "You try to motivate yourself with shoulds and shouldn'ts, as if you had to be whipped and punished before you could be expected to do anything. 'Musts' and 'oughts' are also offenders. The consequence is guilt. When you use should statements towards others, you feel anger, frustration, and resentment."
- Labeling and Mislabeling - "This is an extreme form of overgeneralization. Instead of describing your error, you attach a negative label toward yourself: 'I'm a loser.'"
- Personalization - "You see yourself as the cause of some negative external event which in fact you were not responsible for."

Some successful techniques for dealing with and/or identifying these distortions and fixing them:

- triple column technique - write down your negative automatic thought in the left column. Write down the negative distortion in the middle column. Argue with the negative distortion, writing down a rational response in the right column. Example: "I'm a terrible person" -> Labeling & Overgeneralization & Overthinking -> "I disappointed my girlfriend when I was late yesterday. I'm typically on time, but I don't like being late and I've done this a handful of times now. If I want to not be late in the future, I can work on my habits and potentially set alarms so I make it to places on time."
- The vertical arrow technique - write down your negative automatic thought. Ask yourself, "so what?" "what if that's true, what then?" "what does that mean?" write down the next negative thought or interpretation... keep repeating until you arrive at your cognitive distortions, then write out your rational responses.

Tip for dealing with criticism (pros: 1. makes people less aggressive and "takes the wind out of their sails" because they expect you to play defensive and want to fight 2. gives you opportunities to see your mess-ups as mess-ups and not catastrophic failures.. lets you improve and grow):

- First, find a grain of truth in whatever they said and sincerely agree with them
- Then you can ask them about more details to find out exactly what they mean, what they were offended by, more occurrences of the behavior they dislike, etc

They did a bunch of experiments--cognitive therapy is at least as effective as (and maybe more effective than) psychoactive drugs for depression. Cognitive therapy is also the best treatment for anxiety and is successful at treating many other mental issues.

More experiments--cognitive therapy has many of the same effects on the physical brain as the drugs. Bibliotherapy (reading this book) produced longer-lasting results and had as good a success rate as drugs. Bibliotherapy also had a much lower dropout rate (people quitting therapy).

Cognitions/thoughts change the architecture of your brain.

Cognitions/thoughts/beliefs/perceptions or how we interpret things determines our mood. What we think determines how we feel.

*But thinking isn't just arbitrary, at some point our thoughts come from somewhere. So can we surround ourselves with good environments, reminders, habits, and books to have better thoughts?*

Only your thoughts can change how you feel; what other people think cannot affect you. Experiment: the psychologist said he would think one really nice thought about the patient and one really nasty thought about the patient. He closed his eyes and proceeded to think them. He asked the patient how his thoughts changed the patient's mood, but the patient had no idea when the psychologist was thinking what. It's what you think (that others think) that can make you feel bad. It's what **you** think that makes you feel.

I listened to the audio book, Steve Jobs, over the last month or so.

Not surprisingly, this book was almost entirely about Steve Jobs' role at Apple. It also covered Pixar as well as some smaller snippets about things more personal to Steve Jobs like his philosophy, dietary beliefs, and family.

Steve was good at "turning off the noise". At Apple there were hundreds or thousands of product ideas. He insisted that they be refined down to just 2 or 3 to focus on, turning off the rest.

Steve was repeatedly quoted and portrayed as **not driven by profit, but driven by making great products**. I've been thinking about this distinction ever since I heard about type-B corporations. I don't want to work for a company that is driven by monetary profits as the #1 goal. Profits aren't the end, profits are the means to doing something great. We need profits so we can reach more people and develop better tools and services. But having money as the purpose to my work is unfulfilling and draining. I want to work towards improving something beyond the retirement funds of the investors of the company I work for.

**People can't be experts at everything**. There isn't enough time in the day. This is one motivation for why Steve made Apple products so locked down and simple. He wanted to control everything so it all "just worked" and was awesome. Even though I don't use Apple products, I completely agree with this point. If I were an auto mechanic and had a family, I wouldn't want to spend hours figuring out how to make my computer do what I wanted it to do. I'd just want it to work so I can do my job well and be with my family. I'd avoid computers that ate my time. I'm not an auto mechanic, I'm a software engineer--my job is to control computers. I need to understand how to manipulate computers, so I spend time getting into the nitty-gritty. But I don't spend time learning about my car. I just want my car to work.

Steve was known to be **mercurial**: he hated or loved things, he thought you "were shit" or "a genius." Most ideas "were shit" to him, but later that week after calling your idea shit, he'd communicate it to others as if it was his own. He was successful at leading and innovating, but you don't have to be an asshole to be a successful businessperson. Tim Cook was just as good at negotiating as Steve, maybe better, but he and Steve were opposites. Whereas Steve was mercurial, Cook was cool as steel.

One quality that did help Steve more often than not was his **reality distortion field**. He had warped beliefs of reality, believing things could be done perfectly. His beliefs were contagious when you were around him (hence the reality distortion field). He convinced people to do things they didn't think were possible because he believed they were possible (e.g., making the GUI with overlaid windows).

Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.

**some notes**

Steve Jobs started as an adopted kid with really loving parents. His parents encouraged him to learn. He soon realized he was smarter than his parents. He thought of himself as special (and his parents thought he was special). Played a lot of pranks in school. Met Woz in high school. Spent a lot of years "finding himself," looking for enlightenment, doing LSD at a liberal arts college. Steve was at the intersection of humanities and tech. Woz was an incredible engineer. They built the "blue box" together to call on pay phones for free. It gave them confidence. Then they built the Apple I, which sold pretty well. Then they built the Apple II, after a few people finally saw past their long hair and smelly bodies and saw that they had a great product.

When turning from a partnership into a corporation with their first investor ($250k), they had three principles (paraphrased): 1) understand the customer and their desires/needs better than anyone else; 2) eliminate what's not important so you can focus on the things that matter; 3) impute...frame the product and yourselves as you want people to perceive them; let them impute the value and characteristics from the appearance...people do judge books by their covers, so have the right cover.

During and after college, I worked for SentiMetrix for a year as a Data Scientist and Software Engineer. I found almost all my work interesting, innovative, and educational. About five months ago, they let go about 2/3rds of the engineers due to financial hardships, including myself. I started working at Amazon as a Software Development Engineer (SDE).

If you talked with any of my friends during college, they could have told you that the last place I wanted to end up was a big company. I didn't want to be a "cog" in a big machine--where I felt insignificant around thousands of engineers just like me. I wanted to work on smaller things, with smaller groups of people, where I could have a substantial impact.

Since working at Amazon, I have gained a new perspective. Not only did the "cog" perspective melt away for reasons explained below, but I found many other great aspects about writing software for Amazon.

One thing I failed to realize as a college student is that working at a big company doesn't have to mean doing the same thing as everyone else as interchangeable parts. I work in a team of a handful of people. My team owns several small but significant modules (sub pieces) of the Amazon machine. We have a lot of expertise with our systems and related systems.

Ownership: Amazon stresses ownership. Teams own their modules/services/products. Working for my team at amazon is like working for a very small company. We have to convince others to use our product, and we have to support it when it's having issues. However we also get the benefits of working for a large company: everyone is under the same umbrella. Some advantages are that we can trust our users a lot more, and we can communicate much more freely.

Replaceable: You can't simply replace an engineer that's been on my team for five years with another software engineer that's worked elsewhere at Amazon for five years. We have expertise in our domains, in our systems. We become specialized. It would take another engineer years to be as effective as the original engineer. Even within our team, we develop specialties.

Impact: Not only do I have a significant impact on my team and on its products, but my team's products/services have a significant impact on Amazon, and therefore on a large amount of people. I have the privilege of pioneering a new product with an even smaller subset of my small team. When I come up with ideas or find problems in our design, I am having a significant impact on my product and my many future users. I have a large impact on a small team, just like I wanted to.

Amazon has leadership principles. When I first read them, I was excited to see them. I'd love to be surrounded by people that share these principles. However, I was a bit skeptical. I thought they might be taken in the company as just motivational mumbo jumbo...like motivational posters.

After working there for five months, I've realized that for most people at Amazon, these principles aren't just motivational mumbo jumbo; they are active principles to work by. For my interviews, they asked me to prepare by studying the leadership principles and finding examples of them in my life. I've also taken interviewer training at Amazon--the principles are a significant factor in hiring. We use the principles when designing new software and when maintaining old software.

I think our principles are pretty cool, and so I'd like to share some thoughts on them:

- Customer Obsession - The customer is first. Who will benefit from this new product? What needs are we fulfilling? Who do my changes impact? (helps develop sense of purpose in work)
- Ownership - You own your team's products. When something goes wrong, you are responsible. No other teams can make changes to your products without your permission. You are responsible for your products' impacts to amazon and to customers. (helps develop sense of impact, responsibility)
- Invent and Simplify
- Are Right a lot - "They seek diverse perspectives and work to disconfirm their beliefs." This quote is especially important. I value truth, science, and skepticism. I value ditching beliefs, even if we are emotionally tied to them, when the evidence points to the belief being wrong. Being skeptical is one ingredient in making more effective decisions. It's how we can become correct more often and make decisions that are more in-line with reality.
- Learn and Be Curious - I value growing and becoming more effective. I don't like stagnation--or not making progress. One of the things I enjoyed a ton about my last job and about my job at amazon is how much I can learn. There are so many brilliant and experienced people to talk with and learn from.
- Hire and Develop the Best - A one-person team can't get much done. One reason Amazon invests so much into interviewing is that we don't want to "try people out." Poor relationships can become a poor experience for both us and the employee. We also want to surround ourselves with an environment of skilled, principled people we can grow from and grow with. Therefore we set the bar high and only hire people we are really confident in. I've been training for interviewing, and I'm just about to start. I'm looking forward to being able to contribute to building my team.
- Think Big
- Bias for Action - it's okay to make mistakes. It's okay to not have a perfect solution. We'd rather get something out that works quickly so we can evaluate it and be the first to market. Favor quick experimentation when practical over slow rationalization.
- Frugality - Some people don't like the frugality of Amazon, but I like it. Maybe it's a personal quality, but I am fairly frugal myself. I don't like lavish spending. I prefer things be useful rather than luxurious. I think much beyond useful is just wasteful.
- Earn Trust - "They are vocally self-critical, even when doing so is awkward or embarrassing." This I like a lot. It's part of self-growth. I think looking for and admitting your flaws is critical to growth. Refusing to take blame and responsibility when you were in the wrong builds distrust. When others actively admit wrongs and mistakes and work to fix them, it provides growth opportunities for others and builds respect and trust that we have each other's best interest at heart.
- Dive Deep
- Have Backbone: Disagree and Commit - I've read several stories of people being afraid to speak up and disagree with authority. Patients die, planes crash, disasters ensue. I'm happy to be a part of a culture where disagreement is welcomed.
- Deliver Results

When I first joined, my manager sent me to analyticon because of my interest in data science. I got to speak with many research scientists about their careers and experience. There are many senior employees at Amazon who enjoy talking with and mentoring colleagues to help them grow.

There are also tons of online resources: resources on using internal products, advancing your career, learning new skills, and learning from others' mistakes and experiences.

The biggest qualm I have about my current job is one that I hear from other engineers too--I don't feel passionate about online retail. It's a great platform to learn software engineering on, and Amazon is a great company that I am thankful to be a part of, but the customer-facing results of the work I do don't improve my sense of life purpose. The positive impact I currently have on the world is not one I get very excited about.

I also miss reading research papers at work. I still read research in my free time, but I think I'd find it very gratifying to combine my engineering skills and my interest in research to build incredible innovations.

I'm working on being able to solve both of these cons while at Amazon, and keep all of the pros. Amazon has other departments; we have AWS (sweet) and Robotics (awesome). I am continuing to work on my skills so I have the option of transferring to Robotics in a few years.

Overall I really like working at Amazon. I feel proud to be a part of such a strong and intelligent team, and I am thankful for the ways they help build me and create a great environment for me to grow in.

This post contains my notes and thoughts on the paper Human-level control through deep reinforcement learning.

This paper is from DeepMind. The team writes about an algorithm which successfully plays Atari games such as Breakout, Boxing, and Pong. In fact, it plays many of these games better than professional human players can. What's remarkable about this paper, however, is that their algorithm receives only images of the game and the score as input.

This paper uses a **Markov Decision Process (MDP)** algorithm called **Q-learning** to automatically learn a function which can play games. At each time step, given the state of the game (an image and a score)^{1}, the algorithm chooses the action it believes will maximize its cumulative reward (game score)^{2}. The cumulative reward is discounted at times further in the future, meaning that, to some configurable extent, given two rewards of equal value, the sooner reward is more important than the later reward.
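The discounted cumulative reward can be sketched in a few lines of Python (the discount factor `gamma` is the paper's 0.99 by default; the function itself is my illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of future rewards, where a reward i steps ahead is scaled by gamma**i."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Given two rewards of equal size, the sooner one contributes more:
# discounted_return([1, 0, 0]) > discounted_return([0, 0, 1])
```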

For technical reasons (correlation of states, aka the **instability problem**), the authors introduced what they call **experience replay** to their algorithm. Experience replay randomly selects states from the past and learns from them. The idea was inspired by biology. Experience replay allowed this algorithm to be complex (a neural network), whereas without experience replay, past papers had to use much simpler algorithms.
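A minimal sketch of an experience-replay buffer (my own illustration, not DeepMind's code): store transitions as they happen, then learn from uniformly random samples rather than consecutive, correlated states.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (state, action, reward, next_state) transitions up to a capacity;
    sample them uniformly at random to break correlations between states."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall off the front

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random draw without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)
```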

Another thing this paper did to solve the instability problem was only periodically update the neural network. Essentially, they'd run the algorithm for several steps--feed an image into the neural network, it outputs an action, input the action into the Atari emulator, repeat. Then after many iterations of that they'd update the neural network's parameters to account for what it had learned over those steps (using RMSProp (back propagation)).

1: actually, the state they used was a sequence of 4 frames/images from the game and the score.

2: actually, the people at DeepMind decided to feed changes in score to the algorithm, and they clipped all positive changes in score to +1 and all negative changes in score to -1 because it helped the algorithm converge to an optimal solution better.
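The clipping described in footnote 2 amounts to a tiny function:

```python
def clip_reward(delta_score):
    """Map any positive score change to +1 and any negative change to -1."""
    if delta_score > 0:
        return 1
    if delta_score < 0:
        return -1
    return 0
```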

Their algorithm, "receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games" (1).

"We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks--a central goal of general artificial intelligence that has eluded previous efforts" (1).

**DQN** - deep Q-network; the algorithm that is the topic of this paper

They use a deep convolutional network (multiple convolutional layers) which builds "robustness to natural transformations such as changes of viewpoint or scale" (1).

"We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward" (1).

"We use a deep convolutional network to approximate the optimal action-value function" (1).

$$ Q^*(s,a) = \max_\pi\mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots|s_t=s, a_t=a, \pi] $$

*Note: See section below on Q-learning and Reinforcement Learning to understand what this means.*

*Note: The paper uses \(\mathbb{E}\) to represent the "expected value" (average or mean).*

The instability problem

"Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values...and the target values" (1).

This paper's novel solution to the instability problem

"First, we use a biologically inspired mechanism termed experience replay that randomizes over the data, thereby removing correlations in observation sequences and smoothing over changes in the data distribution. Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target" (1).

"To perform experience replay we store the agent's experiences \(e_t = (s_t, a_t, r_t, s_{t+1})\) at each time-step \(t\) in a data set \(D_t = \{e_1,\dots ,e_t\}\). During learning, we apply Q-learning updates on samples of experience drawn uniformly at random from the stored samples," \(D\) (1).

"Our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner" (2).

"Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches. Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half the games" (2).

The paper used an algorithm called "t-SNE" to visualize "the representations learned by DQN" (3).

"Games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro)" (3).

"DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall allowing the ball to be sent around the back to destroy a large number of blocks...). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents including DQN" (4).

"In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and game score as inputs" (4).

"Our approach incorporates 'end-to-end' reinforcement learning that uses reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation" (4).

"The successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm" (4).

The paper does preprocessing to the images: it removes flickering, and it extracts "the Y channel, also known as luminance, from the RGB frame and rescale[s] it to 84x84" (6). The Y channel, or luminance, is just the black-and-white brightness.
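Extracting luminance from RGB is just a weighted sum. The paper only says "Y channel," so the exact weights below (the common ITU-R BT.601 coefficients) are my assumption:

```python
def luminance(r, g, b):
    """Approximate Y (luma) from RGB using ITU-R BT.601 weights.
    These coefficients are a standard choice, not taken from the paper."""
    return 0.299 * r + 0.587 * g + 0.114 * b
```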

Estimate Q using the neural network.

"Q maps history-action pairs to scalar estimates of their Q-values." Previous approaches have used history-action pairs as inputs to the network. The drawback of this is that if you want to compute the Q value for a history, you must compute the output of the network for all possible actions which is expensive (6).

DQN uses only the history as input to the network, and has one output unit per action which corresponds to that action's Q-value (6).
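The practical upside of one output unit per action: a single forward pass yields every Q-value, and choosing the greedy action is just an argmax. A sketch:

```python
def best_action(q_values):
    """q_values: one Q estimate per action, from a single forward pass.
    Returns the index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])
```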

The architecture of DQN is a few convolutional layers with **rectifier nonlinearities** as activation functions (6). A rectifier nonlinearity is simply max(0, x) (wiki).

"We clipped all positive rewards at 1 and all negative rewards at -1...[which] limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games" (6)

They used RMSProp gradient descent with a learning rate of 0.00025 and a minibatch size of 32 (6).

RMSProp: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. (source)
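That description translates to roughly the following per-weight update. The `decay` and `eps` values are common defaults I've assumed, not numbers from the paper:

```python
def rmsprop_update(w, grad, avg_sq, lr=0.00025, decay=0.9, eps=1e-8):
    """One RMSProp step for a single weight: divide the learning rate by a
    running average of recent gradient magnitudes, per the description above."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2   # running average of grad^2
    w = w - lr * grad / (avg_sq ** 0.5 + eps)           # scaled gradient step
    return w, avg_sq
```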

They used an ε-greedy behavior policy "with ε annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 after that. We trained for a total of 50 million frames...around 38 days...and used a replay memory of 1 million most recent frames" (6).

ε-greedy behavior policy: "the agent chooses the action that it believes has the best long-term effect with probability 1-ε , and it chooses an action uniformly at random, otherwise."
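An ε-greedy choice is only a couple of lines (my sketch):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore (uniform random action) with probability epsilon;
    otherwise exploit the current Q estimates."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])
```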

They used a frame-skipping technique where "the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames...we use k=4 for all games" (6).
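Frame skipping can be sketched like this, where `env_step` and `choose_action` are hypothetical stand-ins for the emulator and the agent:

```python
def run_with_frame_skip(env_step, choose_action, n_steps, k=4):
    """Select a new action on every k-th frame; repeat it on skipped frames."""
    action = None
    frames = []
    for t in range(n_steps):
        if t % k == 0:
            action = choose_action()   # agent only "thinks" every k-th frame
        frames.append(env_step(action))  # last action repeated on skipped frames
    return frames
```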

The games were evaluated with ε = 0.05 (6). They were *trained* with the annealing process described above.

The agent "receives a reward r_{t} representing the change in game score. Note that in general the game score may depend on the whole previous sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed" (6).

"It is impossible to fully understand the current situation from only the current screen x_{t}. Therefore sequences of actions and observations...are input to the algorithm, which then learns game strategies depending on these sequences." "This formalism gives rise to a large but finite Markov Decision Process (MDP) in which each sequence is a distinct state" (6).

"We make the standard assumption that future rewards are discounted by a factor of γ per time-step (γ was set to 0.99 throughout), and define the future discounted return at time t as \(R_t = \sum_{t' = t}^T \gamma^{t' - t}r_{t'}\), in which T is the time-step at which the game terminates. We define the optimal action-value function \(Q^*(s, a)\) as the maximum expected return achievable by following any policy, after seeing some sequence s and then taking some action a, \(Q^*(s,a) = \max_\pi\mathbb{E}[R_t|s_t = s, a_t = a, \pi]\) in which π is a policy mapping sequences to actions" (6).

"The optimal action-value function obeys...the Bellman equation."

$$ Q^*(s,a) = \mathbb{E_{s'}}[r + \gamma \max_{a'} Q^*(s', a')|s,a] $$

where prime (') represents "next", so where \(s\) is the present state, \(s'\) is the next state.

"The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, \(Q_{i+1}(s,a) = \mathbb{E_{s'}}[r + \gamma \max_{a'} Q_i(s', a')|s,a]\). Such value iteration algorithms converge to the optimal action-value function, \( Q_i \to Q^*\) as \(i \to \infty\). In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, without any generalization. Instead, it is common to use a function approximator to estimate the action-value function, \(Q(s,a; \theta) \approx Q^*(s,a)\)" (6).
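The sampled, tabular version of that iterative update is the classic Q-learning rule. A sketch (the learning rate `alpha` replaces the full expectation, so this is the textbook variant rather than the paper's network-based version):

```python
def q_update(Q, s, a, r, s_prime, actions, alpha=0.1, gamma=0.99):
    """One sampled Bellman backup on a tabular Q, stored as a dict
    keyed by (state, action). Unseen entries default to 0."""
    target = r + gamma * max(Q.get((s_prime, ap), 0.0) for ap in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q
```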

This paper uses the convolutional neural network described above as the function approximator, \(Q(s,a; \theta)\). They call their approximator a Q-network since it approximates Q, and its weights are θ (7).

When estimating the value of \(Q^*(s,a)\) in the present, the bellman equation is approximated using the Q-network with parameters (θ) from the past. The loss function is the mean-squared error in the bellman equation. A result of this loss function is that "the targets depend on the network weights; this is in contrast with targets used for supervised learning, which are fixed before learning begins" (7).
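The idea of targets built from a frozen, older copy of the network can be sketched with plain functions standing in for the two network copies (a sketch of the idea, not the paper's implementation):

```python
def td_loss(q_online, q_target_frozen, transitions, gamma=0.99):
    """Mean-squared Bellman error. q_online and q_target_frozen each map a
    state to a list of per-action Q estimates; targets come from the frozen
    copy, so they don't shift with every online update."""
    total = 0.0
    for s, a, r, s_prime in transitions:
        target = r + gamma * max(q_target_frozen(s_prime))  # Bellman target
        total += (target - q_online(s)[a]) ** 2
    return total / len(transitions)
```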

"Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimize the loss function by stochastic gradient descent" (7).

"The agent selects and executes actions according to an ε-greedy policy based on Q" (7).

Experience replay is effective because "each step of experience is potentially used in many weight updates, which allows for greater data efficiency." And "learning directly from consecutive samples is inefficient; owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates" (7).

To improve learning and the stability of the algorithm they "also found it helpful to clip the error term from the update...to be between -1 and 1" (7).
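The clipping they describe can be sketched as a one-liner; bounding the error term keeps a single outlier transition from producing a huge update.

```python
def clip(x, lo=-1.0, hi=1.0):
    """Clip the error term to [lo, hi] before it enters the update."""
    return max(lo, min(hi, x))
```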

On page 7, the paper shows 10-20 lines of pseudocode representing the algorithm.

I found the following resources to be helpful when reading this paper:

- https://en.wikipedia.org/wiki/Reinforcement_learning
- https://en.wikipedia.org/wiki/Markov_decision_process
- https://en.wikipedia.org/wiki/Q-learning

"Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly correct. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge)" [1].

Reinforcement learning deals with discrete time. At each time t an agent is in a state and receives some reward. It must choose an action to move to the next state. The goal of the agent is to maximize cumulative reward. "In order to act near optimally, the agent must reason about the long term consequences of its actions" [1].
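The discrete-time loop described above can be sketched generically. The `reset()`/`step()` interface and the toy environment below are my own assumptions for illustration (loosely Gym-style), not from the articles:

```python
class CountdownEnv:
    """Invented toy environment: the episode lasts 3 steps, each paying reward 1."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3   # (next state, reward, done)

def run_episode(env, policy, gamma=0.99):
    """Run one episode and return the cumulative discounted reward."""
    state = env.reset()
    ret, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # agent chooses an action...
        state, reward, done = env.step(action)   # ...and moves to the next state
        ret += discount * reward                 # accumulate discounted reward
        discount *= gamma
    return ret
```

The discount factor γ < 1 is what forces the agent to weigh long-term consequences against immediate reward.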

MDP - Markov Decision Process. "MDPs provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker" [2]. The first few paragraphs of that page are excellent for explaining what's going on.

**policy** - "a rule that the agent follows in selecting actions, given the state it is in" [3].**action-value function** - "gives the expected utility of taking a given action in a given state and following the optimal policy thereafter" [3].

"One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment" [3]. [3] is simple and concise in the first few paragraphs. I'd copy most of it down verbatim if I took notes.

The Power of Habit is a book about how habits work, what role habits have in our lives, and how habits can change.

The author uses the term habit or "habit loop" to describe a "cue, routine, reward" process. You experience a cue, a routine is invoked, and you receive a reward. When you no longer consciously choose which routine to execute, the routine has become habit.

In this book, the author nearly equates a reflexive action (cue -> action) and a habit (cue -> action -> reward). **A habit doesn't have to be executed frequently** to be a habit. For instance, in Chapter 9 a man strangled his wife to death while he was sleepwalking. The author called the man's action a habit (the man executed the "defend your loved ones" program, which involved strangling what he mistook to be a stranger lying on his wife). The man probably didn't strangle people frequently, but according to the author, *this is still a habit*. Note: there is some technical discussion in this chapter (and others) on how the basal ganglia in the brain are responsible for executing habits, and when people perform actions while sleepwalking, their brains look just like they are executing habits--pretty quiet everywhere except for the basal ganglia.

"When a habit emerges, the brain stops fully participating in decision making."

There's a lot of discussion in the book about compulsive behavior (eg drugs, gambling). This book boils compulsive behavior down to the "habit loop". We experience a cue (tired, hungry, bored, friend walks in, ...), we invoke the routine, and we get the reward, all without thinking.

Initially, we may have chosen which action to take given a cue. After executing the habit repeatedly, we stop making a choice.

Changing habits is not about trying harder or wanting more. It's about understanding cues and rewards, and substituting routines. "It seems ridiculously simple, but once you're aware of how your habit works, once you recognize your cues and rewards, you're halfway to changing it." Nathan Azrin -- developer of "habit reversal training"

You can't change the cue and reward--they will always be there--but you can change the routine in between.

The book says the reason Alcoholics Anonymous is so successful is that it deals with alcohol as a habit. You have to identify your cues, and use your sponsor or something else as a substitute response.

In chapter 2, Claude Hopkins used habits to sell. His rules were: (1) find a simple, obvious cue, and (2) clearly define the rewards. If you get these two right, "it's like magic."

Want to stop smoking? Figure out your rewards (eg structure to day, stimulation) and cues, and substitute a new response (eg pushups, caffeine, walks, etc).

Want to stop snacking at work? Identify cues and rewards. Maybe a 3 minute internet break, or a brief walk will work.

The people most successful at changing bad habits (or creating new, difficult ones) are those that think ahead of painful inflection points and plan responses to overcome them. Over time, these responses to cues become habits. For example, you know that feeling tired (a painful inflection point) discourages you from working out. So you plan ahead of time how to make sure you can have energy when you feel low energy (work out in morning, drink caffeine, take a nap, ...). This will become a routine.

I wanted to learn more about motivation. Why do people do what they do? I think that we perform many actions which we don't really think about. Audible recommended this book and it was the clear winner for next book.

It was pretty good. It stretched the concept of habit a bit far, but I appreciate the gist of it. It discussed habits in the workplace too, and the importance of crises for changing habits which I want to use when the time is appropriate.

Some lady was living recklessly--overeating, smoking. Her husband divorced her; she set a goal of crossing the desert and believed she had to stop smoking in order to achieve it. This goal setting competed with her desire to smoke, and won (two regions of the brain: one that lights up showing she is attracted to the food, one that lights up with her inhibitions).

The military is huge on habits. Saw that removing food vendors might prevent large gatherings from turning violent (people are tired, hungry, and have nothing to throw).

"Nothing you can't do if you get your habits right"

We now understand how habits work, how to break them, change them, make them.

"Chunking" - turning behavior into habit

Basal ganglia

1. Cue the habit, determine the correct habit (spike in brain activity)
2. Operate the routine
3. Awaken, make sure it happened as expected, receive the reward (spike in brain activity); determine whether the habit is worth remembering next time

Habits are born through this process.

"When a habit emerges, the brain stops fully participating in decision making."

Unless you actively resist the habit, it will unfold.

Habits never really disappear. That is an advantage. The brain can't tell the difference between bad and good habits. "If you have a bad one, it's always lurking there waiting for the right cues and the rewards."

Without cues, habits aren't invoked. Eugene couldn't find his way home if there were lots of branches on the street, or construction.

Rewards can be external or internal.

"The craving Brain - How to create new habits"

Claude Hopkins - advertiser in the past. Used his understanding of habits. The secret is that he found a cue and a reward to cultivate the habit of tooth brushing. He created a craving, and that craving is what makes cues and rewards work. It is what powers the habit loop. He looked through medical dental books and found out about "the film" that forms on teeth after eating. Even though eating an apple or running your tongue across your teeth would get rid of the film, it was exactly what Hopkins needed: he had a cue. "Just run your tongue across your teeth"--you need the toothpaste to get rid of it (even though toothpaste wasn't effective at getting rid of it). He had created a habit by finding the cue. He claimed the film is what makes your teeth decay and turn yellow. He pointed to others (fallaciously) saying that the product made their teeth white and clean. The lies didn't matter; he had a cue. **2 basic rules: (1) find a simple, obvious cue, and (2) clearly define the rewards. If you get these two right, it is like magic.**

There is a third rule that Hopkins overlooked because it was so obvious, but it is necessary.

Febreze - found the perfect scent-removing product. Didn't just mask; was cheap to manufacture. All Drake Stimson needed to do was figure out how to turn it into a habit. A nice lady just wanted to go on dates but worked with skunks; none of her friends could smell it. They decided that the key to Febreze was to market the reward/relief the lady felt. Cue: cigarette smells, pet smells. Reward: relief from smells. Febreze was failing. Grocery stores were full of it; none was being sold. Stimson: "At the very least, let's ask the PhDs what's going on." One lady didn't smell her 9 cats; she was desensitized to the smell. Cigarette smokers were desensitized to the smell. The product's cue was hidden from the people who needed it most. Bad scents weren't noticed frequently enough to trigger the habit. The people who needed it most never noticed the smell. The cue wasn't a cue for those who needed it.

Julio the monkey - when he sees certain shapes on the screen, he is rewarded with a drop of blackberry juice. The blackberry juice was a reward, the brain lit up showing happiness. Soon the shapes triggered the happiness. Before Julio even pressed the lever to get the juice, his brain lit up showing he was happy when seeing the shape. He became frustrated or depressed if the juice didn't come when the shape appeared. **Habits create cravings for rewards.**

The smell of Cinnabon in the mall gets people bringing out their wallets without thinking. They carefully put their kiosks away from other smells so you only smell their uninterrupted sweetness and are compelled to buy. The smoker will experience a craving for cigarettes on the cue of the sight or smell of one--not because of one encounter with a cigarette, but because of the habit.

**Cue and reward aren't sufficient for habit, need craving.**

Some doctor: I work hard because I expect pride, I exercise because I expect to feel good afterwards, I just wish I could pick and choose better.

Febreze researchers found a lady who used it daily. She didn't use it to get rid of bad smells; she used it to make things smell clean. She had a ritual of using it after cleaning a room to make it smell good. So Febreze changed the marketing to make Febreze the product that makes your stuff smell clean, rather than one that eliminates bad odors. Most people didn't crave eliminating bad odors, but they did crave the fresh smell after they were done cleaning. It made them feel good, like the job was done. Febreze piggy-backed on the already-present sensation that people felt good after seeing the clean room. Now Febreze had a craving: people craved things smelling as good as they looked when they were done cleaning.

Other toothpastes used the "film" and white-teeth claims. Pepsodent didn't win because of these; Pepsodent won because the inventor put a little mint oil and citric acid in the formula to make your mouth feel fresh. People craved the cool tingling these ingredients created. Now they crave the foaminess and the cool feel, even though foaminess doesn't help the cleaning (same thing with foaminess in shampoo).

The key: create a craving. New habits form when we crave their rewards. (what about the habit of driving to work?) People exercise because they crave the endorphin rush. Successful dieters are successful because they crave wearing that new bikini, or (if they're like me) they want to avoid feeling unhealthy, and want to build muscle while doing so (because why not?).

Golden rule of habit change -- why transformation occurs

You can't change cue and reward, they will always be there, but can change the routine in between.

Dungy, coach of the Buccaneers, used this. He didn't want to give them new habits, just change their old ones. His philosophy was that you don't want players thinking on the field; you want them executing more quickly than the other players. He had them repeat the same handful of plays until they were experts. They finally stopped thinking, executed more quickly, and won 10(?) years in a row(?).

always cue, routine, reward

AA forces you to identify cues and rewards, and change the routines. perfect habit changing.

Mandy the nail biter identified boredom and the sensation in her fingertips as the cue, and a sense of physical completion as the reward. Substituted putting her hands in her pockets, making a fist, or grabbing something. The routine changed; cue and reward stayed the same...the golden rule of habit change.

"competing routine" one habit had replaced another.

"It seems rediculously simple, but once you're aware of how your habit works, once you recognize your cues and rewards, you're half way to changing it." nathan asrin developers of "habit reversal training"

"The brain can be reprogrammed, just have to be deliberate about it"

Want to stop snacking at work? Identify cues and rewards. Maybe a 3-minute internet break or a brief walk will work.

Want to stop smoking because of structure, stimulation, ...? Same. Pushups, caffeine, walks, etc.

(AA) Replacing habits worked until shit hit the fan, unless they had spirituality. Belief itself is what helped. Belief was the ingredient--belief that ended up allowing them to believe they could make a permanent change, belief that things will get better. Eventually they'll have a bad day, and no routine will help them make it through it. What does help is the belief that they can make it through without alcohol.

The Colts and the Buccaneers had the same problem. They had great routines, but when the pressure came, they didn't have the belief and caved to their old routines of thinking. They needed to believe the routines would work, so that at moments of high pressure they would continue with them and succeed.

Organizational habits

O'Neill worked in government for 16 years. Used lists for everything. Studied organizations and saw that organizational habits/routines are what differentiated organizations. Decided to take the CEO position of a company and change its safety habits. Many investors left because he seemed crazy to focus only on safety in his speech.

The author notes that at NASA they needed to promote risk taking; the department managers would applaud when rockets exploded on the launch pad. It became a habit that increased risk-taking.

Keystone habits: exercise; safety at O'Neill's company. Why does changing this one habit propagate changes through life? How to identify them?

O'Neill had a requirement that any workplace injury be reported to him within 24 hours. This forced presidents to be in contact with vice presidents, vice presidents to be in contact with managers and floor managers, and managers to listen to employees--and this communication line had to be responsive. It gave employees power to stop the line when they felt uncomfortable. All safety suggestions coming from the employees were listened to. Faulty products were fixed, resulting in less waste and higher-quality metals. Productivity shot through the roof, and somehow they became more profitable.

Phelps - "Put in the video tape." - routine for playing the video tape in phelps' head for winning the race. He was to visualize, each morning and night entering the water, stroking, kicking, turning, etc. Visualize perfection in the race. Once Bowman got a few key habits in place, all the other habits fell into place (eating, stretching, practicing, ...). Phelps also had a routine for relaxing before bed--Tensing the muscles, then letting the tension melt away because he was stressed from family stuff.

"small wins" - small successes that creates patterns of success and open the door to large success. Gay rights movement got books reclassified.

Phelps starts with small wins--habits of waking up, stretching, warming up, playing exactly the expected songs. Tons of wins already. By the time he gets to the race, he's already made a ton of successes, and this is just the next habit to execute. Phelps's goggles fogged and he set a world record by following the vision in his head. He knew he needed 19-21 strokes and pushed. All habits, all vision, all repetition and small wins. The "WR" (world record) was another **small win** from just following habits and vision.

Alcoa - "We killed this man" (an accident). 2 weeks later, a small win with lowered accidents. He sent out a memo to the entire company. People copied his memo, even painted his face with the memo. Then a worker gave a suggestion to management which helped them make millions--we were already giving safety suggestions, why not give this other suggestion? Small wins.

O'Neill finding root causes: with the infant mortality rate in the US, he discovered it was malnutrition of teenage mothers in rural areas, then found out that high school teachers couldn't teach nutrition because they didn't know enough about biology. The root cause was the high school teachers' education. Implemented a plan to teach everyone about biology so they could teach these high school kids about nutrition, so they could increase child health. Small wins: the ability to trace root causes in government.

Experts used to advise people to radically change their lives in order to lose weight. It started well, but people lost interest--piling on too much change at once made it impossible for any of it to stick. Then a research group in 2009 tried something different. They just wanted their obese subjects to create a food journal and, once a week, write down everything they ate. All they asked for was this. Soon it turned into a habit, and this small win led to other wins. Without the researchers asking, the obese subjects started noticing patterns and planning meals. Noticing patterns: some noticed they snacked at certain times of day, so they brought a healthy snack with them. Planning meals: some saw what they ate all written down and planned a healthier meal for dinner.

O'Neill said they needed a real-time way to share safety information worldwide. They created a worldwide corporate email system, which worked for just this, then turned into a way to share pricing information and information about competitors. They were ahead of their competition by years.

An Alcoa senior manager in New Mexico hid a safety incident about fumes. O'Neill discovered it. The safety culture made the decision clear: he was fired, and one manager said he had fired himself.

"Starbucks and the habit of success - when willpower becomes automatic"

Travis - son of two heroin/crank addicts. Tough life. Quit high school from pressure, exploded and cried at work. Starbucks' training taught him life skills he was missing from school and parents.

Studies in the 80s (?) discovered that willpower was the #1 predictor of success (eg 4-year-olds rewarded with a second marshmallow if they could abstain from eating the first for a few minutes). By the end of the Harvard research, they discovered willpower was teachable--a skill.

But Marvin (?) and colleagues wondered: if willpower is a skill, why does it seem to fluctuate over time? My skill at making omelets doesn't fluctuate over the week, but my willpower does.

Willpower is a muscle, not a skill. People get frustrated faster after exerting willpower. Strengthening willpower through exercise, money management, or study skills results in strengthened willpower around TV, studying, exercise, healthy foods, and less alcohol and cigarettes.

Starbucks - we're not in the coffee business serving people, we're in the people business serving coffee.

How is willpower a habit? By thinking ahead of painful inflection points and planning responses to overcome them. Over time, these responses to cues become habits. The people who wrote down plans for how to deal with painful cues recovered twice as fast as those who didn't. Starbucks now does the same thing: it suggests ways for its employees to respond to painful inflection points. The idea is, know what dangers, pains, temptations, and easy ways out lie ahead, and prepare how to respond to them so you aren't overtaken by them.

Studies on willpower and how you treat the subjects: tell them not to eat the cookies, either nicely or harshly. If you tell them nicely, they have lots of willpower to spare; if you tell them harshly, they are out of willpower. This turned out to be because when you tell them things nicely, they feel like they are **in control** (they are requested not to eat the cookies and given reasons why they shouldn't, not ordered not to eat them). The same thing happened at Starbucks: rather than dictating where the merchandise goes, where the blender goes, and how to greet customers, Starbucks employees decide these things. A sense of control boosts productivity.

The power of a crisis - how leaders create habits through accident and design

An old man fell and hurt his head; blood was pooling in his brain and he needed surgery quickly to relieve the pressure. The doctor drilled into the wrong side and the hospital was sued for malpractice. It turns out the hospital had very arrogant doctors, and the doctor signed the paper saying it was the right side of the brain even though he didn't know--he had glanced at the images but mistakenly thought the bleeding was on the right side when it was actually on the left. Nurses tried to speak up, but the doctor's arrogance halted the conversation and they drilled into the wrong side. Moral of the story:

Every organization has habits, some are accidental habits, others are intentional. If the leaders of the organization don't pay attention to the habits and intentionally guide them, the habits will emerge out of chaos, oftentimes based on fear.

A paper in economics studying lots of organizations over decades: it may seem like organizations' decisions are guided by careful scrutiny and decision making, but actually their actions are guided by habits formed by thousands of employees' independent decisions over years.

Routines/habits are necessary or nothing would ever get done.

Crises make habits malleable. It's better to use them than to let them die down. Wise leaders prolong the sense of emergency after crises.

The hospital from earlier used the crisis to change its culture. A hospital leader made the crisis bigger and longer by inviting investigators. Now surgeons and nurses have checklists, nurses may interrupt for a timeout, and every 3 months doctors must describe an error or mistake in front of all their peers (a vocally self-critical post-mortem). They learn to embrace mistakes, learn from them, and let others learn from them rather than hiding them.

How Target knows what you want before you do - When companies predict and manipulate habits

A statistician hired by Target had to figure out, based on the tons of data Target collects, which customers were likely pregnant. Target was going to use this data to try to make more money off of these customers, since they likely did all their shopping at the same store.

There are agencies that sell information such as: which products you mention favorably online, how many cars you have, how much money you make, .... Target uses this along with your purchasing behavior to make sure it's making the most money off of you it can.

*Side note: This is disgusting to me! I don't like being manipulated. I will not do this sort of data science. I want to use data to help people, not to manipulate them into behavior that puts more money in the pockets of rich corporate leaders and investors. At SentiMetrix our data science was going to help people get diagnosed more quickly, cheaply, and correctly and make medical agencies more money. Win-win, not lose-win!*

Pregnant women are "gold mines"...items (diapers, baby bottles) that companies like Target sell at a significant profit

The hard part is using this data without letting customers know Target is tracking every detail of their lives (creepy).

The song "hey ya" failed even though everything said it'd be great because it was too different. People need things to seem familiar, can't judge a song every time it comes on the radio, the "sticky" songs are the ones that sound just like you'd expect them to sound, the archytype of the genre.

So how do you get people to do something new without freaking out because it's too different? How can Target send pregnant women ads without raising an alarm? Dress something new in old clothes. It's gotta be familiar.

YMCA figured out that people start at the gym because of equipment, but stay because of social things (like employees knowing their names, or meeting workout buddies).

Saddleback Church and the Montgomery bus boycott - how movements happen

A three-part process that shows up again and again:

- Starts because of the close ties of friendship
- Grows because of the habits of a community
- A leader gives the new habits a fresh sense of identity and ownership

Other black people got arrested for defying bus seating laws, but Rosa Parks' incident caused a protest because she was deeply respected and embedded within her community.

It's usually hard to stand up for a stranger's injury, but very easy to stand up for a friend being treated with injustice. Rosa Parks had lots of friends from different groups; she had "strong ties"; she gave way more than she received.

Weak-tie acquaintances allow us to get into jobs we otherwise wouldn't know about or have access to. Weak-tie acquaintances are often more important than strong-tie friends. The power of weak ties helps explain how protests can expand from close friends to thousands of people. If you aren't helpful to your acquaintances, word can spread that you're not a team player, and you'll lose the benefits of being part of the clubs and cliques you're a part of. "Peer pressure" is how things spread beyond close friends.

Peer pressure got people to boycott the Montgomery buses. It all came together in 5 days. The community felt obligated to boycott, for fear that anyone who didn't participate wasn't someone you wanted to be friends with.

Social groups' expectations explained who went to the Freedom Summer voter registration drive.

Saddleback Church made small groups to solve the leader's depression problems. They got the friends from the small group and the community peer pressure from the congregation. All of us are a bundle of habits. Saddleback Church creates habits of daily reflection, tithing, and small groups.

An idea must become **self-propelling** for a movement to take place. Give them new habits to figure out where to go on their own.

Free-will

Gambler who created habits...

The man who strangled his wife to death, mistaking her for an intruder attacking his wife.

The jury ruled he was not guilty, but Bachmann the gambler was guilty. Both were operating on habits. The man was in a sleep terror, operating on habits that couldn't be stopped.

MRI study: People with gambling problems react to near misses the same way they react to wins. People without gambling problems react to near misses like losses.

You want to know why lottery profits have grown? Every other scratch ticket is designed to make you feel like you almost won. People who equate near misses with wins are the people who make the lottery profitable.

Similar cases: people on drugs, unable to resist urges to gamble, winning settlements of millions from pharmaceutical companies. Their brains look very similar and they are compelled to gamble, but Bachmann was ruled as having control over her actions while people on drugs were ruled as not having control.

Aristotle and habits

The difference between Bachmann and the sleepwalking murderer: Bachmann was conscious of her gambling habit; she had the ability to change it. Without changing the habit she was powerless when a cue arose, but she had the ability to change the habit--she had the ability to put herself on the "do not gamble" list. The sleepwalking murderer wasn't aware that he could murder in his sleep; he couldn't have prepared for it.

Your habits, your involuntary responses to cues, control your destiny. You control your habits by being aware they exist and how they work.

Habits - the actions you don't consciously choose anymore; they've become automatic/routine.

This article contains my notes and thoughts on the paper Gradient-Based Learning Applied to Document Recognition.

The paper was published in 1998 by 4 individuals from AT&T Labs (43). The authors devised an algorithm to automatically locate and read dollar amounts from checks (37-39). They put their algorithm into use in June 1996 and for years after it was reading millions of checks per day (40).

The threshold of economic viability for automatic check readers, as set by the bank, is when 50% of the checks are read with less than 1% error. The other 50% of the checks are rejected and sent to human operators (37).

Requirements:

- The system must **find the field** that is most likely to contain the **courtesy amount** (the amount in the box, henceforth called the "amount"). "This is obvious for many personal checks...however...finding the amount can be rather difficult in business checks, even for the human eye" (37).
- Then it must **read the amount**. This system does so by segmenting the characters and applying a recognition algorithm to the individual characters (37).

The system works by computing a graph of possibilities where each path from start to finish in the graph is a candidate for what might be the amount. Then it chooses the best path through the graph. The way it does so is by using a GTN or Graph Transformer Network, the primary subject of this paper.
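The "choose the best path" step can be sketched as a shortest-path computation over accumulated edge costs. This is my own sketch, not the paper's algorithm: the toy graph below (nodes 0-3, edges carrying a label fragment and a recognition cost) is invented for illustration; each start-to-end path spells one candidate amount.

```python
# node -> list of (next_node, label, cost); lower cost = lower recognition error
edges = {
    0: [(1, "3", 0.2), (1, "8", 0.9)],
    1: [(2, "4", 0.1), (3, "41", 0.8)],
    2: [(3, "1", 0.3)],
}

def best_path(start, end):
    """Dynamic program over a DAG: cheapest accumulated cost to each node."""
    best = {start: (0.0, "")}    # node -> (accumulated cost, labels so far)
    for node in sorted(edges):   # node numbers here are already in topological order
        if node not in best:
            continue
        cost, labels = best[node]
        for nxt, label, c in edges[node]:
            if nxt not in best or cost + c < best[nxt][0]:
                best[nxt] = (cost + c, labels + label)
    return best[end]

cost, amount = best_path(0, 3)   # the cheapest path spells "341"
```

The design point: every candidate segmentation/reading survives as a path, and the decision between them is deferred to a single global cheapest-path choice at the end.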

- Global learning is less expensive (in terms of training data) and more effective than local training, heuristics, and expert knowledge
- Graph transformer networks (GTNs) can be used to learn on variably-sized input. GTNs are a network of simpler components (which support back-propagation) (eg a convolutional network) which, when put together, can solve pretty complicated tasks. GTNs also facilitate global learning.
- For classification neural networks, there's another way to do output layers other than N output nodes where N is the number of classes, and training such that one node represents one and only one class. This paper introduced the idea of using shared nodes where the output layer was actually a graphic image. One benefit: similar characters (o, O, and 0) are close to each other. This makes it possible and easy for the next component to reason about the trade-offs of interpreting a character that looks like a 1 into an l, for instance.

The main message of this paper is that better pattern recognition systems can be built by relying more on automatic learning, and less on hand-designed heuristics (1).

The crucial claim of this paper is that *global training* is more effective (and less costly) than *local training*. "Hand-crafted feature extraction can be advantageously replaced by carefully designed learning machines that operate directly on pixel images" (1).

**local learning/training** - training each of a system's components individually, then connecting those components to form the system.

**global learning/training** - training an entire system with respect to global (not local) criteria of how successful the system as a whole is.

Examples in depth on page 4.

"With real training data, the correct sequence of labels for a string is generally available, but the precise locations of each corresponding character in the input mage are unknown" (29).

The authors build their check reading system using a *graph transformer network* or *GTN*. A GTN can be used as a function--given some input, calculate an output. This paper passed an image of a check to a GTN as input and received the amount written on the check as a floating point number (eg "1", "4.07", "2,050.33") as output.

A **GTN (Graph Transformer Network)** is a system of connected components (or steps) where each component receives a *graph* as input and returns a graph as output. The graphs the GTN returns are special--each *path* through the graph from the start node to the terminal node represents a solution to the problem, and each edge carries a weight/number representing an error. After training the GTN, the path through the graph with the smallest accumulated error from start node to end node is likely the correct solution. Training a GTN is covered in a section below.

The GTN in the final algorithm (36) that ended up reading millions of checks per day was similar to one illustrated below. (*As an example of a difference, it had another step before the segmenter which determined several candidate locations for where the amount box might be located on the check*).

The GTN in the image above has a few steps (the first step is at the bottom):

- (Graph, Input) The input graph to this GTN has one edge which holds an image of the amount box.
- (Component, "Segmenter") chooses points to vertically cut the amount image at, and makes images between cuts. Images/cuts may overlap.
- (Graph, "Segmentation Graph") Each edge holds an image of a possible character from the original image.
- (Component, "Recognition Transformer") Runs a classifier on each edge/image and replaces each image in the segmentation graph with N edges where N represents the number of characters/classes the recognition transformer can recognize (characters include "1", ".", "9", "-", ...).
- (Graph, "Interpretation Graph") Each edge holds a character (eg "9").
- (Component, ...) The remaining components are how the system is able to pick the best path through the recognition graph and produce the final answer, "34".

The Object Oriented GTN approach uses modules that define an "fprop" method and a "bprop" method. The design is generalizable to GTNs with cycles (17).

"In general, the bprop method of a function F is a multiplication by the Jacobian of F...The bprop method of a fanout (a "Y" connection) is a sum...The bprop method of a multiplication by a matrix is a multiplication by the transpose of that matrix..." (17).

"Interestingly, certain non-differentiable modules can be inserted into a multi-module system without adverse effect. An interesting example of that is the multiplexer module. It has two (or more) regular inputs, one switching input, and one output. The module selects one of its inputs, depending upon the (discrete) value of the switching input, and copies it on its output. While this module is not differentiable with respect to the switching input, it is differentiable with respect to the regular inputs. Therefore the overall function of a system that includes such modules will be differentiable with respect to its parameters as long as the switching input does not depend upon the parameters" (18).

"Another interesting case is the min module. This module has two (or more) inputs and one output. The output of the module is the minimum of the inputs. The function of this module is differentiable everywhere, except on the switching surface...Interestingly, this function is continuous and reasonably regular, and that is sufficient to ensure the convergence of a Gradient-Based Learning algorithm" (18).

**Graph Transformers** - modules that take one or several graphs as input and produce graphs as output (18).

**Graph Transformer Networks** - a network of Graph Transformers. "Modules in a GTN communicate their states and gradients in the form of directed graphs whose arcs carry numerical information (scalars or vectors)" (18).

A GTN has several parameters which need to be tuned in order to maximize its ability to give the correct answer. For instance, the *Recognition Transformer* may have parameters representing which region in the image is important to examine when trying to classify a "4" vs a "5". Rather than some expert making a decision on what the best parameters are, the system is able to automatically compute the best (or very good) parameters through gradient-based learning.

- "Gradient-Based Learning draws on the fact that it is generally much easier to minimize a reasonably smooth, continuous function than a discrete...function" (3).
- "The gap between the expected error rate on the test set \(E_{test}\) and the error rate on the training set \(E_{train}\) decreases with the number of training samples" (3).
- "When increasing the capacity h, there is a trade-off between the decrease of \(E_{train}\) and the increase of the gap [between \(E_{train}\) and \(E_{test}\)], with an optimal value of the capacity h that achieves the lowest generalization error \(E_{test}\)" (3).
- "The presence of local minima in the loss function does not seem to be a major problem in practice" (3).
- "To ensure that the global loss function...is differentiable, the overall system is built as a feed-forward network of differentiable modules" (5).
- "The function implemented by each module must be continuous and differentiable almost everywhere with respect to the internal parameters of the module...and with respect to the module's inputs" (5).

"If the partial derivative of E^{p} with respect to X_{n} is known, then the partial derivatives of E^{p} with respect to W_{n} and X_{n - 1} can be computed using the backward recurrence

$$ \begin{align} \frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W} (W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n} \\ \frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X} (W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n} \end{align} $$

where \( \frac{\partial F}{\partial W} (W_n, X_{n-1}) \) is the Jacobian of F with respect to W evaluated at the point \( (W_n, X_{n-1}) \) ... The above formula uses the product of the Jacobian with a vector of partial derivatives, and it is often easier to compute this product directly without computing the Jacobian beforehand" (5).

"\( X_n \) is a vector representing the output of the module, \(W_n\) is a vector of the tunable parameters in the module...and \(X_{n-1}\) is the module's input vector (as well as the previous module's output vector). The input \(X_0\) to the first module is the input pattern" (5).

At first the equations above didn't look like a recurrence to me, but they are: each step computes \( \frac{\partial E^p}{\partial X_{n-1}} \) from \( \frac{\partial E^p}{\partial X_n} \), and that result feeds the same two formulas at layer n - 1, walking the gradient backwards from the output to the input.
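To convince myself the recurrence works, here is a minimal sketch (toy modules of my own, not the paper's system) of the fprop/bprop pattern: each module's bprop multiplies the incoming gradient by its Jacobian, turning \( \partial E^p / \partial X_n \) into \( \partial E^p / \partial X_{n-1} \).

```python
import numpy as np

class Linear:
    def __init__(self, rng, n_in, n_out):
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
    def fprop(self, x):
        self.x = x
        return self.W @ x
    def bprop(self, dE_dy):
        self.dW = np.outer(dE_dy, self.x)  # dE/dW, the parameter gradient
        return self.W.T @ dE_dy            # dE/dx = Jacobian^T * dE/dy

class Tanh:
    def fprop(self, x):
        self.y = np.tanh(x)
        return self.y
    def bprop(self, dE_dy):
        return (1 - self.y ** 2) * dE_dy   # elementwise Jacobian

rng = np.random.default_rng(0)
modules = [Linear(rng, 4, 3), Tanh(), Linear(rng, 3, 2)]
target = np.array([1.0, -1.0])

def fprop_all(x):
    for m in modules:
        x = m.fprop(x)
    return x

x0 = rng.standard_normal(4)            # X_0, the input pattern
out = fprop_all(x0)                    # X_N, the network output
grad = out - target                    # dE/dX_N for E = 0.5 * ||X_N - target||^2
for m in reversed(modules):
    grad = m.bprop(grad)               # after the loop, grad is dE/dX_0
```

Running the backward loop leaves `grad` holding the gradient of the loss with respect to the input, and each `Linear` holding its `dW`, exactly as the recurrence prescribes.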

"Convolutional networks combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial or temporal sub-sampling" (6). "The input plane receives images of characters that are approximately size-normalized and centered" (6).

**feature map** - a plane of features resulting from a CNN operation.

**sub-sampling layers** - to produce reduced resolution feature maps reduces "the sensitivity of the output to shifts and distortions" (6). "Successive layers of convolutions and sub-sampling are typically alternated, resulting in a 'bi-pyramid': at each layer, the number of feature maps is increased as the spatial resolution is decreased" (7).

"Once a feature has been detected, its exact location becomes less important. Only its approximate position relative to other features is important...Not only is the precise position of each of those features irrelevant for identifying the pattern, it is potentially harmful because the positions are likely to vary for different instances of the character" (6).

"Convolutional networks can be seen as synthesizing their own feature extractor" (7). "The weight sharing technique has the interesting side effect of reducing the number of free parameters, thereby reducing the 'capacity' of the machine and reducing the gap between test error and training error" (7).

"Fixed-size convolutional networks that share weights along a single temporal dimension are known as Time-Delay Neural Networks (TDNNs)" (7).

"The reason [that the input is significantly larger than the largest character in the database] is that it is desirable that potential distinctive features such as stroke end-points or corners can appear in the center of the receptive field of the highest-level feature detectors" (7).

"The values of the input pixels are normalized so that the background level (white) corresponds to a value of -0.1 and the foreground (black) corresponds to 1.175. This makes the mean input roughly 0, and the variance roughly 1 which accelerates learning" (7).

"Why not connect every S2 feature map to every C3 feature map? The reason is two fold. First, a non-complete connection scheme keeps the number of connections within reasonable bounds. More importantly, it forces a break of symmetry in the network. Different feature maps are forced to extract different (hopefully complementary) features because they get different sets of inputs" (8).

"All the quantities manipulated are viewed as penalties, or costs, which if necessary can be transformed into probabilities by taking exponentials and normalizing" (19).

"Finally, the output layer is composed of Euclidean Radial Basis Function units (RBF), one for each class, with 84 inputs each...each output RBF unit computes the Euclidean distance between its input vector and its parameter vector" (8).

The RBF units' weights were designed, not chosen arbitrarily. Usually in an n-class classification problem you might have an n-output final layer, maximizing one output and minimizing all the rest for each of the n classes. In this paper the authors instead chose to represent the output layer *as stylized images of characters*: each RBF's parameter vector of size 7x12 (=84) is a stylized character bitmap. This means that similar characters appear closer together in output space, and a component on top can reason about what the proper character is given the context of the surrounding characters (8).

"Another reason for using such distributed codes rather than the more common "1 of N" code (also called place code, or grand-mother cell code) for the outputs is that non distributed codes tend to behave badly when the number of classes is larger than a few dozen. The reason is that output units in a non-distributed code must be off most of the time. This is quite difficult to achieve with sigmoid units" (8).
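A tiny sketch of such a Euclidean RBF output unit (shapes from the paper--84 inputs, one unit per class--but the parameter vectors here are random stand-ins for the designed 7x12 character bitmaps):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_inputs = 10, 84
# one parameter vector per class (random stand-ins for stylized bitmaps)
params = rng.standard_normal((n_classes, n_inputs))

def rbf_penalties(x):
    # each RBF unit outputs the squared Euclidean distance between its
    # input vector and its parameter vector
    return np.sum((params - x) ** 2, axis=1)

x = params[3].copy()              # an input that exactly matches class 3
penalties = rbf_penalties(x)
best = int(np.argmin(penalties))  # -> 3, the class with the lowest penalty
```

Low penalty means a close match, which is why these outputs slot naturally into a GTN whose paths accumulate penalties.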

"Saturation of the sigmoids must be avoided because it is known to lead to slow convergence and ill-conditioning of the loss function."

"The role of the Viterbi transformer is to extract the best interpretation from the interpretation graph" (19). The interpretation graph is the graph of all "possible interpretations for all the possible segmentations of the input" (19). "The Viterbi transformer produces a graph G_{vit} with a single path...[which] is the path of least cumulated penalty in the Interpretation graph" (20).
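A toy illustration of what the Viterbi transformer computes (the graph, labels, and penalties below are invented for this example): dynamic programming over a topologically ordered interpretation graph to find the path of least cumulated penalty.

```python
# edges of a tiny interpretation graph: (from_node, to_node, label, penalty),
# listed in topological order
edges = [
    (0, 1, "3", 0.2), (0, 1, "8", 1.5),
    (1, 2, "4", 0.1), (1, 2, "1", 0.9),
]

# best[node] = (least cumulated penalty from node 0, labels along that path)
best = {0: (0.0, "")}
for u, v, label, edge_penalty in edges:
    cost = best[u][0] + edge_penalty
    if v not in best or cost < best[v][0]:
        best[v] = (cost, best[u][1] + label)

penalty, interpretation = best[2]   # interpretation == "34"
```

The single surviving path (here reading "34") is exactly what the paper's \(G_{vit}\) carries.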

The *path selector* "takes the interpretation graph and the desired label sequence as input. It extracts from the interpretation graph those paths that contain the correct (desired) label sequence. Its output graph G_{C} is called the *constrained interpretation graph* (also known as *forced alignment* in the HMM literature), and contains all the paths that correspond to the correct label sequence."

**generative models** - learn p(x, y)

**discriminative models** - learn p(y | x)

**the collapse problem** - "The minimum of the loss function is attained, not when the recognizer always gives the right answer, but when it ignores the input, and sets its output to a constant vector with small values for all the components...[this] only occurs if the recognizer outputs can simultaneously take their minimum value" (22). Can't occur if RBF values are fixed and distinct.

"A modification of the training criterion can circumvent the collapse problem...and at the same time produce more reliable confidence values. The idea is to not only minimize the cumulated penalty of the lowest penalty path with the correct interpretation, but also to somehow increase the penalty of competing and possibly incorrect paths that have a dangerously low penalty. This type of criterion is called *discriminative*, because it plays the good answers against the bad ones. Discriminative training procedures can be seen as attempting to build appropriate separating surfaces between classes rather than to model individual classes independently of each other" (22).

Back propagate \(E_{dvit} = C_{cvit} - C_{vit}\), where \(C_{cvit}\) is the penalty of the best constrained path and \(C_{vit}\) is the penalty of the best unconstrained path (23). After back propagating to the interpretation graph: if the best constrained path equals the best unconstrained path (\(C_{cvit} = C_{vit}\)), we propagate 0 error backwards. If an arc appears in the constrained best path but not in the unconstrained best path, its gradient is +1. If an arc is in the unconstrained best path but not in the constrained one, its gradient is -1 (23).

"The main problem [with the discriminative viterbi algorithm] is that the criterion does not build a margin between the classes. The gradient is zero as soon as the penalty of the constrained viterbi [(best)] path is equal to that of the viterbi path" (24).

"...it could be argued that...multiple paths with identical label sequences are more evidence that the label sequence is correct" (24).

There are many ways to combine the penalties of multiple paths. The **forward algorithm** efficiently computes the **forward penalty**: "the penalty of an interpretation should be the negative logarithm of the sum of the negative exponentials of the penalties of the individual paths. The overall penalty will be smaller than all the penalties of the individual paths." This algorithm uses logadd, which can be seen as a soft version of the min function (24). \(-\log(\sum_{p \in \text{paths}} e^{-\text{penalty of path } p})\) "The forward penalty is always lower than the cumulated penalty on any of the paths, but if one path dominates (with a much lower penalty), its penalty is almost equal to the forward penalty" (25).

"The Forward training GTN is only a slight modification of the...Viterbi training GTN. It suffices to turn the Viterbi transformers...into Forward Scorers that take an interpretation graph as input and produce the forward penalty of that graph on output. Then the penalties of all the paths that contain the correct answer are lowered, instead of just that of the best one" (25).

"The advantage of the forward penalty with respect to the Viterbi penalty is that it takes into account all the different ways to produce an answer, and not just the one with the lowest penalty" (25).
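The forward penalty falls out directly from the definition above; here is a sketch of a numerically stable version of \(-\log \sum_p e^{-\text{penalty}_p}\) (the path penalties are invented):

```python
import math

def forward_penalty(path_penalties):
    # -log(sum(exp(-p))), computed stably by factoring out the smallest penalty
    m = min(path_penalties)
    return m - math.log(sum(math.exp(m - p) for p in path_penalties))

# always lower than the best individual path's penalty...
print(forward_penalty([1.0, 2.0, 3.0]))   # about 0.59
# ...but almost equal to it when one path dominates
print(forward_penalty([1.0, 30.0]))       # about 1.0
```

This is the "soft min" behavior: many near-tied paths pull the penalty down, while a single dominant path leaves it essentially at the Viterbi value.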

**discriminative forward criterion** - "*maximization of the posterior probability of choosing the paths associated with the correct interpretation*. This posterior probability is defined as the exponential of minus the constrained forward penalty, normalized by the exponential of minus the unconstrained forward penalty" (25).

"Discriminative forward training is an elegant and efficient way of solving the infamous *credit assignment problem*...the same idea can be used in all situations where a learning machine must choose between discrete alternative interpretations" (26).

"sweep a recognizer at all possible locations across a normalized image...the system essentially examines all the possible segmentations of the input" (27).

Three problems:

- expensive to apply the recognizer at all possible locations across the word
- the recognizer must be robust to characters appearing at the edges of its input, since neighboring characters can overlap those edges
- images can't be perfectly size normalized. "Characters within a string may have widely varying sizes and baseline positions" (27).

Use a "replicated convolutional network, also called a **Space Displacement Neural Network** or SDNN...convolutional networks can be scanned or replicated very efficiently over large, variable-size input fields" (27).

Uses a "**grammar transducer**, more specifically a **finite-state transducer** that encodes the relationship between input strings of class labels and corresponding output strings of recognized characters." "A transducer therefore transforms a weighted symbol sequence into another weighted symbol sequence." (28)

SDNNs can be used for object detection and spotting. Using multiple resolutions is helpful (30).

I started reading this paper after taking a look at the paper from DeepMind on how they got software to learn to play Atari (Playing Atari with Deep Reinforcement Learning) (video). That paper is shorter (9 pages), assumes a lot of knowledge from the reader, and references this paper. This paper is longer (46 pages) and explains new concepts in detail. After reading a few pages, I found it to be a concise and intense bout of learning, pitched slightly above my current level of understanding--the perfect next step on my learning quest.

I need to practice and read more on derivatives and gradients. They are used a lot.

My work for some exercises for this chapter can be found at github.com/joshterrell805/Introduction_to_Probability_Grinstead/tree/master/4

Chapter 4 is about **conditional probability**.

\(P(F|E)\) - the conditional probability of event F given that event E has occurred

"In the absence of information to the contrary, it is reasonable to assume that the probabilities" for each outcome in E "should have the same relative magnitudes that they had before we learned that E had occurred."

The book shows a cool derivation of the following:

$$ P(F|E) = \frac{P(F \cap E)}{P(E)} $$

**Bayes' probability** - aka "inverse probability"; it allows us to invert probabilities: if we know P(A|B), we can find P(B|A).

$$ P(H_i|E) = \frac{P(H_i \cap E)}{P(E)} \\ = \frac{P(H_i)P(E|H_i)}{\sum_{k=1}^m P(H_k \cap E)} \\ = \frac{P(H_i)P(E|H_i)}{\sum_{k=1}^m P(H_k)P(E|H_k)} $$

...where H is used to represent "hypothesis" and E is used to represent "evidence". Often we want to know the probability of the hypothesis (eg medical diagnosis) given the evidence, but we only know the probability of the evidence given the hypothesis. The Bayes' formula allows us to invert the probabilities. This assumes the hypotheses are disjoint.
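A worked instance of the formula (the numbers--a 1% prior and a 95%/5% test--are invented for illustration):

```python
# hypotheses: H_1 = "patient is sick", H_2 = "patient is healthy" (disjoint)
p_h = [0.01, 0.99]           # priors P(H_i)
p_e_given_h = [0.95, 0.05]   # P(positive test | H_i)

# Bayes' formula: P(H_i | E) = P(H_i) P(E | H_i) / sum_k P(H_k) P(E | H_k)
denom = sum(ph * pe for ph, pe in zip(p_h, p_e_given_h))
p_h_given_e = [ph * pe / denom for ph, pe in zip(p_h, p_e_given_h)]

# despite the accurate test, P(sick | positive) is only about 0.16
```

The inversion is what makes the low prior bite: the evidence is far more often a false positive from the large healthy population than a true positive from the small sick one.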

**independent events** - F is independent of E if \(P(F|E) = P(F)\) and \(P(E|F) = P(E)\) ("each equation implies the other.") "Two events E and F are independent if and only if \(P(F \cap E) = P(E)P(F)\)."

**mutually independent** - "A set of events \(\{A_1, A_2, \dots, A_n\}\) is said to be *mutually independent* if for any subset \(\{A_i, A_j, \dots, A_m\}\) of these events we have \(P(A_i \cap A_j \cap \dots \cap A_m) = P(A_i)P(A_j) \dots P(A_m)\)." "If all pairs of a set of events are independent," the whole set is *not necessarily* mutually independent. "It is important to note that the statement \(P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1)P(A_2) \dots P(A_n)\) does not imply that the events \(A_1, A_2, \dots, A_n\) are mutually independent."

**joint distribution** - the distribution of the co-occurrence of multiple outcomes/random variables (which may or may not be independent). Ex: the probability distribution function over seeing a live chicken and meeting the president today. Possible outcomes: (chicken, president), (no chicken, president), (chicken, no president), (no chicken, no president). If the random variables are mutually independent, the joint distribution function is just the product of the distribution functions of the random variables.

**independent trials process** - "sequence of random variables...that are mutually independent and that have the same distribution is called a sequence of independent trials or an *independent trials process*." Can be used to model repeating an experiment some number of times.

Recall \(f(x)\) is the density function such that \(\int_{-\infty}^{+\infty} f(x)\,dx = 1\).

**continuous density function**

$$ f(x|E)= \begin{cases} f(x)/P(E), & \text{if } x \in E\\ 0, & \text{if } x \notin E \end{cases} $$

**continuous conditional probability**

$$ P(F|E) = \int_F f(x|E)dx $$

**beta distribution** - "The Beta distribution is best for representing a probabilistic distribution of probabilities- the case where we don't know what a probability is in advance, but we have some reasonable guesses." - stats.stackexchange. wiki, math.utah

I recently finished listening to "The Soul of a New Machine" by Tracy Kidder.

I took almost zero notes when reading this book, but instead just listened and tried to absorb the story while cooking and cleaning.

To me, one of our greatest assets as humans is the ability to **"Stand on the Shoulders of Giants"**—to grow by passing on knowledge and experience from generation to generation. I turned to "historical non-fiction" in hopes of learning more directly from others' experiences.

Unlike everything else I've read, learned from, and posted about thus far, this book is a story book. What's different about a story is that it doesn't tell me in detail about something technical, and it doesn't tell me about how to do something, it conveys experience. My hope is to learn from others' successes and failures.

Tracy tells the story about a team of engineers employed at Data General making a new CPU, *the Eclipse*. He lets you talk with the manager, the senior engineers, and the "micro kids" (college grad new-hires) from a little bit before the CPU started being created to after it was released.

The engineers "didn't work for money." They were building something awesome. West (the manager) did not pat anyone on the back. He stayed out of their way and let them design, build, and test it. Also, without them knowing, he tracked what was going on with the project and solved problems no one knew existed (eg, the special cable). He did not show his worries to the team. Nobody asked the team to work overtime; they did it on their own and created a culture of living and breathing the project. West selected ambitious, smart engineers who really wanted to put their name on something and have an opportunity to build, not be some cog in a company like IBM. He hired intelligent engineers who were willing to forgo family and leisure for the chance to build something.

After the project was released, the regional manager had a pep talk: "What motivates people? Ego and money to buy what they and their families want." This was a new day. Clearly **the machine no longer belonged to the team and its makers.**

I think a lot of the learning of this book is stuck somewhere in my head, but I'd like to jot down a couple of thoughts.

It was good to read about the micro-kids. I too want to build something cool, and oftentimes I find myself driving too hard toward the future without taking a moment to enjoy life. Tracy's note was brief, but he did mention that the kids would burn out at some point, and that's what happened to them. I think a better, longer-term plan is to **balance drive with appreciation**--to work toward the future and appreciate the present in concert. The modern-day analogy is working for a start-up: you do tons of work with crummy compensation for the chance to reap large rewards and create something that's your own. My plan is instead to slowly and steadily keep learning and keep building mastery until **my skills and experience are great**, rather than betting on one product or idea. I'll reach my career prime much later, but in waiting I think I'll be much happier now and later. In the end, the product they worked so hard on made up 10% of the company's revenue, and then the company slowly declined.

While I think the burn-out work-style was poor, I think West did something great. He helped the team succeed not by telling them what to make, but by communicating the importance of the thing they were working on, and by standing back and letting them **have ownership.** Ownership is one of Amazon's principles, and one I'm finding more and more important.

My work for some exercises for this chapter can be found at github.com/joshterrell805/Introduction_to_Probability_Grinstead/tree/master/3

Chapter 3 is about **combinatorics**, and I took a combinatorics class in college, but this chapter kept my attention by talking about some very interesting historical problems.

"Let A be any finite set. A permutation of A is a one-to-one mapping of A onto itself."

Notation: σ is the mapping symbol; elements map from top to bottom:

$$ \sigma = \left(\begin{array}{ccc} a & b & c \\ b & a & c \end{array}\right) $$

Permutations of events: "A task is to be carried out in a sequence of r stages. There are \(n_1\) ways to carry out the first stage; for each of these \(n_1\) ways, there are \(n_2\) ways to carry out the second stage... The total number of ways in which the entire task can be accomplished" is \(N = n_1 \cdot n_2 \cdots n_r\).

**falling factorial** - the number of permutations of length r from a set of size n (notation: \(n_r\), read "n lower r" or "n down r") is

$$ n_r = \frac{n!}{(n-r)!} $$

"Let \(a_n\) and \(b_n\) be two sequences of numbers. We say that \(a_n\) is **asymptotically equal** to \(b_n\), and write \(a_n \sim b_n\), if

$$ \lim_{n \to \infty} \frac{a_n}{b_n} = 1 $$

**Stirling's formula**

$$ n! \sim n^ne^{-n}\sqrt{2\pi n} $$

Some interesting permutation problems:

- Birthday problem
  - *Intuition cannot always be trusted in probability.*
- hat check problem
- record problem
  - The probability of finding k records in n events. A record is a new highest (or lowest) value.
  - Ex: the probability of finding k=3 "high temperature" records in n=10 consecutive years.
  - Treat the years as a set of the integers {1, 2, 3, ... 10}, where 1 represents the year with the lowest high and 10 the year with the highest. The question becomes: how many permutations of {1, 2, 3, ... 10} are there such that exactly two numbers after the first are each larger than all the numbers before them? [7,2,8,5,10,1,3,9,4,6] is one such sequence, where the records are 7, 8, and 10.
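The record count is easy to check by brute force; here is a quick simulation sketch using the k=3, n=10 example above:

```python
import random

def count_records(perm):
    # a record is an element larger than everything before it
    records, best = 0, float("-inf")
    for x in perm:
        if x > best:
            records, best = records + 1, x
    return records

# sanity check against the example sequence: records are 7, 8, and 10
assert count_records([7, 2, 8, 5, 10, 1, 3, 9, 4, 6]) == 3

# estimate P(exactly k=3 records in n=10 years) over random permutations
random.seed(0)
n, k, trials = 10, 3, 100_000
years = list(range(1, n + 1))
hits = 0
for _ in range(trials):
    random.shuffle(years)
    hits += count_records(years) == k
estimate = hits / trials
```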

**combinations** - the distinct subsets of some set U that have exactly j elements. U is a set with n elements. "binomial coefficient" = "n choose j" = \(\binom{n}{j}\)

$$ \binom{n}{j} = \binom{n-1}{j} + \binom{n-1}{j-1} = \frac{n_j}{j!} = \frac{n!}{j!(n-j)!} $$

**bernoulli trials process** - "sequence of n chance experiments such that (1) each experiment has two possible outcomes, which we may call success and failure. (2) The probability p of success on each experiment is the same for each experiment, and this probability is not affected by any knowledge of previous outcomes."

"probability that in n Bernoulli trials there are exactly j successes" (where \(q = 1 - p\)):

$$ b(n,p,j) = \binom{n}{j}p^{j}q^{n-j} $$

**binomial theorem**

$$ (a + b)^n = \sum_{j=0}^{n}{\binom{n}{j}a^j b^{n-j}} $$

The book describes an experiment where aspirin works 60% of the time to alleviate headaches. We want to test a new drug to determine whether it is more effective than standard aspirin for alleviating headaches. We are to randomly select n=100 patients to try the new drug (double blind, of course).

In this experiment, the critical value is a number between 0 and n=100 that we determine before running the experiment. If at least "critical value" people experience an alleviated headache from this new drug, we'll say that the new drug is more effective than aspirin. If the critical value were <= 60, then we would often falsely conclude that the new drug is more effective than aspirin even though 60% of people have alleviated headaches with aspirin and <= 60% of people have alleviated headaches with our new drug. Therefore the critical value must be > 60. But how much greater?

We want to set the critical value high enough to where both the type-1 and type-2 errors are improbable. Because of variance from experiment to experiment, the effectiveness of the drug cannot be determined simply by comparing to 60. (It's possible, and not very unlikely, to flip 4 heads in a row with a fair coin).

**type 1 error** - The error we make when we mistakenly conclude that the new drug is more effective than aspirin (because we observe >= "critical value" people with alleviated headaches) even though the drug is no more effective than aspirin.

**type 2 error** - The error we make when we mistakenly conclude that the new drug is no more effective than aspirin (because we observe < "critical value" people with alleviated headaches) even though the drug is more effective than aspirin.

The program power-curve.py calculates the range of critical values that ensure the type-1 and type-2 error rates are low, and it draws the power curves for the smallest and largest critical values that meet these criteria.

This graph shows that for all critical values in the range [69, 73]:

- P(type 1 error) < 0.05. The probability of rejecting the null hypothesis (accepting that the new drug is more effective than aspirin) is < 0.05 if the new drug is actually <= 60% effective.
- P(type 2 error) < 0.05. The probability of accepting the null hypothesis (rejecting that the new drug is more effective than aspirin) is < 0.05 if the new drug is actually >= 80% effective.
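A sketch of the critical-value search this describes, using the Bernoulli trials formula \(b(n,p,j)\) from above (I'm assuming the same 0.05 bound on both error rates; power-curve.py itself is not reproduced here):

```python
from math import comb

def b(n, p, j):
    # probability of exactly j successes in n Bernoulli trials
    return comb(n, j) * p ** j * (1 - p) ** (n - j)

n = 100
valid = []
for c in range(n + 1):
    # type 1: conclude "more effective" (>= c successes) when really p = 0.6
    type1 = sum(b(n, 0.60, j) for j in range(c, n + 1))
    # type 2: conclude "no more effective" (< c successes) when really p = 0.8
    type2 = sum(b(n, 0.80, j) for j in range(c))
    if type1 < 0.05 and type2 < 0.05:
        valid.append(c)

print(valid)  # the acceptable critical values, [69, ..., 73]
```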

This section went over shuffling cards in order to make the deck random.

My work for some exercises for this chapter can be found at github.com/joshterrell805/Introduction_to_Probability_Grinstead/tree/master/2

The chapter starts off by mentioning that there's a problem with using the discrete methods of chapter 1 to represent an Ω that contains an uncountably infinite number of outcomes. If we assign every outcome a positive probability ε, then the sum of the probabilities of all outcomes in Ω is ∞, not 1. If we assign every outcome a probability of 0, then the sum of probabilities is 0, not 1. This problem was elaborated on more in Aidan Lyon's Philosophy of Probability. In section 2.2 the authors describe how to build a probability model in the case of an uncountably infinite number of outcomes.

*rnd* - "returns a random real number in the interval [0, 1]. ...the values are determined by an algorithm, so a sequence of such values is not truly random. Nevertheless, the sequences produced by such algorithms behave much like theoretically random sequences."

"It is sometimes desirable to estimate quantities whose exact values are difficult or impossible to calculate exactly. In some of these cases, a procedure involving chance, called a *Monte Carlo procedure*, can be used to provide such an estimate."

The book goes on to give an example of calculating the area under \(y = x^2\) where \(0 \le x \le 1\) and \(0 \le y \le 1\) using simulation. It picks 10k pairs of (x, y) within the bounds and finds the proportion where \(y \le x^2\). The area under the curve is approximated by that proportion multiplied by the area of the bounds, which is 1. The area is successfully approximated to be roughly 1/3.
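The book's estimate is easy to reproduce (a sketch with 100k points instead of 10k; the seed is arbitrary):

```python
import random

random.seed(0)
trials = 100_000
hits = 0
for _ in range(trials):
    x, y = random.random(), random.random()
    if y <= x * x:          # the point falls under the curve y = x^2
        hits += 1
estimate = hits / trials    # approximates the true area, 1/3
```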

"When we simulate an experiment of this type *n* times to estimate a probability, we can expect the answer to be in error by at most \(1 / \sqrt{n}\) at least 95 percent of the time." Later on the chapter discusses that this estimate in error is only valid when certain conditions are met, but doesn't elaborate on exactly what those circumstances are or how to adjust the formula if the circumstances are different.

Finally, this section goes over Buffon's Needle for approximating π and Bertrand's Paradox (this one has funny gifs :)) and some history of the problems in this section.

This section left me wanting to know more about **Monte Carlo simulations** and correctly estimating their error.

This section deals with "assigning probabilities to the outcomes and events" of experiments where there are an uncountably infinite number of outcomes.

"Let \(X\) be a continuous real-valued random variable. A **density function** for X is a real valued function \(f\) which satisfies"

$$ P(a \le X \le b) = \int_a^b f(x)dx $$

$$ P(X \in E) = \int_E f(x)dx $$

"One can consider \(f(x)dx\) as the probability of the outcome \(x\)... \(f(x)\) is called the density function of the random variable \(X\). The fact that the area under \(f(x)\) and above an interval corresponds to a probability is the defining property of density functions."

"It is *not* the case that all continuous real-valued random variables possess density functions."

**uniform** or **equiprobable** - density functions for which any two events E1 and E2 of the same size (e.g., intervals of the same length) are equally likely.

"A glance at the graph of a density function tells us immediately which events of an experiment are more likely."

"Let \(X\) be a continuous real-valued random variable. Then the cumulative distribution function of X is defined by the equation"

$$ F_X(x) = P(X \le x) $$

"If X is a continuous real-valued random variable which possesses a density function, then it also has a cumulative distribution function."

"It is quite often the case that the cumulative distribution function is easier to obtain than the density function...Once we have the cumulative distribution function, the density function can be easily obtained by differentiation."

$$ \frac{d}{dx}F(x) = f(x) $$
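As a small worked instance of this relationship (my own example, not one from the book), take the density \(f(x) = 2x\) on \([0, 1]\):

$$ F(x) = P(X \le x) = \int_0^x 2t\,dt = x^2, \qquad \frac{d}{dx}F(x) = 2x = f(x) $$

so, for example, \(P(\tfrac{1}{2} \le X \le 1) = F(1) - F(\tfrac{1}{2}) = 1 - \tfrac{1}{4} = \tfrac{3}{4}\).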

**distribution** - shorthand for *cumulative distribution function*, \(F(x)\)

**density** - shorthand for *probability density function*, \(f(x)\)

**exponential density** - Useful for representing an experiment where an event happens after a random amount of time. \(X\) denotes "the time between successive occurrences." \(f(t) = \lambda e^{-\lambda t}\) where \(\lambda\) "represents the reciprocal of the average value of X." "To simulate a value of X, we compute the value of the expression \((-1/\lambda)log(rnd)\)." The exponential density function has the **memoryless property**, "the amount of time that we have to wait for an occurrence does not depend on how long we have already waited. The only continuous density function with this property is exponential density."
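The book's recipe \((-1/\lambda)\log(rnd)\) is inverse-transform sampling. A minimal sketch (function name, seed, and λ value are mine):

```python
import math
import random

def simulate_exponential(lam, n=100_000, seed=7):
    """Draw n values from the exponential density f(t) = lam*e^(-lam*t)
    using the book's recipe X = (-1/lam) * log(rnd)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which keeps log() defined
    return [(-1 / lam) * math.log(1 - rng.random()) for _ in range(n)]

samples = simulate_exponential(lam=2.0)
# lam is the reciprocal of the average value of X, so the sample
# mean should be close to 1/lam
print(sum(samples) / len(samples))  # roughly 0.5
```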

Resources

- The book (includes answers to odd numbered questions): http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html
- Another link to the book: https://math.dartmouth.edu/~prob/prob/prob.pdf
- My github repository containing code to construct exercises: https://github.com/joshterrell805/Introduction_to_Probability_Grinstead

My work for some exercises for this chapter can be found at github.com/joshterrell805/Introduction_to_Probability_Grinstead/tree/master/1

**random variable** - "an expression whose value is the outcome of a particular experiment"

**distribution function** - function which maps outcomes to probabilities

**frequency concept of probability** - "if we have a probability *p* that an experiment will result in outcome A, then if we repeat this experiment a large number of times we should expect that the fraction of times that A will occur is about *p*."

The chapter mentions a few other concepts like Bernoulli trials and the law of large numbers, but promises to discuss them in later chapters so we'll wait until then to take notes on them.

"The real power of simulation comes from the ability to estimate probabilities when they are not known ahead of time."

"Accurate results by simulation require a large number of experiments."

The book gives an example of flipping a fair coin an even number of times. If the coin lands as heads, *Peter* wins a penny. If the coin lands as tails, Peter loses a penny. "It is natural to ask for the probability that he will win *j* pennies" (where *j* can range from -n to +n where n is the number of tosses). "It is reasonable to guess that the value of *j* with the highest probability is j=0" (Peter wins and loses the same number of pennies). Likewise j=+/-n intuitively have the lowest probabilities.

"A second interesting question about the game is the following: How many times in the 40 tosses will Peter be in the lead?...We adopt the convention that, when Peter's winnings are 0, he is in the lead if he was ahead at the previous toss and not if he was behind at the previous toss...Again, our intuition might suggest that the most likely number of times to be in the lead is " 1/2 of the time.

We can answer these questions with simulation. Simulation indicates that Peter wins about 0 cents on average, as expected (graph). However, the most likely amounts of time for Peter to be in the lead are about 0% and 100% of the tosses; being in the lead 50% of the time is the *least* likely (counter-intuitive!) (graph).
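Peter's game is easy to simulate. The sketch below is my own code, using the book's tie convention, and it reproduces the counter-intuitive U-shape: the extreme lead times come up far more often than an even split.

```python
import random
from collections import Counter

def play_game(rng, tosses=40):
    """One of Peter's games: +1 penny per head, -1 per tail.
    Returns (final winnings, tosses spent in the lead), using the
    book's convention that at a tie Peter is in the lead only if
    he was ahead after the previous toss."""
    winnings = 0
    in_lead = False
    lead_count = 0
    for _ in range(tosses):
        winnings += 1 if rng.random() < 0.5 else -1
        if winnings > 0:
            in_lead = True
        elif winnings < 0:
            in_lead = False
        # at winnings == 0, in_lead keeps its previous value
        if in_lead:
            lead_count += 1
    return winnings, lead_count

rng = random.Random(3)
results = [play_game(rng) for _ in range(10_000)]
lead_counts = Counter(lead for _, lead in results)
# Extreme lead times (0 or 40 tosses) occur far more often than a
# 50/50 split (20 tosses), which is actually the least likely.
print(lead_counts[0] + lead_counts[40], lead_counts[20])
```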

At the end of section 1.1, the book discusses how computers generate random numbers. "The sequence of [random] numbers is actually completely determined by the first number. Thus, there is nothing really random about these sequences. However, they produce numbers that behave very much as theory would predict for random experiments." Such deterministic-but-random-looking sequences are called pseudo-random.

"In modern uses martingale has several different meanings, all related to *holding down*, in addition to the gambling use."

**sample space** - set of all possible outcomes (Ω)

**outcome** - a possible result of an experiment

**random variable** - denotes the value of the outcome (typically capital roman letter such as X)

**discrete sample space** - "if the sample space is either finite or countably infinite"

**countably infinite** - "A sample space is countably infinite if the elements can be counted, i.e., can be put in one-to-one correspondence with the positive integers."

**event** - a "subset of a sample space"

**distribution function** - "a real-valued function *m* whose domain is Ω and which satisfies:"

- \(m(\omega) \ge 0\), for all \(\omega \in \Omega\)
- \(\sum_{\omega \in \Omega}{m(\omega)} = 1\)

**probability** - for any subset E of Ω (\(E \subset \Omega\))
$$
P(E) = \sum_{\omega \in E}{m(\omega)}
$$

Some set rules:

$$ A \cup B = \{x | x \in A \text{ or } x \in B\} $$

$$ A \cap B = \{x | x \in A \text{ and } x \in B\} $$

$$ A - B = \{x | x \in A \text{ and } x \notin B\} $$

A is a subset of B (\(A \subset B\)) if every element in A is also an element of B.

Complement of A (\(\tilde{A}\) or \(\overline{A}\) or \(A^\complement\) or \(A^\prime\)...):

$$ \tilde{A} = \{ x | x \in \Omega \text{ and } x \notin A\} $$

More rules:

- \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
- \(P(\tilde{A}) = 1 - P(A)\)
- \(P(A) = P(A \cap B) + P(A \cap \tilde{B})\)
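These rules are easy to sanity-check on a concrete sample space. A small sketch of my own, using one die roll with the uniform distribution:

```python
from fractions import Fraction

# Sample space for one die roll with the uniform distribution m(w) = 1/6
omega = set(range(1, 7))
m = {w: Fraction(1, 6) for w in omega}

def P(event):
    """P(E) = sum of m(w) over the outcomes w in E."""
    return sum(m[w] for w in event)

A = {2, 4, 6}          # even roll
B = {4, 5, 6}          # roll greater than 3
A_comp = omega - A     # complement of A

assert P(A | B) == P(A) + P(B) - P(A & B)   # inclusion-exclusion
assert P(A_comp) == 1 - P(A)                # complement rule
assert P(A) == P(A & B) + P(A & (omega - B))
print(P(A | B))  # 2/3
```

Using `Fraction` instead of floats keeps the checks exact rather than approximate.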

**tree diagram** - root on left, leaves on right. "A *path* through the tree corresponds to a possible outcome of the experiment". Example diagram from mathisfun.com

**uniform distribution** - on a sample space Ω containing n elements is: \(m(\omega) = \frac{1}{n}\) for every \(\omega \in \Omega\).

"The decision as to which distribution function to select to describe an experiment is not a part of the basic mathematical theory of probability. The latter begins only when the sample space and the distribution function have already been defined."

**odds** - "If the odds are *r* to *s* in favor of an event E occurring" then: \(P(E) = \frac{r}{r+s}\)

Daniel H. Pink's thesis is that behaviorism, the carrot-and-stick approach to motivation, is counterproductive. People are internally motivated by autonomy, mastery, and purpose. If you reward and punish their behavior, you snuff out these internal drives. He supports his claims with lots of studies and experiments.

For algorithmic work, carrots and sticks work fine. If work involves no creativity or thought and could be carried out perfectly by a robot following an algorithm, rewards and punishments are effective. Carrots and sticks help motivate us to do things *we don't want to do.*

However if there is any opportunity for autonomy, mastery, or purpose in your work, behaviorism actually *hurts* motivation. Rewarding/punishing people turns play into work. Rewards extinguish intrinsic motivation, diminish performance, crush creativity, crowd out good behavior, encourage cheating, shortcuts, and unethical behavior, become addictive, and promote faster short term thinking.

The key is to take money off the table. Make sure your employees are paid a fair salary, maybe even a bit more than fair, so that money stops being the focus. This works with children too. Pay them their allowance and have them do their chores, but don't pay them *to do* their chores, or you're teaching them that chores are undesirable and shouldn't be done unless rewarded. Don't reward/punish them into compliance; challenge them into engagement. Help them develop autonomy and mastery through their work. Help them find purpose in their work.

Daniel H. Pink, the author, describes two studies that indicate motivation and success aren't as simple as the carrot and stick: (1) monkeys solved puzzles worse when rewarded with raisins, and (2) people performed worse when motivated with money.

Microsoft's encyclopedia (Encarta) was created by paid professionals and sold for a price. Wikipedia is created by hobbyists and enthusiasts and is delivered for free. Daniel claims that any rational economist back in 1990 would have said the one with paid professionals would succeed and the one run and authored by unpaid hobbyists would fail. Wikipedia prevailed and Microsoft's encyclopedia failed. There's got to be something more here than carrots and sticks.

*Other than whether the authors were paid, I think the fact that wikipedia is free also might have contributed to its success. But I see Daniel's point. I'd expect unpaid people to do far worse than well-compensated professionals.*

Daniel says "success would earn them nothing," referring to the unpaid authors and editors of Wikipedia. *I disagree. People get to feel like they are doing something constructive for the people they care about and humanity. Success earns them pride, expertise, and reputation to name a few.*

The operating system of our culture consists of things guiding our behavior, such as laws and norms. Operating system 1 (the wild): survive. Operating system 2 (first societies): reward and punishment. Maslow and McGregor said people have higher drives…operating system 2.1.

Wikipedia, Firefox, Linux, and Apache are good examples of very successful projects created by volunteers. These projects don't compensate their contributors with extrinsic rewards; they rely on intrinsic motivation. For instance, some contributors want to build reputation and skills. Some studies of open-source projects found that volunteers are attracted by the creativity of their work, the fun of mastery, and the chance to give a gift to their community.

Some organization types to note:

- L3C - modest profit; primary objective is to do good for the community
- for profit - typically driven by carrot and stick; short (*and medium*) term financial gain for shareholders
- non-profit
- "for benefit" or B organization - for the benefit of the community

Pink says that he learned that economics is the study of behavior, not the study of money. People do what's in their best interest. Economics thinks we are irrational for declining free money for the sake of dignity/revenge. Economics says we are irrational for leaving a high-paying job for a lower-paying job that helps with one's sense of purpose. *I think people are being rational; they are doing what makes them happy. Money isn't the be-all and end-all of happiness. People are driven by pain and pleasure, but pain and pleasure come internally as well as externally. This is my main takeaway from what Daniel has said thus far: business and economics used to only look at external motivation (pain and pleasure from the environment). Now they are realizing that intrinsic motivation drives people (pain and pleasure (pride, mastery, regret) from within).*

Controlling extrinsic motivation helps with algorithmic work, but hurts heuristic, creative work. A study found (ref?) that adding an extrinsic reward can dampen motivation and hamper creativity.

Operating system 2.0 (carrot and stick) assumes work is not enjoyable. However, creative work is enjoyable. Great example: vocation vacations, where people pay to work (e.g., as a chef or in a bicycle shop).

Companies need self motivated individuals. Daniel talked with a business owner who said "if you need me to motivate you, I probably don't want to hire you."

- extinguish intrinsic motivation
- First law: work is what you do that you are obligated to do, play is what you do that you are not obligated to do.
- Gives example of Mark Twain getting his friends to whitewash the fence.
- *rewards and wages can turn fun into drudgery, into work.*
- sawyer effect = hidden cost of rewards
- experiment: drawing
- subjects: children who liked to draw
- split into three groups:
- group that received an expected certificate for drawing
- group that received an unexpected certificate for drawing
- group that received nothing for drawing

- with the group that received the expected certificate, their interest in drawing dwindled.
- reason: *rewards remove autonomy* - "expected, contingent, if-then rewards snuffed out the third drive"

- "tangible rewards have a substantially negative effect on intrinsic motivation"

- diminish performance
- experiment: games (unscramble anagram, throw ball at target, ...)
- goal is to test reward on algorithmic performance; this is not a creative task
- three groups received either
- a day's pay
- a week's pay
- five months' pay

- the small- and medium-reward groups performed the same; the high-reward group did worse on nearly everything
- contingent incentives may hurt performance

- crush creativity
- experiment: candle task: get candle to stick against wall without touching table given a box of matches, a box of tacks, and a candle.
- three (or 2?) groups were told either
- that experimenters were just trying to establish norms, no reward
- $5 reward if the subject solved the puzzle in the top 25% quickest times
- $20 reward if first place, quickest time
- if it was two groups, the $5 and $20 rewards were combined into one group, so it was possible to win $25

- the $5 & $20 groups took 3.5 minutes longer to figure out the puzzle, on average, than the unrewarded "establish norms" group.
- conclusion: rewards narrow focus; subjects were not able to think outside the box and use the box to prop up the candle (rewards hurt creativity)

- another study involved (experimentally) blind judges rating the creativity of paintings from paid professionals and hobbyists. The judges said paintings made by paid professionals had the same level of technical skill, but significantly lower creativity.

- crowd out good behaviour
- experiment: study determining the impact of pay on blood donations
- paying for the blood led to fewer donations
- conclusion: took away the internal incentive of altruistic act

- encourage cheating, shortcuts & unethical behavior
- goals about mastery are healthy, they help us achieve more
- goals imposed by others (quotas, returns and test scores) should be used more carefully
- what goals do (intrinsic or extrinsic): narrow focus
- short term goals restrict view of broader impacts of behavior. Not always bad, but can be if used poorly.
- when orgs enforce quotas, employees do what is easiest in order to meet the bar.
- they over-charge customers and complete unnecessary repairs & work to meet quota

- ex: Ford was so set on a certain price, release date, and weight for the Pinto (short-term goals) that it neglected the car's safety
- problem: many choose shortcuts to reach extrinsic bars
- in contrast, with intrinsic goals, the only route is the high road
- impossible to shortcut successfully, because the only one disadvantaged is the self

- experiment: impose fine on parents if they pick up their child late from preschool
- obviously, if you impose a punishment, this should decrease the behavior
- but in reality, saw increase in frequency of parents picking children up late
- why?: parents had an internal desire to treat the teachers well, but the threat of a fine changed the parents' intrinsic moral motivation to not be late into a transaction: "I can buy extra time"

- become addictive
- cash rewards feel good at first, but over time need larger rewards and more frequent doses to get the same effect
- by offering a reward, you signal that the task is inherently undesirable
- contingent rewards make people expect the reward
- later, reward feels like status quo, need larger reward to entice.
- neuroscience: when we anticipate a reward, a surge of dopamine enters nucleus accumbens, just like with addictive substances

- promote faster short term thinking
- tangible rewards can focus us on the immediate reward and cause us to not think about the longer term
- can damage performance over time
- study: companies that prioritize quarterly earnings have significantly lower growth than those that don't. Why? They invest more into the quarter and less into R&D.
- when people are held to a quota, they won't exceed it. Quotas require continual payment; the behavior disappears if the incentive is removed.
- "greatness and nearsightedness are incompatible"
- "meaningful achievement depends on lifting one's sights and pushing to the horizon"

**mixing reward with creative and algorithmic tasks reduces internal motivation**

- experiment: candles (revisited)
- when tacks taken out of box so tacks and box were separate, the solution was obvious
- when the path was obvious, a carrot at the finish line encouraged them to gallop faster.
- paid group completed more quickly

- bonuses work as expected for mechanical tasks where no intrinsic motivation exists to undermine
- if the task is routine, mechanical, prescribed set of rules

- before trying external reward, try turning mundane work into play: increase variety, gamify, use it to master other skills (sawyer effect)
- when not possible, contingent rewards are an option
- when you employ a creative force to complete algorithmic tasks
- offer a rationale for why it is necessary / critical
- acknowledge that it is boring and that this is a rare instance where there will be contingent rewards
- allow people to complete the task their own way. Give them autonomy, freedom.

- payment is largely internal: emphasize autonomy, mastery and purpose
- people who do creative work still want to be paid. How to pay without seven deadly flaws of external incentives?
- experiment: paintings commissioned vs unpaid (revisited)
- when commissions were constraining (artists paid to perform with constraints imposed by employers), creativity decreased
- when commissions were *enabling*: creativity "shot back up" (to where the unpaid were?)

- do not offer contingent rewards for creative work
- eg: if you create a poster that brings more people to the event, you get a 10% bonus
- recipe for reduced performance

- baseline rewards must be sufficient:
- compensation must be adequate and fair. Fair compared to people doing similar work in similar organizations.
- workplace must be congenial
- employees must have autonomy, must have ample opportunity to pursue mastery, and daily duties must relate to a larger purpose

- if baseline met: "best strategy is to supply a sense of urgency and significance, and then get out of the talent's way"
- may offer reward carefully:
- essential requirement: any extrinsic reward must be unexpected, and offered only after the task is complete: take out to lunch, party
- they must not expect the reward, so they were not focused on obtaining it while working. You are simply offering appreciation.

- caveat: repeated "now-that" bonuses/prizes/unexpected rewards can turn into expected "if-then" entitlements
- consider non-tangible rewards: eg praise, positive feedback
- provide useful information: people thirst to know how they're doing
- useful, specific feedback about what was good



"type I and type X"

This chapter recalls a few dichotomous ways to characterize people, then Daniel introduces his own. Along with mentioning the dichotomous methods of characterizing people, Daniel brings up a few points to illustrate that internal motivation works better than external.

He talks about Friedman's type A and type B personalities then McGregor's theory X and theory Y.

McGregor's theory X and theory Y were new to me. Theory X says that people are lazy, need to be driven by management, and work solely for income. Theory Y says that people work to better themselves and are internally motivated, they don't need managers to drive them. I hope you can tell where Daniel went from here. One thing to note is that he keeps reiterating that the concepts from this book have been around for a long time, but businesses and management have not adapted to the new knowledge yet. Many of them still operate on theory X.

At the end, Daniel introduces his own behavior classification scheme: type I and type X. Type I is motivated internally and type X is motivated externally. Type X's are motivated by money, fame, and beauty. Type I's are motivated by autonomy, mastery, and purpose. He says just like with type A and type B, everyone is a bit of both and also says anyone can switch from type X to type I.

Compensation: For type X's "money is the table" whereas for type I's, enough money allows them to focus on what they really want (internal rewards).

I couldn't help but spin my own two cents into his discussion. I think there's another way to look at the motivation that drives behavior: short-term vs long-term reward. I witness a lot of teenagers and young adults who don't have any long-term goals. They esteem quotes like "live for the moment" and "live as if there's no tomorrow." Their focus and their motivation are always set on rewards obtainable in the next day, week, or possibly month. The lengthiest perspective is that some go to school so they can make more money in a few years, but most of the time they go to school because it's the norm and college is "the best years of one's life." No one (I've met) who thinks like this is considering what the impacts of their decisions will be when they are 40 or 70.

"Autonomy"

Gunther - employees aren't resources, they're partners.

ROWE - results only work environment

Good managers must resist the urge to control people. Instead, their job is to awaken the sense of autonomy in their employees (partners).

4 essentials/dimensions to autonomy

- task - control what you do
- time - control when you spend your time; *not* rewarded for billable hours
- technique - choose your method
- team - choose who you work with

People need freedom; if people were merely malleable they wouldn't resist being controlled so much. We have an inner need to feel like we control ourselves.

"Mastery"

Using carrot and stick leads to compliance. Using internal rewards (autonomy, mastery, purpose) leads to engagement. Compliance may get you through the day, but engagement will get you through the night (paraphrased quote).

Daniel spoke much of Csikszentmihalyi's research into happiness and what Csikszentmihalyi called autotelic (auto=self, telic=goal) experiences. Csikszentmihalyi later found out that the colloquial term for autotelic experiences is *flow.*

One essential for flow is that the work must be in the "goldilocks zone" in terms of difficulty for the subject. If the task is too easy, it fosters boredom. If it is too difficult, it creates anxiety. Flow-centric work environments try to help employees find tasks that are in this goldilocks zone—not too easy nor too difficult.

*I found the following notes helpful when trying to spell Csikszentmihalyi's name: carpediem101.com. Egil's notes and mind maps are pretty cool!*

Mastery involves finding these activities that put you into flow, where *the effort itself is the reward*.

- promote flow in workplace
- trigger reverse of sawyer effect
- turn work into play by maximizing autonomy and mastery

Flow happens in a moment, but mastery occurs over a lifetime. Flow is not sufficient for mastery, but is essential.

The 3 laws of mastery

- Mastery is a mindset
- Dweck - psychology professor at Stanford; 40 years of studying children and young adults
- Signature insight: "what people believe shapes what people achieve"
- Entity theory vs Incremental theory
- Entity theory = intelligence is fixed. Incremental theory = ultimately, with effort, intelligence can increase. Incremental theorists believe intelligence is analogous to strength: want to get stronger? Lift. Entity theorists believe intelligence is analogous to height: want to get taller? You're out of luck.
- If you believe intelligence is fixed, every encounter is a measure/performance evaluation of how much you have. Intelligence is something you demonstrate.
- If you believe intelligence is something that can increase, then the same encounters become opportunities for growth. Intelligence is something you develop.

- Goals come in two types: learning goals and performance goals.
- Learning French is a learning goal.
- Getting an A in French class is a performance goal.
- Study: Students with learning goals do significantly better applying knowledge to novel tasks. They work longer and try more solutions.

- Incremental theorists believe working harder is the way to get better; keep working in spite of difficulties.
- Entity theorists require a diet of easy successes. If you have to work hard, it means you're not very good, which leads to helplessness.

- Mastery is a pain
- Study: why do some West Point cadets succeed and others fail?
- best predictor is presence of character trait: grit
- perseverance and passion for long term goals
- mastery requires effort: difficult, painful effort, sustained over a decade

- moments of flow help us persevere. *What brings you into flow?*

- Mastery is an asymptote
- you can approach mastery, but you can never touch it
- you'll never get it, always grow, always learn
- the joy is in the pursuit more than the attainment

Flow is the oxygen of the soul. Csikszentmihalyi did a study where he asked people to identify what they like to do that they don't have to do, then asked them to do none of those things: only do what they have to do, not what they like to do. After just two days moods plunged and people showed signs of being psychologically ill. We need flow to survive.

"Purpose"

Purpose is the third leg of internal drive.

Around their 60th birthday, people typically have a big moment of reflection. They ask: "When am I going to do something that matters?"

MBAs' new oath after the 2000 mayhem: to serve beyond the bottom line.

Someone who studies workplaces has one key way to evaluate a workplace: is it a "they" workplace or a "we" workplace? This person (name eludes me) listens to whether the rank-and-file refer to the organization as "we" or "they". "They" give us requirements to comply with; "we" operate with purpose towards a goal.

"Type-I toolkit"

"What is your sentence?" If you are remembered by one sentence, what will it be? He was the one who ____. She developed ____. She raised a happy family and 4 successful children. What is your sentence?

Take a sagmeister (a sabbatical). Take a year off every 7 years instead of waiting until you're retired to vacation and develop purpose. *What about 3 months every 2 years? That sounds good to me.*

Give yourself a performance review. Where are you trying to go? What have you made progress on? What are your weak areas?

Practice != deliberate practice. People practice tennis once a week for their entire lives but do not become skilled like the professionals. Professionals deliberately work on their weak points, practice the mundane, and become masters through long, grueling, practice.

Ask yourself: what keeps you up at night? What gets you up in the morning? These are related to your purpose.

Rather than commanding employees on what to do, consider using lingo like "consider it", or "think about it".

The goal with compensation: take money off the table. People need to feel like they are paid fairly, maybe even a tad higher than average, and then no bonuses; money is off the table. Paying a salary a bit higher than average helps with turnover, talent, and morale.

"don't bribe into compliance, challenge into engagement." Think about the assignments you give your students. They must understand how they have autonomy, how the task builds mastery, and how it relates to the larger picture (purpose). If your assignment doesn't meet these, fix it.

"Praise effort and strategy, not intelligence" (Dweck's insight). When children do well, give them specific feedback about what technique was good, and praise their effort (this encourages mastery through hard work).

Praise in private. Praise is not a ceremony, it's an opportunity for feedback.

Don't offer false praise, kids can smell the insincerity.

Kids naturally want to learn and are curious. Educators should act as facilitators and mentors, not commanders and lecturers.

Unschoolers - kids choose what to learn and at the depth they want to learn it.

I joined Amazon a little over a month ago as a Software Development Engineer. Soon I will make a post about joining and my experience thus far, but for now let's talk about Analyticon!

Analyticon is an Amazon-internal conference about data analysis / data science. A week or so after I joined my team, my manager extended me an opportunity to attend. I find this remarkable: I'm a new SDE hire, but my manager knows enough about how I want to grow and cares enough about my career development to send me to a conference which doesn't directly correspond to my current job title. That's awesome! Without hesitation I accepted.

I wanted to spend my time as productively as possible, so before attending I set up some goals for how to spend my time and effort.

- Accelerate my growth toward becoming a professional data scientist at Amazon.
- Learn about desired proficiencies.
- Hear about others' experiences (to validate this direction is a good fit).
- What they do day to day.
- What they've done, how they've grown, what they work towards.

- Ask for and derive better direction for growth.

Like most conferences, there were several presentations. There was one presentation in particular I found very valuable. The presenter discussed Amazon's "Working Backwards Process" and simplicity as they relate to analysts.

Amazon's mission is to be "Earth's Most Customer-Centric Company." As such, we've developed the "Working Backwards Process" which starts with our customers' needs then works backwards to the products that can satisfy their needs. *On a side note, the sales book I just finished promotes this same perspective.*

- To satisfy our customers' needs, we need to know what their needs are.
- We start with what they need—not what they say they need, but what they need.
- Who is the customer? What do they do? What is the customer problem/opportunity? What is the most important customer benefit?
- What is the business objective/impact? What business decisions will be impacted?

- Sometimes we can get too concerned with our product and lose sight of the need the product is satisfying. To deliver solutions that satisfy needs, we need to focus on the customers' experience with the product, not on the product. What are they able to do? What are their pain points? What do they really need, and what is extra fluff?
- Like Weinberg says: nobody cares how great your product is, they only care about how your product can help them.

- What do you do with customers who ask for everything? These customers want to do your job: they want to get into the data and answer questions themselves. Mitigate by earning their trust: give them the answers/products they need, in the right format.

This presenter also focused on simplicity.

- Start with a simple, naive solution. Identify shortcomings. Improve. Iterate.
- Alex Sherman also said this in #8.
- "Done is better than perfect."

- Analysts and Scientists need to deliver results in a format that business people gain the most value from.

I had the opportunity to have extended conversations with a few professionals from Amazon. They offered advice and helped me develop some direction.

- Develop the Foundation
- "What separates the best from the mediocre?"
- A solid understanding of foundational subjects (eg statistics and probability) is very valuable in the practice.
- Probability
- Statistics
- Causal Analysis (used EVERYWHERE)
- Machine learning basics: test/train set, bias, sparse data
- Optimization
- Econometrics
- How is "X" influencing behavior?
- Wherever practical/possible, use a controlled experiment, not observed data.
- What makes it difficult: behavior X is self-selected, not as simple as a random experiment
- One common solution: match confounders using a propensity model
- "Mostly Harmless Econometrics" (book)

- Poor data scientists don't know how to abstract beyond the algorithm, they don't understand the fundamentals well enough.
- Many are proficient with tools, but they just apply the tools; they are just operating at the tech level.

- "Done is better than perfect"
- "What qualities (or lack of qualities) can hamper the success of a data scientist?"
- Some scientists are too focused on mathematical purity and perfect results instead of delivering value.

- Learn a Breadth
- Specialization is valuable, but so is a breadth of knowledge. Understanding a variety of fields becomes very valuable when designing solutions and working with others.
- Waiting to specialize allows you to see what's out there before you really dig down and become known for one thing.

- Talk and work with others
- "What could have accelerated your earlier success?"
- Step out of your environment/team and talk with other professionals.
- Change your environment (team/organization).
- Staying in one environment leads to blind spots. There may be better practices.

- Find a manager you have good chemistry with
- There are lots of different types of managers.
- Make sure your manager will help you develop in the direction you want to grow.

- SDE skills are valuable
- Data scientists increasingly have to write code that is closer to production-ready.
- They must work with teams of SDEs to implement models.

After attending the conference, here's my more developed plan (improved from "study and work on ML-related topics and projects"):

- Learn foundational subjects (above) at an undergrad to early grad level
- estimated time: 1-2 years

- Dive deeper into ML (I'm also interested in NLP and Optimization)
- SVMs aren't used much anymore
- I have a few book recommendations

- Keep connecting with professionals to develop direction and learn best practices.

Below are my notes for the final chapter in the OpenIntro statistics book!

On the books page I share some thoughts on the book as a whole.

The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#8.

**Multiple regression** fits a line to multiple variables (k of them) and one outcome variable. It does so by minimizing the **sum of squared residuals**. *See chapter#7 notes for a discussion on residuals*.

\(\hat{y} = \beta_0 + \beta_1x_1\ + \dots + \beta_kx_k\)

"While we remain cautious about making any causal interpretations using multiple regression, such models are a common first step in providing evidence of a causal connection."

"Two predictor variables are **collinear** (pronounced as *co-linear*) when they are correlated, and this collinearity complicates model estimation."

"**Confounding**: A situation in which the effect or association between an exposure and outcome is distorted by the presence of another variable. Positive confounding (when the observed association is biased away from the null) and negative confounding (when the observed association is biased toward the null) both occur" - PennState. Random assignments of a random sample of the population to control/treatment groups avoids confounders. Matching can remove the effect of known confounding variables, but there may be unknown variables that confound the studied effect.

Recall from the last chapter, \(R^2\) is the amount of variance in the response explained by the regression line:

\(R^2 = 1 - \frac{\text{variability in residuals}}{\text{variability in outcome}} = 1 - \frac{Var(e_i)}{Var(y_i)}\)

The **adjusted \(R^2\)** is a better estimate when using multiple regression:

\(R_{adj}^2 = 1 - \frac{Var(e_i)}{Var(y_i)} \cdot \frac{n - 1}{n - k - 1}\)

where n is the number of cases used to fit the model and k is the number of predictor variables used.
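A minimal sketch of the two formulas above, fit by least squares on synthetic data (the data and all names are illustrative, not from the book):

```python
import numpy as np

# Synthetic data (illustrative): n cases, k = 2 predictors.
rng = np.random.default_rng(0)
n, k = 50, 2
X = rng.normal(size=(n, k))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Fit by minimizing the sum of squared residuals.
A = np.column_stack([np.ones(n), X])         # prepend an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# R^2 and adjusted R^2, following the two formulas above.
r2 = 1 - np.var(residuals) / np.var(y)
r2_adj = 1 - (np.var(residuals) / np.var(y)) * (n - 1) / (n - k - 1)
print(round(r2, 3), round(r2_adj, 3))
```

Note that \(R^2_{adj}\) is always at most \(R^2\), since \((n-1)/(n-k-1) > 1\) whenever \(k \ge 1\).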

"Sometimes including variables that are not evidently important can actually reduce the accuracy of predictions."

**full model** - "the model that includes all available explanatory variables"

*Backward elimination* and *forward selection* are "two common strategies for adding or removing variables." They are referred to as **stepwise** model selection methods. **Backward elimination** starts with the full model and eliminates variables one by one until the \(R_{adj}^2\) can't be improved. **Forward selection** adds one variable at a time until the \(R_{adj}^2\) can't be improved. "There is no guarantee that backward elimination and forward selection will arrive at the same model."

Sometimes, instead of using \(R_{adj}^2\) to evaluate each model when doing stepwise model selection, people use the **p-value**. They do this because they are more interested in only including "variables that are statistically significant predictors of the response" than creating the model with the best predictive accuracy.
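The stepwise idea can be sketched in a few lines. Here is an illustrative (my own, not the book's) implementation of forward selection driven by \(R_{adj}^2\), on synthetic data:

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of a least squares fit with an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta
    return 1 - (np.var(e) / np.var(y)) * (n - 1) / (n - k - 1)

def forward_selection(X, y):
    """Add one predictor at a time while adjusted R^2 keeps improving."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], -np.inf
    while remaining:
        scores = {j: adj_r2(X[:, chosen + [j]], y) for j in remaining}
        j, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:     # no candidate improves the model; stop
            break
        chosen.append(j)
        remaining.remove(j)
        best = score
    return chosen

# Hypothetical data: y truly depends on columns 0 and 2 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=200)
print(sorted(forward_selection(X, y)))
```

Backward elimination is the mirror image: start from the full model and drop the variable whose removal most improves \(R_{adj}^2\), stopping when no removal helps.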

**Multiple-regression conditions:**

- "The residuals of the model are nearly normal"
- validate with normal probability plot (qq plot)

- "The variability of the residuals is nearly constant"
- validate with scatterplot of residuals vs fitted values

- "The residuals are independent" (eg: not time series)
- validate with scatterplot of residuals vs order of data collection
- "An especially rigorous check would use **time series** methods."

- "each variable is linearly related to the outcome"
- validate with box and whiskers (categorical) or scatterplot (numeric) of residuals vs each predictor variable
- this plot is also useful for checking for constant variability between groups of a categorical variable or regions of a numerical variable


**"All models are wrong, but some are useful" - George E.P. Box**. "Reporting a flawed model can be reasonable so long as we are clear and report the model's assumptions…If model assumptions are very clearly violated, consider a new model."

"Confidence intervals for coefficients in multiple regression can be computed using the same formula as in the single predictor model:"

\(b_i \pm t_{df}^*SE_{b_i}\)
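A sketch of this formula in code. The coefficient, standard error, and degrees of freedom below are illustrative values I chose (df = n - k - 1 with an assumed n = 50, k = 1), and scipy is assumed to be available:

```python
from scipy import stats

# Illustrative fitted coefficient b with standard error SE,
# and df = n - k - 1 degrees of freedom (n = 50, k = 1 assumed).
b, se, df = -0.0431, 0.0108, 48

t_star = stats.t.ppf(0.975, df)          # critical value for a 95% CI
ci = (b - t_star * se, b + t_star * se)
print(round(t_star, 2), [round(v, 4) for v in ci])
```

Since the whole interval here sits below zero, the illustrative coefficient would be significantly negative at the 5% level.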

**logistic regression** - "a tool for building models when there is a categorical response variable with two levels"

"Logistic Regression is a type of generalized linear model (GLM) for response variables where multiple regression does not work very well. In particular, the response variable in these settings often takes a form where residuals look completely different from the normal distribution."

"**GLM**s can be thought of as a two-stage modeling approach:"

- "model the response variable using a probability distribution."
- "model the parameter of the distribution using a collection of predictors and a special form of multiple regression"

"The outcome variable for a GLM is denoted by \(Y_i\) where the index i is used to represent observation i."

The **logit transformation** maps probabilities in (0, 1) to the whole real line (-inf, +inf). Working on the logit scale lets us use a linear model, whose output can be any real number, yet still end up with a probability that can't exceed 1 or fall below 0.

\(logit(p_i) = ln(\frac{p_i}{1 - p_i})\)

\(ln(\frac{p_i}{1 - p_i}) = \beta_0 + \beta_1x_{1,i}\ + \dots + \beta_kx_{k,i}\)
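A minimal sketch of the transformation and its inverse (function names are my own):

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse (logistic) function: map any real x back into (0, 1)."""
    return 1 / (1 + math.exp(-x))

# A linear predictor eta can be any real number; inv_logit turns it
# into a valid probability.
eta = -2.5   # hypothetical value of beta_0 + beta_1 * x_1 + ...
p = inv_logit(eta)
print(round(p, 4))
```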

**Conditions for logistic regression**

- "Each predictor \(x_i\) is linearly related to \(logit(p_i)\) if all other predictors are held constant."
- **natural splines** "are used to fit flexible lines rather than straight lines."
- "if the logistic model fits well, the curve should closely follow the dashed y = x line"
- the figure they used to assess has predicted probability on the x axis [0.0, 1.0] and truth on the y axis {0, 1}. If the linear assumption is true, the splines line should approximately follow the y = x line.
- an intuition for the splines line:
- segment the x axis into 100 segments, each segment represents a non-overlapping percent of the predicted probability (i.e. [0.00,0.01],(0.01,0.02]...(0.99, 1.00])
- for each segment, calculate the percent of successes. Ex: if there are 10 observations between 0.04 and 0.05 and 3 of them are successes, the percent of successes is 30%.
- for each segment, plot a point at the percent of successes. Continuing from the last example, at point (0.045, 0.3) we would plot a point.
- connect the points with a curvy line
- the spline algorithm is almost certainly different from this, but it gives an intuition for what the spline line looks like.

- "Each outcome \(Y_i\) is independent of the other outcomes."
- scatterplot of the residuals vs the variables
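The binning intuition described above can be sketched directly. This is my own construction (synthetic, well-calibrated predictions; the bin count is reduced to 10 for readability): when the model fits, the observed fraction of successes in each bin should track the bin's predicted probability, i.e. follow y = x.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: predicted probabilities, and 0/1 outcomes that
# actually follow those probabilities (a well-calibrated model).
p_hat = rng.uniform(0, 1, size=20000)
y = (rng.uniform(0, 1, size=20000) < p_hat).astype(int)

# Segment the predicted-probability axis into bins and compute the
# observed fraction of successes within each bin.
bins = np.linspace(0, 1, 11)            # 10 segments
idx = np.clip(np.digitize(p_hat, bins) - 1, 0, 9)
observed = np.array([y[idx == b].mean() for b in range(10)])
centers = (bins[:-1] + bins[1:]) / 2    # points on the y = x line
print(np.round(observed, 2))
```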

We can "use transformations or other techniques that...help us include strongly skewed numerical variables as predictors."

While looking up a good probability book to read, I came across Aidan Lyon's Philosophy of Probability. It was interesting and written in language that was easy to read (not a lot of jargon). I jotted down some brief notes about what I found interesting or relevant below.

The two questions Aidan explores:

- "What is the correct formal theory of probability?" Kolmogorov's axioms (standard) or alternative axioms?
- "What do probability statements mean?" Do probabilities exist "out there" (frequencies, propensities, …) or are probabilities "subjective degrees of belief?"

First Aidan makes clear that probability is used in many branches of science. Probability is not just used in theoretical subjects like math and statistics, but fields such as biology and quantum mechanics also heavily rely on the theory of probability. His point: these discussions are relevant as they influence a lot of science.

There are two kinds of probability

- absolute or unconditional probability, P(A)
- conditional probability, P(A, B) or P(A | B), both mean "probability of A given B"

Ω is the set of all elementary events. For instance, if we were rolling a 6-sided die, Ω = {1, 2, 3, 4, 5, 6} (one of 1, 2, …, 6 is rolled whenever the die is rolled).

\(\mathcal{F}\) is the set of all sets of events that can be constructed from Ω. \(\mathcal{F} = \{\varnothing, \Omega, \{1\}, \{2\},\dots, \{1, 2\},\dots\{1, 2, 5, 6\},\dots\}\)

"closed under Ω-complementation": If A is in \(\mathcal{F}\) then so is its complement, Ω\A. (Ω\A means the complement of A). Ex: if A = {3, 5, 6} then Ω\A = {1, 2, 4} is in \(\mathcal{F}\).

"closed under union": If any two events are in \(\mathcal{F}\), then so is their union. Ex: {1, 2} and {3, 4} are in \(\mathcal{F}\), then so is {1, 2} ∪ {3, 4} = {1, 2, 3, 4}.

If a set is both closed under Ω-complementation and closed under union, then that set "is an **algebra** on Ω"

If a set is an algebra, then it follows that it is "closed under intersection" (you can't intersect two sets in the algebra to create a set that is not already in the algebra).

\(\mathcal{F} = \{\varnothing, \Omega, \{1, 3, 5\}, \{2, 4, 6\}\}\) is an example of an algebra.
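The closure properties are mechanical enough to check in code. A small sketch (my own construction) that verifies the example algebra above:

```python
from itertools import combinations

omega = frozenset({1, 2, 3, 4, 5, 6})
F = {frozenset(), omega, frozenset({1, 3, 5}), frozenset({2, 4, 6})}

# Closed under Omega-complementation: for every A in F, omega \ A is in F.
closed_complement = all(omega - A in F for A in F)

# Closed under union: for every pair A, B in F, their union is in F.
closed_union = all(A | B in F for A, B in combinations(F, 2))

print(closed_complement, closed_union)
```

Checking closure under intersection the same way (with `A & B`) would also pass, as the text notes it must for any algebra.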

Kolmogorov's axioms:

- (KP1) P(A) ≥ 0
- probability of some event A happening is ≥ 0

- (KP2) P(Ω) = 1
- the probability that some elementary event happens is 1

- (KP3) P(A ∪ B) = P(A) + P(B), if A ∩ B = ∅
- probability of A or B happening is the probability of A + the probability of B if A and B share no elementary events.

Any function that satisfies these constraints is a probability function.

Any Ω, \(\mathcal{F}\), and P that satisfy these constraints are together called a probability space.

When \(\mathcal{F}\) is countably infinite, use KP4 instead of KP3.

- (KP4) \(P(\bigcup\limits_{i=1}^{i=\infty} A_i) = \sum\limits_{i=1}^{i=\infty} P(A_i)\)
- The probability of the union of infinitely many events is equal to the sum of their probabilities, given that none of the events \(A_i\) share elementary events (e.g. 1, the roll of a die).

"This fourth axiom—known as countable additivity—is by far the most controversial."

Bruno de Finetti's example: what if we have a countably infinite set where all events have an equal probability?

- if the probability of each event is small and positive, we break axiom KP2 because the sum of an infinite amount of small positive numbers is infinitely large, not one.
- if the probability of each event is 0, we break axiom KP2: summing an infinite number of zeros gives zero, not one.

Another problem with this set of axioms is it defines absolute/unconditional probabilities as the basic units, and derives conditional probabilities in terms of absolutes:

- (CP) \(P(A, B) = \frac{P(A \cap B)}{P(B)}\)
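A small worked example of the standard formula \(P(A, B) = P(A \cap B) / P(B)\), using one roll of a fair die (my own example):

```python
from fractions import Fraction

# One roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}          # the roll is even
B = {4, 5, 6}          # the roll is greater than 3

def P(event):
    """Probability of an event when elementary events are equally likely."""
    return Fraction(len(event), len(omega))

# Conditional probability of A given B.
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)    # Fraction(2, 3)
```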

Emile Borel's example of why defining conditionals in terms of absolutes is a poor choice: what is the probability that a point lies in the western hemisphere, given that the point lies on the equator? Intuitively the answer is 1/2, but under the theory the answer is undefined, because the probability of a point lying on the equator is 0.

One solution is to define absolute probabilities in terms of conditional probabilities, using conditionals as the basic unit of probability.

Sometimes absolute probabilities such as P(A ∩ B) and P(B) are undefined, but P(A, B) is defined.

Example by Alan Hajek: what is the conditional probability that a coin comes up heads, given that I toss the coin fairly? Surely the answer is 1/2, but you have no information with which to determine the probability that I toss the coin fairly. In Kolmogorov's system, the answer is undefined since P(B) is undefined.

In classical terms, the probability of an event is the number of ways the event can occur divided by the total number of equally possible outcomes.

There's a problem that occurs when this definition is used together with the Principle of Indifference.

The principle of indifference states that if you have n mutually exclusive events which are indistinguishable except by name, then each event should be assigned probability 1/n.

Aidan gives an example with boxes.

Suppose a machine randomly makes cube boxes with a side length between 0 and 1 foot. Let's say we make two events:

- the probability that the machine makes a box with a side length of 0-1/2
- the probability that the machine makes a box with a side length of 1/2-1

Then the principle of indifference says that we should assign both events the same probability, 1/2. Sounds reasonable.

Now forget about side length for a moment. Suppose we have the same machine, which randomly makes cube boxes with a side's surface area between 0 and 1 square foot. Let's say we make 4 events:

- the probability that the machine makes a box with a side's surface area of 0-1/4 ft squared
- the probability that the machine makes a box with a side's surface area of 1/4-1/2 ft squared
- the probability that the machine makes a box with a side's surface area of 1/2-3/4 ft squared
- the probability that the machine makes a box with a side's surface area of 3/4-1 ft squared

The principle of indifference says that we should assign all four events the same probability, 1/4. Sounds reasonable.

But now let's look at both examples together. In the side-length example, we said the probability of making a cube box with a side length of 0-1/2 was 1/2. In the surface-area example, we said the probability of making a cube box with a side's surface area of 0-1/4 was 1/4. Geometrically, these are the same event (surface area is side length squared), but using the principle of indifference and seemingly equally likely events, we came to two different conclusions about what the probability of the event should be.
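The clash can be seen numerically. A quick Monte Carlo sketch (my own): sample the box either with uniform side length or with uniform surface area, and ask in both cases for the probability of the same event, side length ≤ 1/2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Parameterize by side length: draw lengths uniformly on (0, 1).
side = rng.uniform(0, 1, size=n)
p_by_length = np.mean(side <= 0.5)

# Parameterize by a side's surface area: draw areas uniformly on (0, 1).
# side <= 1/2 is exactly the event area <= 1/4.
area = rng.uniform(0, 1, size=n)
p_by_area = np.mean(np.sqrt(area) <= 0.5)

print(round(p_by_length, 3), round(p_by_area, 3))
```

The two "uniform" assumptions assign the same geometric event probabilities near 1/2 and 1/4 respectively, which is the paradox.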

There are some alternative views of probability that try to deal with this issue.

**Finite frequentism**: the probability of event A occurring is the number of outcomes where A occurs divided by the number of trials in the experiment. A problem with this lies in the number of trials: if we have 1 trial, the probability of A occurring is either 0 or 1; with 2 trials, it is either 0, 1/2, or 1; etc.

**Hypothetical frequentism**: the probability of event A occurring is the number of outcomes where A occurs divided by the number of trials, if we were to have an infinite number of trials.

**The propensity view** (Popper): the probability of event A occurring is not the frequency, but the tendency, disposition, or propensity for A to occur.

**The subjective view**, motivated by Dutch book arguments: essentially, the probability of A is what a rational person believes the odds of A occurring are.

This post contains my notes for the book "New Sales. Simplified: The Essential Handbook for Prospecting and New Business" by Mike Weinberg.

Amazon: https://www.amazon.com/dp/0814431771

Sales is about understanding customers' needs and showing them how you can fulfil their needs (with your product/service).

Most salespeople are afraid of prospecting—finding new customers.

The purpose of meeting with a prospect is not to convince them how great your company or product is. The purpose is to identify what the customer's needs are and to help connect them with your product, if it will fulfil their need.

Salespeople need to look for and expose the clients' pains and needs. Sales is more about asking questions and listening than about talking or convincing.

Articulating value is the salesperson's job.

The salesperson's perspective: Salespeople are problem solvers/value creators. Clients are benefited by talking with the salesperson. Clients can be helped by the product, and the salesperson helps the client realize exactly where and how through questions and discussion.

"If you had a magic wand, what would you change?" -- a tool for exploring customers' pains

The **sales story** is the most important sales tool.

The sales story answers:

- what's in it for the client
- why choose you over competitors
- briefly, who you are and what is your product

Building blocks of sales stories, *in order:*

- Client issues addressed
- pains removed, problems solved, opportunities enabled
- communicate *what's in it for them*

- Offerings
- *simply and concisely*: what we sell (services, solutions, products)

- Differentiators
- why we are better than and different from others

The **power statement** is Mike's single page sales story. It is his replacement for the sales/elevator pitch.

- 1-2 sentence headline describing who we are, who we serve, and what we do. It gives context and allows the customer to classify you.
- A strong hook to transition into client issues addressed. Example, clients <like you> use <us> when...
- List of client issues addressed. These are the five or so most prevalent pains, problems and opportunities your product addresses.
- Offering: a *brief* description of your product (one or two sentences).
- Transition line into differentiators. Example: our product is different and better than what you can find in the market because...
- Differentiators: five or so

Mike stresses a lot:

- Discovery precedes presentation. This means you can't pitch a product to someone until you figure out who they are and what their needs are. You should know about the person you are selling to (e.g. why did they accept the meeting, what are they motivated by), their organization, and about their buying process (e.g. who makes decision to buy).
- Nobody wants to hear about you or your product. Nobody cares how great you or your product is.
- Prospects want to know: what's in it for them. They want to know how you can help them.

Salespeople must, for every interaction, have a clear goal and benefit for the customer. Salespeople must make it clear they care about the customer. They must not seem completely self-focused.

I have been told multiple times that sales is a skill everyone can benefit from. People don't just sell products and services, they sell themselves and their work. When you interview, you are selling yourself. When you are pitching a new project or feature, you must sell it to the stakeholders. When you offer an idea or suggestion and want others to accept it, you must sell it. I wanted to learn about how to sell because I wanted to be more effective at these sorts of interactions.

I chose this book in particular because it

- seemed to be more about philosophy and not to be full of gimmicks
- sold itself as a comprehensive new perspective for salespeople at any level
- had good reviews on Amazon

I wasn't really sure how salespeople sold, but I thought that sales was about convincing people to buy your product; I thought sales was about convincing people that your product is valuable.

In one sense, sales is ultimately geared towards doing that, but Mike's suggestion is to **turn the focus away from your product, and instead turn it towards the customer**. Instead of convincing people that my project is great because of x, y and z, Mike suggests I talk with them about their needs, uncover their pains, and bring up only what is most relevant to them (if we are even a fit!)

If this is the way salespeople really acted, I'd not have so much of a guard against sales. I see marketing/sales as a bunch of psychological tricks geared toward getting people to want and buy something that won't make them happy. If a salesperson really tried to understand my needs and wants and sold me something that improved my life, I'd feel good about listening to them and buying from them.

I don't want to be a professional salesperson, but as I continue to grow, I will need to sell myself, my work, and my team. I will need to do so, not by focusing and talking about my strengths, but by discussing and asking about the customer's wants/needs. Once I know what they care about, I can connect what I have and what I can do with what they need.

*For this book, my intent was to develop some intuition and understanding, not to master the subject, so my notes were slim. Typically I will underline and take notes while reading, but did not do so for this book because I listened to this book rather than reading it.*

Weinberg defined **sales** as (paraphrase) understanding people's needs and helping them fulfil their needs. It's the salesperson's job to help people get what they want/need.

He says the primary problem with sales is that salespeople, especially in today's age, are afraid of **prospecting**—going out and finding new customers. With the advent of social media, the 2000's boom, and sales-related software, salespeople have not felt the need to prospect. They had clients come to them. However a successful salesperson is one who goes out and finds new customers and meets their needs.

He also says that sales managers used to be more mentors than managers. They used to teach their team how to sell and prospect; now they just tell them to update CRM records and the like.

This chapter is about "the not so sweet" 16 reasons salespeople fail at prospecting.

The reasons that stuck out to me the most follow. These items stuck out to me because they are relevant to my work (not just sales) and I see value in them. Some of them I am already good at avoiding, some of them I can use improvement in.

- They're always waiting. Waiting for new leads to be given to them, waiting for the company to market, etc. Top performers act, not wait. They are proactive.
- They are prisoners of hope. They stop working for new leads because they are hopeful that some leads they have already worked on will close.
- They can't tell the story.
- They have awful target account selection and a lack of focus.
- They are too busy being good corporate citizens
- They don't use and protect their calendar
- They stopped learning and growing

Purpose of meeting/call is to find pain/need. More about listening than talking.

The following bullets are questions that Mike mentioned. These are questions a provider should be able to answer about her clients. They help us discover who our target customers are. Later at the end of the book, Mike says we can talk to our current customers and ask them questions related to these, like why they chose us and continue to do business with us.

- Who are our best customers?
- Why did they buy from us?
- Why do they continue to buy now?
- When and why do customers choose us over competitors?
- Who used to buy from us but doesn't anymore?
- Why did we lose their business?
- Who almost became a customer but didn't?

Success is not about working hard; it's about **"tipping the needle"**. Focus on the clients who are most influential to your success:

- the largest clients (in terms of current money spent)
- the clients who are most likely to buy more (growable)
- the clients who are most at risk (of leaving)

In general this is a message of prioritize your efforts to pay the greatest returns.

Most important sales tool is the **"sales story."** Nobody wants to hear about you, they want to hear about how you can help them. Typical story is about the seller "we do this, we are great, bla bla bla."

Begin by talking about pains of client/benefits to client.

Differentiation is key, it's what gets clients to listen, creates intrigue, and justifies premium price.

A premium price requires a premium story.

Articulating value is the salesperson's job. Salespeople are problem solvers/value creators; they enter with confidence when they have a good sales story because they feel clients need them.

Note to self: *So selling yourself and your services, then, is not about saying your strengths, it's about connecting your abilities with the client's/employer's needs and wants.*

Focus on the customer and what your product can do for them, not on your product and its greatness.

Must pass the *so what* test. Talk about what matters to them.

Building blocks of sales stories, *in order:*

- Client issues addressed
- pains removed, problems solved, opportunities enabled
- communicate what's in it for them

- Offerings
- *simply*: what we sell (services, solutions, products)

- Differentiators
- why we are better than and different from others

The **power statement** is the single page sales pitch/elevator pitch/ etc.

- start with a 1-2 sentence headline to give context and allow customer to classify you
- who we are, who we serve, what we do

- use strong hook (<people like you> use <us> when..)
- list of client issues addressed (pains, problems, opportunities)
- *brief* description of offered service
- transition: our product is different and better than what you can find in the market because...
- 5 differentiators

Power statement is internal, not a handout.

Discovery precedes presentation. Don't "show up and throw up". Presenting != sales. Don't talk a disproportionate amount of the time. It's a dialogue.

Each point is intentional, customer focused... How can you help them solve their problems or improve?

Think of yourself as a doctor fixing issues. You need the client to trust your competence, so give a brief statement of competence, then figure out how you can help your patient. DON'T spend an hour talking about yourself; focus on the patient.

When you win, don't act like it's your first completed sales call; don't bow down with tons of gratitude. It's okay to thank them for their time, but remember they should be getting at least as much value out of this sale as you are.

Salespeople have inherited negative stereotypes. Know how people perceive you. Do you talk too much?

The typical salesperson's motivation for calling is completely self-serving, and it comes off that way. You must have a second goal for each interaction, one for the customer, and make it clear you care about them. You must have an attitude of mutual benefit.

Successful salespeople don't have anger for leads and resentment for those who turn them down. They like leads and want to help them.

No one cares how smart/cool you think you are, or how great your product is; they want to know what you can do for them.

Even if asked to present, you have no business presenting if you don't know the customer's situation.

Ask probing questions (after power statement).

- Personal goals (what motivates you?),
- organization problems
- sales questions (who makes the decisions, how confident are you that this is the best solution, ...)

Slides can be helpful, keep it to 4 at beginning of call:

- title
- suggested agenda
- customers choose us because...
- what we know about your position

Then...

- confirm the assumptions in slide 4, dive deep, ask the most senior person to prioritize needs
- this is where the dialogue goes full blown

The last 3 chapters of this book were disappointing. Mike listed a bunch of suggestions in chapter 15 without diving deep into them; they were just a bunch of disorganized tips. In chapter 16 he spent several minutes trying to sell Southwest Airlines to his readers. Chapter 17 was the legitimate wrap-up, but by the time it came around he had already partially summarized several times.

The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#7.

Linear regression should only be used when the data appear to have a linear relationship.

"A **'hat'** on a *y* is used to signify that this is an estimate." \(\hat{y}\) is the estimate or predicted value for y.

"**Residuals** are the leftover variation in the data after accounting for the model fit: Data = Fit + Residuals."

**residual** - "the vertical distance from the observation to the line." If the point lies above the line, the residual is positive, if the point is on the line, the residual is 0, and if it is below the line, the residual is negative.

**residual plot** - plot a horizontal line. For each point, plot the point at its original x location along the horizontal line, but plot its height as the residual value. So if a point has a residual of +2, it is two units above the residual line.

"**Correlation**, which always takes values between -1 and 1, describes the strength of the linear relationship between two variables. We denote the correlation by \(R\)."

**least squares regression** minimizes the squared residuals.

**conditions for the least squares line**

- Linearity
- Nearly normal residuals
- Constant variability (doesn't show more or less variation at different areas of the plot)
- Independence of observations. ("Be cautious about applying regression to time series data")

"The slope of the least squares line can be estimated by:"

$$ b_1 = \frac{s_y}{s_x}R $$

"where R is the correlation between the two variables, and \(s_x\) and \(s_y\) are the sample standard deviations of the explanatory variable and the response, respectively."

"The point \((\bar{x}, \bar{y})\) is on the least squares line."

Point-slope form:

$$ y - y_0 = \text{slope} \cdot (x - x_0) $$
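Putting the two facts together on made-up data: compute the slope as \((s_y/s_x)R\), then get the intercept from the fact that the line passes through \((\bar{x}, \bar{y})\). The data here are my own illustration:

```python
import numpy as np

# Hypothetical paired data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

R = np.corrcoef(x, y)[0, 1]
b1 = (np.std(y, ddof=1) / np.std(x, ddof=1)) * R   # slope = (s_y / s_x) R
b0 = y.mean() - b1 * x.mean()                      # line passes through (x̄, ȳ)

print(round(b1, 3), round(b0, 3))
```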

When using statistical software to fit a line to data, a table like the one below is generated. I copied this table from chapter 7 in the book. This table models the amount of student aid a student receives as a function of their family's income. The units of estimate and standard error are in thousands (so first cell is 25.3193 * 1000 dollars).

```
-----------------------------------------------------------
                 Estimate   Std. Error   t value   Pr(>|t|)
-----------------------------------------------------------
(Intercept)       25.3193       1.2915     18.83     0.0000
family_income     -0.0431       0.0108     -3.98     0.0002
-----------------------------------------------------------
```

The first row is the intercept of the line. The intercept row holds data for the output variable when all other variables are 0.

The second row is the slope of the line.

The first column is the estimate. When `family_income` is 0, the output is 25.3193 (the intercept). For each unit family income increases, the output decreases by 0.0431.

The third and fourth columns are the t-value and two-sided p-value under the null hypothesis that the true value (intercept or `family_income` slope) is 0.
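As a small sketch, the coefficients in the table define the fitted line, which can be used directly for prediction (the $50,000 example below is my own, not the book's):

```python
def predicted_aid(family_income):
    """Predicted aid in thousands of dollars, given family income
    in thousands of dollars, using the table's fitted coefficients."""
    return 25.3193 - 0.0431 * family_income

# A family earning $50,000: 25.3193 - 0.0431 * 50 (thousands of dollars)
aid_50k = predicted_aid(50)
```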

**extrapolation** is "applying a model estimate to values outside the realm of the original data…If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed."

"The **R ^{2}** of a linear model describes the amount of variation in the response that is explained by the least squares line."

An **indicator variable** is a binary variable. It is equal to 1 if the thing it represents is present, otherwise 0.

A **high leverage** outlier is a point that falls far away from the center of the cloud of points.

"If one of these high leverage points does appear to actually invoke its influence on the slope of the line…then we call it an **influential point**. Usually we can say a point is influential if, had we fitted the line without it, the influential point would have been unusually far away from the least squares line."

**Don't remove outliers without a very good reason.** "Models that ignore exceptional (and interesting) cases often perform poorly." The answer to "Guided Practice 7.24" in this chapter suggests it's okay to remove outliers when they interfere with understanding the data we care about. This example removed two points that occurred during the Great Depression when modeling voting behavior over the last century. These two Great Depression points would have been influential on the model, but we don't care much about modeling voting behavior during the Depression.

- Listening to audio books and podcasts is a great way to be productive when carrying out maintenance tasks (e.g. cooking and shopping).
- I've been listening to 99% Invisible and SaaStr. They are focused, insightful podcasts.

If you have any recommendations for educational podcasts or audio books, please send me an email!

I dislike maintenance tasks (chores) because they feel unproductive. When I am building and designing, I am learning. When I am washing dishes for the thousandth time, I feel I am not becoming better in any way. Like many other tasks we must do, washing dishes is not an activity of personal growth.

You can be productive when performing these repetitive, mindless tasks that inherently lack any dimension of personal development. You can be productive by listening to audio content (audio books or podcasts). Whether you are picking up kids, delivering pizza, pulling weeds, or cooking dinner, you can continue building yourself into a happier, more successful person by learning through audio.

Driving causes me mental discomfort because I feel unproductive. To reclaim this unproductive time and turn it into an opportunity for personal growth, I've been listening to the following two podcasts on my commutes. One thing I really like about these podcasts is that they don't waste your time. Each episode has a topic and the hosts stick to it. I dislike many other podcasts I have tried because the hosts blab about their day or spend a lot of time attempting to convince you to do something for them.

SaaS is an initialism that means "Software as a Service." SaaStr is a podcast that interviews tech leaders around the world and asks them what they have learned and what makes them successful.

One recurring topic in SaaStr, especially the later episodes^{1}, is *Customer Success*. The host asks many leaders how they build a successful company, and many of them state it's critical to deliver value to the customer--what the *customer* values. You can have a great product, you can treat people really well, and customers can be delighted to talk with you, but unless you deliver what the customer values, they won't show they value you by giving you their money and time. This topic relates strongly to Amazon's principle of *Customer Obsession*. Rather than focusing on our product, competition, or wallets, we need to focus on the customer and helping them get what they want/need.

SaaStr doesn't teach a salesperson how to sell; it's about what makes software companies successful.

*Thank you Brian Sallee (an individual I worked with at Dozuki) for recommending this podcast to me. SaaStr has many insightful episodes to listen to.*

^{1} at the time of writing this post, there were 54 SaaStr episodes.

99% Invisible is about cool things that people typically don't know about.

One episode I can recall right off the bat was about Taipei 101. They talked about the tuned mass damper and how the engineers turned this technological necessity into a public attraction. Rather than hiding the damper like most towers do, the architects displayed the damper and people love it! There are even "Damper Babies" (google it!). I enjoyed this episode because in engineering, sometimes there are technical necessities that get in the way of our beautiful designs. Like the tuned mass damper in Taipei, I think we can take some of these ugly necessities and turn them into beautiful solutions that fit the requirements.

There was also an episode about a building in New York that was especially vulnerable to corner winds due to its design. Typically buildings are strongest at their corners, but because this building had supports between the center and the corners, and the engineers decided to use bolts instead of welds, the building was at huge risk of falling in a strong storm. A female architecture student studied this building for a school report, and her inquiries to the building's head architect unveiled the flaw. Without her being confident and inquisitive enough to question the professional who designed it, this tower could have caused a disaster. This story is a good anecdote for why it's valuable for everyone to have the freedom to question those around them and those at the top. There are many stories of nurses too afraid to question doctors, or military personnel too fearful to question their commanding officers, and because of this fear there have been dreadful consequences, including death. Amazon has learned from experiences like these, and it encourages employees at all levels to question freely and disagree with even the most elite in the company.

I am searching for more audio books/podcasts to help me build better soft skills or non-technical skills. I think audio won't be very effective for technical books (for example, audio alone would be ineffective at communicating formulas), but is a sufficient medium for books about communication, influence, history lessons, sales, marketing, management, etc.

As an example of a topic I want to learn about via audio, I am interested in listening to a salesperson book/podcast soon. I have heard from many successful people that sales is essential not just for selling products, but for selling yourself. You need to learn how to show people that you or *your work is what they need rather than hoping they will discover it for themselves*. Additionally, sometimes I observe myself and others wanting to help people, but ineffectively persuading them to accept our help. These people we try to help continue suffering through problems that seem easily surmountable. If I could influence/persuade/sell better, I could help people more and change my environment for the better.

If you have any audio books or podcasts to recommend, please send me an email: josh@joshterrell.com

Today I watched a YouTube video of Alex Sherman discussing ten things he wishes he had known at the start of his data analysis career.

Watch it on YouTube: https://www.youtube.com/watch?v=e0Q7SIj2y4I

- Be Modest
- you're going to be wrong more often than right

- Business Significance > Statistical Significance
- help make decisions
- "Show me the money" (what is the relevance for the client? how does your work help them make more money?)
- Increase share of current market
- Capture more consumer surplus
- Grow overall demand
- Reduce costs

- Use analysis to reduce risks
- focus on the analyses that matter--decisions with high-value risk and low existing certainty

- Prefer Vaguely Right (little confidence) over Precisely Wrong (high confidence)
- Measuring what you want to measure is difficult. Analysts often care too much about statistical significance, so they reduce the problem to measuring something they can measure precisely. However, the things they can measure precisely are often not informative or are too expensive to measure. It is better to measure something weakly and provide some valuable insight than to measure something perfectly and provide no insight.

- Porpoise, don't boil the ocean
- don't try to look at everything
- think about issues, come up with a hypothesis, dive deep (look at the data), come back up, ask if I'm proving or disproving my hypothesis, go back down into data, repeat.

- Correlation is not Causality
- Close the Loop
- get all the data you need across the entire process
- how do you get it, even if it seems hard/impossible?
- maybe you don't need a lot?

- Behavior > Attitudes > Demographics > Nothing
- for targeting users.

- There are only 3 ways to identify someone's segment
- let people choose their own segment/option
- can only have a few options

- Sales Force Qualification
- "If the customer keeps bringing up rate and does not show interest in the above questions, then classify as 'price sensitive'"
- service sensitive or price sensitive
- ask questions to segment customer
- what questions help segment?

- Data Mining
- use all data of customer to segment

- let people choose their own segment/option
- Learn, do; learn, do; speed matters more than precision
- Get it vaguely right. Iterate.

- Focus on outputs, not tools
- "People don't want to buy a quarter-inch drill. They want to buy a quarter-inch hole!"
- they don't care about the tool, they care about their decision, need, problem

- Communicate Clearly
- Effective (and efficient) reporting is concise and clear and uses:
- A 30 second elevator speech
- what issues are you addressing
- what is your current hypothesis (the answer)
- what are your next steps

- one page executive summary
- synthesis is relevance and conclusion for client

- 15-25 page document
- executive summary, background, point A, 3 charts, point B, 3 charts, point C, 3 charts, next steps, appendix

- extensive appendices

Focus on customer's issue!

Alex recommends "The Pyramid Principle" by Barbara Minto, which is about written communication.

I don't disagree with any of the points he made, but the following points are the ones that resonated the most with me:

(1) Be Modest

- One important value I hold is skepticism. Often I have seen myself be wrong about someone or something, but because my beliefs were too strong, I incurred some negative consequence.

(2) Business Significance > Statistical Significance

- While not something I have a lot of experience in, I can see how the desire for beautiful mathematics could get in the way of business value. My girlfriend's dad always stresses the importance of aligning your actions with your goals. It makes sense that if you don't align your work with your client's goals, you can easily under-deliver value to those who rely on your work by spending resources creating results that aren't valuable to them.

(3) Porpoise, don't boil the ocean

- Alex says we should look at the problem/question at a high level and develop a hypothesis, then dive deep into the data, then come back out and re-evaluate the hypothesis, the problem, and the ways in which the data can help with the problem, then dive deep again, and repeat. The goal is to iterate rather than do a few steps. The purpose of this strategy is to make sure your efforts with the data are always in line with solving the problem, and to not spend lots of time working with the data on a hypothesis that won't solve the problem.

(8) Learn, do; Learn do; speed matters more than precision

- I come from a Software Engineering background, so iteration/incremental progress through repeating a process is not new to me, and I've already accepted it as valuable. However, Alex's last point is one I want to consider again in my next analysis project: "Speed matters more than precision." It's more important to get results, re-orient, and get more results, than it is to get some "precisely wrong" answer (see 2).

(9) Focus on the outputs, not tools

- Focus on the client's needs, not on the tools, reports, work, etc. Focus on solving the client's needs, because that's all they care about and value. One of our principles at Amazon is *Customer Obsession*. I believe many of the points Alex made tie directly into this principle. Our ultimate goal is to deliver what the customer values to the customer. Everything else is a means to that end.

(10) Communicate Clearly

- effectively and concisely
- I've had a few talks with others about the importance of stating the relevance to the reader up front. In writing we should always be communicating the next most valuable piece of information to the reader (instead of, for instance, telling a story). They don't want to read, so you need to communicate the most important thing up front and keep elaborating. At some point they will stop reading and you will only have communicated the top x% of your paper to them.

The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#6.

**sample proportion (\(\hat{p}\))** - The proportion of successes in a Bernoulli sample (equal to the sample mean). \(\hat{p} = (1 + 0 + ... + 1) / n\) where (1 + 0 + ... + 1) is number of successes in the sample and n is the sample size.

The distribution of \(\hat{p}\) is nearly normal if:

- "The sample observations are independent." (random sample, less than 10% of population size)
- **success-failure condition** - We expect "to see at least 10 successes and 10 failures in our sample, i.e. \(np \geq 10\) and \(n(1-p) \geq 10\)."

"If these conditions are met, then the sampling distribution of \(\hat{p}\) is nearly normal with mean p and standard error:"

$$ SE = \sqrt{\frac{p(1-p)}{n}} $$

**margin of error** - "The part we add and subtract from the point estimate in a confidence interval." margin of error = \(z^* SE \)

When constructing a confidence interval, we may have to choose a sample size. We may want to make sure the margin of error is less than some amount. For instance, we may want to make sure our margin of error is no larger than 0.025 with a 95% confidence interval.

$$ ME < 0.025 \\ z^* SE \leq 0.025 \\ 1.96 \sqrt{\frac{p(1-p)}{n}} \leq 0.025 $$

"If we have an estimate of p…we could enter in that value and solve for n… It turns out that the **margin of error is largest when p is 0.5**, so we typically use this *worst case value* if no estimate of the proportion is available."
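Sketching that worked example in code (\(z^* = 1.96\) for 95% confidence, worst-case p = 0.5):

```python
import math

z_star = 1.96        # 95% confidence
target_me = 0.025    # desired maximum margin of error
p = 0.5              # worst case: margin of error is largest when p = 0.5

# Solve z* * sqrt(p(1-p)/n) <= target_me for n, rounding up
n_required = math.ceil(z_star ** 2 * p * (1 - p) / target_me ** 2)
```

So at least 1537 observations are needed in this scenario.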

**difference in two proportions (\(\hat{p_1} - \hat{p_2}\))** is used for inference just like a difference in means for hypothesis testing and confidence intervals.

**conditions for the sampling distribution of the difference in two proportions to be normal**

- "each proportion separately follows a normal model"
- "the two samples are independent of each other"

$$ SE_{\hat{p_1} - \hat{p_2}} = \sqrt{SE^2_{\hat{p_1}} + SE^2_{\hat{p_2}}} $$

In calculations, "use the pooled proportion estimate when \(H_0\) is \(p_1 - p_2 = 0\)" $$ \hat{p} = \frac{number\ of\ "successes"}{number\ of\ cases} = \frac{\hat{p_1}n_1 + \hat{p_2}n_2}{n_1 + n_2} $$
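A minimal sketch of the pooled two-proportion calculation, with made-up counts (60 successes in each of two samples of different sizes):

```python
import math

# Made-up samples (not from the book)
p1_hat, n1 = 0.30, 200   # 60 successes out of 200
p2_hat, n2 = 0.24, 250   # 60 successes out of 250

# Pooled proportion under H0: p1 - p2 = 0
p_pool = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)

# Standard error of p1_hat - p2_hat using the pooled estimate
se = math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)

# Test statistic for H0: p1 - p2 = 0
z = (p1_hat - p2_hat) / se
```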

The **chi-square test** can be used "for assessing a null model when data are binned. This technique is commonly used in two circumstances:"

- "Given a sample of cases that can be classified in several groups, determine if the sample is representative of the general population."
- "Evaluate whether data resemble a particular distribution, such as a normal distribution or a geometric distribution."

$$ \chi^2 = \frac{(observed\ count_1 - null\ count_1)^2}{null\ count_1} + \dots + \frac{(observed\ count_k - null\ count_k)^2}{null\ count_k} $$

where \(k\) is the number of groups.

"The chi-square distribution has just one parameter called degrees of freedom (df), which influences the shape, center, and spread of the distribution."

"A large \(\chi^2\) value would suggest strong evidence favoring the alternative hypothesis."

"The **p-value** for this statistic is found by looking at the upper tail of this chi-square distribution. We consider the upper tail because larger values of \(\chi^2\) would provide greater evidence against the null hypothesis." (emphasis added)

**Conditions for the chi-square test**

- **Independence.** "Each case that contributes a count to the table must be independent of all the other cases in the table."
- **Sample size / distribution.** "Each particular scenario (i.e. cell count) must have at least 5 expected cases."

One-way chi-square tests are used when each bin has only one count; two-way chi-square tests are used when each bin has two or more counts. The book gave an example of a one-way test using a jury's composition: bins were races, and each bin contained one value, the number of jurors of that race. It also gave an example of a two-way test: Google testing a new search algorithm. In this case, bins were the algorithm types (current, algo 1, algo 2) and each bin had two values/rows: the number of users who made a new search, and the number of users who did not.

For a one-way test, \(df = k - 1\) where \(k\) is the number of bins.

For a two-way test, \(df = (r-1)(c-1)\) where \(r\) is the number of rows (values per bin) and \(c\) is the number of columns (bins).

In a two way test, the same chi-square formula is used where each cell in the table contributes to the final statistic.

For a one-way test, "when examining a table with just two bins, pick a single bin and use the one-proportion methods…" (above).

For a two-way test, "when analyzing 2-by-2 contingency tables, use the two-proportion methods…" (above).
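A one-way sketch with made-up counts across three groups (in practice the expected counts come from the null model's population proportions times the sample size):

```python
# Made-up one-way example: observed counts in three groups, and the
# counts the null model expects for each group
observed = [50, 30, 20]
expected = [45, 35, 20]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # one-way test: df = k - 1
```

The p-value is then the area above `chi_sq` in the chi-square distribution with `df` degrees of freedom.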

**simulation** - "The p-value is always derived by analyzing the null distribution of the test statistic. The normal model poorly approximates the null distribution for \(\hat{p}\) when the success-failure condition is not satisfied." Instead of using the normal model, we can use a simulation to generate the null distribution.

**double as normal** for two-sided tests - "We continue to use the same rule as before when computing the p-value for a two-sided test: *double the single tail area*." If doubling results in a p-value greater than 1, use 1 as the p-value.

The end of the chapter uses randomization to generate several samples and generate a sampling distribution for proportions. Then it uses this generated sampling distribution to determine the p-value. These randomization techniques are useful for small samples where the conditions for the normal approximation do not hold. This small sample method may be used for any sample size, "and should be considered as more accurate than the corresponding large sample technique."
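A minimal randomization sketch of this idea, with made-up numbers: simulate the null distribution of \(\hat{p}\) and read the p-value off the tail.

```python
import random

random.seed(1)

# Made-up small sample: 2 successes in n = 20 observations, H0: p = 0.25
n, p_null, observed_successes = 20, 0.25, 2
observed_p_hat = observed_successes / n

# Simulate the null distribution of p-hat
sims = 10000
at_least_as_extreme = 0
for _ in range(sims):
    successes = sum(1 for _ in range(n) if random.random() < p_null)
    # one-sided: simulated p-hat as small as or smaller than observed
    if successes / n <= observed_p_hat:
        at_least_as_extreme += 1

p_value_one_sided = at_least_as_extreme / sims
# For a two-sided test, double the single tail area (capped at 1)
p_value_two_sided = min(1.0, 2 * p_value_one_sided)
```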

I'm a Software Engineer, and to me this kind of computation is cheap; I'm interested in continuing later with a computational statistics book.

The labs for this chapter are at joshterrell805/OpenIntro_Statistics_Labs lab#4.1 and joshterrell805/OpenIntro_Statistics_Labs lab#4.2

"**Statistical inference** is concerned primarily with understanding the quality of parameter estimates."

**point estimate** - using a sample statistic to estimate the population parameter. For instance, using the **sample mean, \(\bar{x}\),** as a point estimate of the **population mean, \(\mu\)**.

**sampling variation** - "estimates generally vary from one sample to another"

"Estimates are usually not exactly equal to the truth, but they get better as more data becomes available."

**sampling distribution** - distribution of a point estimate calculated over many samples (of fixed size). For instance, the **sampling distribution of the mean** is the distribution of sample means taken from some population.

**standard error** - the standard deviation of the sampling distribution. "It describes the typical error or uncertainty associated with the estimate."

"The **standard error of the sample mean** is equal to the population standard deviation divided by the square root of the sample size."

$$ SE_{\bar{x}} = \sigma_{\bar{x}} = \frac{\sigma_{x}}{\sqrt{n}} $$

We can use the sample standard deviation, \(s\), to approximate the population standard deviation, \(\sigma\), if "the sample size is at least 30 and the population distribution is not strongly skewed."

**confidence interval** - "a plausible range of values for the population parameter." For example, we could calculate that we are 95% confident that the true population mean of some population lies between (50.1, 52.7). The 95% confidence level is chosen, and the confidence interval, (50.1, 52.7), is calculated using the sample mean (point estimate) and standard error.

$$ CI = point\ estimate \pm z^*SE $$

where \(z^*\) corresponds to the confidence interval selected. (z^{*} = 1.65 for 90%, 1.96 for 95%, and 2.58 for 99%).

"But what does '95% confident' mean? Suppose we took many samples and built a confidence interval from each sample…Then about 95% of those intervals would contain the actual mean, μ."
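That interpretation can be checked by simulation. The sketch below (made-up population parameters, and a known \(\sigma\) for simplicity) builds many intervals and counts how often they cover \(\mu\):

```python
import math
import random

random.seed(42)

# Made-up population: normal with known mean and standard deviation
mu, sigma, n = 50.0, 10.0, 40
z_star = 1.96  # 95% confidence
trials = 1000
covered = 0

for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    se = sigma / math.sqrt(n)  # known sigma for simplicity
    lo, hi = x_bar - z_star * se, x_bar + z_star * se
    if lo <= mu <= hi:
        covered += 1

coverage = covered / trials  # should be close to 0.95
```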

The distribution of the sample mean becomes more normal as the sample size increases due to the central limit theorem.

**central limit theorem** - "In its simplest form, the Central Limit Theorem states that a sum of random numbers becomes normally distributed as more and more of the random numbers are added together. The Central Limit Theorem does not require the individual random numbers be from any particular distribution, or even that the random numbers be from the same distribution. The Central Limit Theorem provides the reason why normally distributed signals are seen so widely in nature. Whenever many different random forces are interacting, the resulting pdf becomes a Gaussian." This quote is from dspguide.com ch#6. It is the best definition I have read for building understanding and intuition.

Conditions for the distribution of the sample mean being nearly normal:

- "The sample observations are independent"
- "The sample size is large: \(n \ge 30\) is a good rule of thumb."
- "The population distribution is not strongly skewed."

"The larger the sample size, the more lenient we can be with the sample's skew." *Sample* and *population* are not typos. We typically estimate the population's skew using the sample.

"If the observations are from a simple random sample and consist of fewer than 10% of the population, then they are independent."

**margin of error** = \(z^*SE\)

confidence != probability

**null hypothesis \(H_0\)** - "often represents either a skeptical perspective or a perspective of no difference."

**alternative hypothesis \(H_A\)** - "often represents a new perspective, such as a possibility that there has been a change."

"The skeptic will not reject the null hypothesis (H_{0}), unless the evidence in favor of the alternative hypothesis (H_{A}) is so strong that she rejects H_{0} in favor of H_{A}."

"Failing to find strong evidence for the alternative hypothesis is not equivalent to accepting the null hypothesis." We just say that we fail to reject the null (default) hypothesis because the evidence is insufficient to persuade us that the null hypothesis is false.

**null value** - "value of the parameter if the null hypothesis is true." The null hypothesis might be that there is no difference between the average test scores of one teacher's class and another teacher's. In this case the null value of 0 represents that we expect, by default, zero difference between the average test scores.

**Type 1 Error** - "rejecting the null hypothesis when H_{0} is actually true." (False positive)

**Type 2 Error** - "failing to reject the null hypothesis when H_{A} is actually true." (False negative)

**significance level \(\alpha\)** - a threshold determining how often we are willing to make a type 1 error. Typically \(\alpha = 0.05\) is used, which means that 5% of the time, we will incorrectly reject the null hypothesis when the null hypothesis is actually true. We could decrease alpha, thus decreasing the likelihood of making a Type 1 Error, but "if we reduce how often we make one type of error, we generally make more of the other type."

**p-value** - "way of quantifying the strength of the evidence against the null hypothesis and in favor of the alternative."

**p-value (formal)** - "the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true."

"Always use a two-sided test unless it was made clear prior to data collection that the test should be one-sided." "Hypotheses must be set up *before* observing the data. If they are not, the test should be two-sided."

"The significance level selected for a test should reflect the consequences associated with Type 1 and Type 2 Errors." "If making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g. 0.01)." "If a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we should choose a higher significance level (e.g. 0.10)."

**unbiased (point estimate)** - "A point estimate is unbiased if the sampling distribution of the point estimate is centered at the parameter it estimates." We can apply confidence interval and hypothesis testing methods to unbiased point estimates since their sampling distributions approximate the normal model.

The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#3.

**normal == gaussian**

**standard normal distribution** - normal curve with \(\mu = 0, \sigma = 1\) where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the curve.

**z-score** - "the number of standard deviations [an observation] falls above or below the mean"

$$ Z = \frac{x - \mu}{\sigma} $$

**percentile** - the percentage of observations that fall below a given threshold. If Ann did better than 84% of SAT test takers, then "Ann is in the 84^{th} percentile of test takers."

**68–95–99.7 rule** - in the normal distribution, 68% of the data lie within 1 standard deviation of the mean, 95% lie within 2 standard deviations, and 99.7% of the observations lie within 3 standard deviations of the mean. This rule can help with approximations.
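A small sketch tying the z-score to the percentile, using hypothetical SAT numbers (mean 1500, standard deviation 300) for the Ann example:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical: scores are N(mu=1500, sigma=300) and Ann scored 1800,
# so her z-score is 1 and she is near the 84th percentile
z = (1800 - 1500) / 300
percentile = normal_cdf(z)
```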

**normal probability plot (aka quantile quantile (qq) plot)** - "the closer the points are to a perfect straight line, the more confident we can be that the data follow a normal model." Examples of qq plots can be found in this chapter's lab.

**bernoulli random variable** - if "an individual trial only has two possible outcomes." E.g. heads/tails or win/lose. Typically one possible outcome is labeled as success, 1, and one outcome is labeled as failure, 0.

**sample proportion (\(\hat{p}\))** - the sample mean of a sample of bernoulli observations.

\(p\) is the probability of observing a success, or the population mean (\(\mu = p\)).

\(\sigma\) is the standard deviation of the population \(\sigma = \sqrt{p(1 - p)}\)

**geometric distribution** - "describes the waiting time until a success for **independent and identically distributed (iid)** bernoulli random variables"

**iid** - independent and identically distributed. "*[independent]* means the individuals in the example don't affect each other, and *identical* means they each have the same probability of success."

probability of observing the first success on the *n ^{th}* trial (n-1 failures, 1 success):

$$ (1 - p)^{n-1}p $$

mean, expected value, or expected number of observations until observing the first success: $$ \mu = \frac{1}{p} $$

variance of the wait time until observing the first success: $$ \sigma^2 = \frac{1 - p}{p^2} $$
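A sketch of these geometric formulas with an assumed p = 0.2:

```python
# Per-trial success probability (assumed value for illustration)
p = 0.2

def geometric_pmf(n, p):
    """Probability the first success lands on trial n:
    (n - 1) failures followed by one success."""
    return (1 - p) ** (n - 1) * p

mean_wait = 1 / p                  # expected trials until first success
variance_wait = (1 - p) / p ** 2   # variance of the wait time

# The probabilities over all n sum to 1 (checked over a long horizon)
total = sum(geometric_pmf(n, p) for n in range(1, 200))
```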

**binomial distribution** - "describes the probability of having exactly k successes in n independent bernoulli trials with probability of success p."

$$ \binom{n}{k}p^k(1-p)^{n-k} $$

mean, expected number of successes in n trials with p probability of success: $$ \mu = np $$

variance in the expected number of successes in n trials: $$ \sigma^2 = np(1-p) $$

**normal approximation of the binomial distribution** - "The binomial distribution with probability of success p is nearly normal when the sample size n is sufficiently large that \(np\) and \(n(1-p)\) are both at least 10." Use the previous formulas for the mean and standard deviation of the normal distribution. "The normal approximation...tends to perform poorly when estimating the probability of a small range of counts, even when the conditions [above] are met." To improve the accuracy of the normal approximation for an interval of values (i.e. the probability that between 15 and 20 successes are observed in 20 trials), "the cutoff values for the lower end...should be reduced by 0.5, and the cutoff value for the upper end should be increased by 0.5." (Continuing the previous example, we should use 14.5 and 20.5 as the limits when finding the area under the normal curve.)
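Sketching the 15-to-20-successes example with n = 20 trials and an assumed p = 0.8 (p is my assumption, not a value from the book):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# P(15 <= successes <= 20) with n = 20 trials and assumed p = 0.8
n, p = 20, 0.8
exact = sum(binom_pmf(k, n, p) for k in range(15, 21))

mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Continuity correction: widen the interval by 0.5 on each side
approx = normal_cdf((20.5 - mu) / sigma) - normal_cdf((14.5 - mu) / sigma)
```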

**negative binomial distribution** - "The geometric distribution describes the probability of observing the first success on the n^{th} trial. The negative binomial distribution is more general: it describes the probability of observing the k^{th} success on the n^{th} trial...All trials are assumed to be independent."

$$ \binom{n-1}{k-1}p^k(1-p)^{n-k} $$

Think about it: in n-1 trials, we need to observe exactly k-1 successes (binomial distribution). On the last trial, we observe a success, so the binomial distribution would be as follows, and we'd just have to multiply it by \(p\) to account for the last success:

$$ \binom{n-1}{k-1}p^{k-1}(1-p)^{(n-1) - (k-1)} $$
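That reasoning can be checked numerically with assumed values of p, k, and n:

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def neg_binom_pmf(k, n, p):
    """Probability the k-th success occurs on the n-th trial."""
    return math.comb(n - 1, k - 1) * p ** k * (1 - p) ** (n - k)

# k-1 successes in the first n-1 trials (binomial), then a success
# on trial n (times p) -- assumed values for illustration
p, k, n = 0.3, 3, 10
direct = neg_binom_pmf(k, n, p)
via_binomial = binom_pmf(k - 1, n - 1, p) * p
```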

**poisson distribution** - "useful for estimating the number of events in a large population over a unit of time"

**rate (λ)** in the poisson distribution "is the average number of occurrences in a mostly-fixed population per unit of time." E.g. about λ = 4.4 individuals per day are hospitalized for acute myocardial infarction in New York City (example from the book).

probability of observing k events in the time unit of λ: $$ \frac{\lambda^ke^{-\lambda}}{k!} $$

mean = variance: $$ \mu = \sigma^2 = \lambda $$
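A sketch using the λ = 4.4 rate from the example above:

```python
import math

lam = 4.4  # average hospitalizations per day (rate from the book's example)

def poisson_pmf(k, lam):
    """Probability of observing exactly k events in one time unit."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probability of exactly 3 hospitalizations in a day
p3 = poisson_pmf(3, lam)

# Mean and variance are both lambda
mean = variance = lam

# The probabilities over all k sum to 1 (checked over a long horizon)
total = sum(poisson_pmf(k, lam) for k in range(100))
```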

Welcome to my new blog!

I'm back to using my own software. Back in December last year I made a post about switching to a static site. I've been using Hexo for almost a year now. It works well, but I have two complaints:

Rendering just 14 posts in Hexo took 15 seconds on a 1GB DigitalOcean machine. There's something Hexo is doing dramatically wrong, because just calling `hexo --help` takes 4 seconds!

My new site takes about half a second to render all 14 posts, currently. This is with almost no caching implemented. The only caching I did was pretty cheap: I just make sure I don't read a file from disk more than once. But there's no time-diff checking to prevent me from re-rendering content that hasn't changed.

Arguably, I could have spent a few hours reading the docs and code to figure out how to add my own pages, and remove the content I didn't want. But from messing around with the code here and there over the last few months, the task seemed daunting.

My current setup is bare-bones; it's only what I want. The actual logic is:

- read the data
  - posts
  - post dependencies (html template partials)
- render the html
  - each post
  - each non-post page (for example, the tags page or the blog index)

If I want to edit things two years down the road, I have a clear entry point and less than 100 lines of code to read through to understand the data-flow.
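That data-flow, as a rough sketch (function and template names here are made up, not my actual code):

```python
def read_posts():
    # In the real site this reads post files and template partials from disk,
    # caching each file so nothing is read more than once.
    return [{"title": "Hello", "body": "First post."}]

def render_post(post, template="<h1>{title}</h1><p>{body}</p>"):
    return template.format(**post)

def render_site():
    pages = [render_post(p) for p in read_posts()]
    # ...plus the non-post pages (tags page, blog index) rendered the same way.
    return pages
```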

I started doing some very interesting things with neural networks and textual documents in the last month for SentiMetrix. Something I've been putting off for a while is understanding how word2vec works. Now I am interested in how one might build a model like word2vec, but that doesn't treat each word as a separate entity. With word2vec, "awesome" and "awsome" are treated as two entirely different words. "Awesome" might have id#57 and "awsome" might have id#9992. It is only through looking at many contexts that w2v would be able to infer that awesome and awsome, because they are used in the same contexts, are related. First, before we dive into my current thoughts, let's cover some of the theory…

Bag of words is one of the simplest models of how to represent a piece of text. For each document, the bag of words model counts how many times each word occurs. The bag of words representation for the document "the cat ate the mouse" is \(\{mouse:1, cat:1, the:2, ate:1\}\). One problem with the bag of words model is that it doesn't take context or word ordering into account. Socher gives an example of why this is not optimal: *"For instance, while the two phrases 'white blood cells destroying an infection' and 'an infection destroying white blood cells' have the same bag-of-words representation, the former is a positive reaction while the latter is very negative."*
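The counting itself is a one-liner in Python:

```python
from collections import Counter

# bag of words for "the cat ate the mouse"
bag = Counter("the cat ate the mouse".split())
# Counter({'the': 2, 'cat': 1, 'ate': 1, 'mouse': 1})
```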

Tf-idf is an improvement over bag of words which weighs words that occur less frequently in the set of documents as more important. The example on wikipedia asks us to imagine trying to find documents which are most related to the string "the brown cow". If we just use the bag of words, then the word "the" might play too much of an influence in our ranking of relevant documents. However, if we were to somehow realize that "brown" and "cow" are more important than "the", we could probably rank the documents better. The way we do this is with Tf-Idf. Each term's frequency is multiplied by the inverse document frequency—a number which is big for rare words and small for common words. The result of multiplying the term frequency by the inverse document frequency is a number that is larger as the term is more frequent in the document and larger as the term is less frequent in other documents.
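As a small sketch of one common tf-idf variant (there are several weighting schemes; this uses raw term counts and a plain log inverse document frequency):

```python
import math
from collections import Counter

docs = [doc.split() for doc in
        ["the brown cow", "the white cow", "the green tree"]]

def idf(term):
    df = sum(term in doc for doc in docs)   # number of docs containing term
    return math.log(len(docs) / df)         # 0 for terms in every document

def tf_idf(term, doc):
    return Counter(doc)[term] * idf(term)

# "the" appears in every document, so its weight collapses to 0,
# while "cow" keeps a positive weight in the documents that contain it.
```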

Word2vec is a process which creates a vector per word such that words that are similar are close to each other. The following example is taken from blog.krecan.net. Let's imagine we have some words: *car, motorcycle, lamp, cat, horse, cow, pig, lamb, pork, hamburger, pizza and sushi*. Let's also imagine we have a table, and each of the words is cut out on a piece of paper. How would we arrange the words on the table such that similar words are close to one another? Here is one such solution:

With word2vec we can have more than two dimensions. In fact, rather than only having two dimensions (the width and depth of the table), word2vec can project words onto as many different dimensions as we want (typically a few hundred). The principle is still the same—words that are close to one another are related.

Having vectors for words instead of ids is awesome! The power of vectorized words that have meaning and relation to other words is being utilized in a lot of useful applications. Just try searching for "word2vec applications" or "word2vec" in your news feed!

Word2vec still falls short. First, word2vec still starts by encoding each word into an id. This means that, in the beginning at least "awesome" and "awsome" are two completely unrelated integers in the eyes of word2vec. We have to feed a lot of documents to word2vec before it is able to infer that "awesome" and "awsome" are located very closely in the vector space and are nearly interchangeable. Second, word2vec doesn't do so hot with phrases. There are some tools that can detect phrases by essentially treating sequences of words that occur often together as a single word. For example, "toilet" precedes "paper" so frequently that some tools have the ability to treat "toilet paper" as a single word—as a single id. However this amplifies the first problem. If misspelling one word is a problem, now we have two words which means (hypothetically) we are twice as likely to suffer from misspellings in "toilet paper" than in "toilet" or "paper."

Autoencoders are tools which are typically used to form compressed or simplified encodings of data. They are neural networks with the goal of predicting an exact copy of the input from the input (not a typo).

Imagine a function which has 20 inputs, 10 internal variables, and 20 outputs. The goal is to organize the function in such a way that the input gets stored completely in the 10 internal variables, then the output is created by only looking at the internal variables such that the output of the function is exactly equal to the input of the function. If we can store 20 values worth of information in 10 values, then exactly reconstruct the 20 original values from the 10 values, we have successfully developed a compression algorithm. This is only possible if there is structure in the 20 values. Autoencoders can also be used to learn a lot more than a compression algorithm: they can learn any encoding of the data from which the data can be reconstructed. Autoencoders automatically find an encoding of data.
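A forward pass of that 20-10-20 shape might look like the following sketch (untrained random weights; in real use the weights would be trained to minimize reconstruction error):

```python
import math, random

random.seed(0)
# hypothetical 20 -> 10 -> 20 autoencoder, forward pass only
W_enc = [[random.gauss(0, 0.1) for _ in range(20)] for _ in range(10)]
W_dec = [[random.gauss(0, 0.1) for _ in range(10)] for _ in range(20)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def autoencode(x):
    code = matvec(W_enc, x)                  # the 10 internal values
    hidden = [math.tanh(c) for c in code]    # nonlinearity
    return matvec(W_dec, hidden)             # 20-value reconstruction

x = [random.gauss(0, 1) for _ in range(20)]
x_hat = autoencode(x)  # same shape as x; training would push it close to x
```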

- recursive autoencoders
- A new idea for combining recursive autoencoders and word2vec concepts using strings (not wordids) to create string embeddings that capture more context than word2vec and are more flexible with regard to typos

- Bag of Words - Wikipedia
- Socher - Semi-supervised recursive autoencoders for predicting sentiment distributions
- tf-idf - Wikipedia
- Word2vec - Wikipedia
- blog.krecan.net - Machine Learning - word2vec results
- Autoencoders - Stanford

I never ended up making a Part 2, but I want to tie up this post.

Regarding recursive autoencoders: I will be studying these more and hope to write something about them soon. I will reserve discussion about them until that point.

Regarding the new idea: the idea was one I came up with while working for SentiMetrix. I contacted my boss and asked if he was planning on pursuing the idea, as I wanted to discuss it here. He said they might, so until they say no, I'm not going to pursue this idea. Instead, I came up with another idea which sounds pretty fun and uses recursive autoencoders to generate images. I hope to be reading the few related papers I found soon so I can work on it and post about it here.

The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#5.

**t-distribution** - similar to the normal distribution, but with thicker tails. Estimating the standard error from a small dataset is less accurate than using a large dataset. The thick tails of the t-distribution "resolve the problem of a poorly estimated standard error." The t-distribution is parameterized by degrees of freedom. As \(df \to \infty\), t-distribution approaches normal. The formula for degrees of freedom is: \(df = n - 1\) where \(n\) is the sample size.

**conditions for using t-distribution** - 1) independence of observations. 2) observations come from a nearly normal distribution. The second condition can be relaxed as sample size increases. The t-distribution eliminates the third condition, a large sample size (>30), that is needed when using the normal distribution.

Recall that the confidence interval is a range (indicated by a lower bound and an upper bound) which is X% likely to contain the true population mean. It is calculated using a sample from the population. The confidence interval for a normal distribution:

$$ \bar{x} \pm z^* \times SE $$

Where \(\bar{x}\) is the sample mean, \(SE = s / \sqrt{n} \) is the standard error of the mean, and \(z^*\) is a z-score parameterized by how confident we want the interval to be.

\(z^*\) is the number of standard deviations away from the mean that contains X% of the normal distribution. For instance, if we use \(z^* = 1.645\), then 90% of the data lies within \(z^* = 1.645\) standard deviations of the mean.

To calculate \(z^*\), we can use a table or a special calculator. We can use stat trek's normal probability calculator. In this calculator we'd leave \(\bar{x} = 0, s = 1\). We'd plug in \(P(Z \leq z) = 1-(1-X)/2\), not \(P(Z \leq z) = X\) (95% -> 97.5%, 90% -> 95%, etc). We have to adjust our X percent because this calculator asks for "the probability of drawing a value less than z" not "the probability of drawing a value between [-z, z]," which is what we want. If we want a confidence interval of 95%, we'd plug in \(P(Z \leq z) = 0.975\) to obtain \(z^* = 1.960\).
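Instead of a table or web calculator, Python's standard library can compute \(z^*\) directly (a small sketch using `statistics.NormalDist`):

```python
from statistics import NormalDist

def z_star(confidence):
    # adjust X% -> 1 - (1 - X)/2 exactly as described above
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

z95 = z_star(0.95)  # ≈ 1.960
z90 = z_star(0.90)  # ≈ 1.645
```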

Calculating the confidence interval around a mean using the t-distribution is very similar to using the normal distribution. The only difference is, rather than multiplying by Z, we multiply by \(t\) which is additionally parameterized by the degrees of freedom, \(df\).

\[ \bar{x} \pm t_{df}^* \times SE \]

Where \(t_{df}^*\) is a t-value roughly equal to the number of standard deviations away from the mean using the t distribution. Just like \(z^*\), \(t_{df}^*\) is calculated using a table or a calculator.

Stat trek's t-distribution calculator is useful for calculating the t value. As an example, if we have \(n = 15\) samples and want a confidence interval of \(90\%\), using stat trek we can plug in \(df = 14\) and \(P(T \leq t) = 0.95\) (we want a 0.90 interval…0.05 on each side) to get \(t = 1.761\). In the confidence interval formula above, we'd plug in 1.761 for \(t_{df}^*\).

Notice that the t-value for a 90% confidence interval using n=15 samples, 1.761, is slightly larger than the z-value for a 90% confidence interval, 1.645. This will always be the case. Since we have a small sample, we are less confident, so we need a wider confidence interval, or a larger t/z value. As \(n \to \infty\), \(t \to z\).

The t-test is almost identical to the z-test. Just like when calculating a confidence interval, the only difference is whether we parameterize our z/t value with the degrees of freedom. Recall that the z-test uses a p-value to determine "the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true."

The null hypothesis is typically that two samples come from the same population (same mean and standard deviation), or that a measured sample mean and some known mean are equal. Either way, we usually assume the difference in means is 0 (e.g. the drug doesn't decrease appetite or the exam style doesn't affect test scores). If we are interested in just testing whether the means are different, we do a **two-sided test**. If there is reason to believe, before gathering the data, that we'd expect one mean to be larger than the other, we'd use a **one-sided test**. Using a one-sided test depends on the specifics of the problem (i.e. we expect a drug to improve some measure), not on the observed sample data.

To perform the t test, we need a T value indicating how different our means are. The equation for T is identical to Z: \(T = (x - \bar{x}) / s\). Because we are comparing means, not samples, we need to use the standard error of the mean, not the sample standard deviation in this formula, so: \(T = ((\bar{x_b} - \bar{x_a}) - 0) / SE\). Both formulas measure how many standard deviations the sample is from the mean. The first one assumes the sample, x, comes from a population with a mean of \(\bar{x}\) and a standard deviation of \(s\). The second formula assumes the sample, \(\bar{x_b} - \bar{x_a}\), comes from a population with a mean of \(0\) (null hypothesis) and a standard deviation of \(SE\) (standard deviation of the sample mean).

If we have two samples to compare, we can use the pooled sample variance formula to calculate the variance of both samples combined, then use that pooled variance to calculate \(SE\). If we're comparing one sample's mean with a known mean, we can just use \(\frac{s}{\sqrt{n}}\) of the sample.

Once we obtain the T value, we can plug it into the t-distribution calculator to determine the probability of obtaining a difference in means at least that large, given the degrees of freedom. If we have two samples, we can use the smaller sample number in the degrees of freedom to be more cautious (higher probability of type 2 error) or we can use a specific formula (statsdirect has formulas for both the pooled sample variance and the degrees of freedom).

As an example, if we calculated that \(t = 1.2\) with \(df = 14\), and we were doing a two-sided test, we could plug in these values to the t-distribution calculator to obtain \(P(T \leq t) = 0.8750\). So the probability of drawing a sample with a T less than \(t = 1.2\) is 0.875. We then adjust this value to get what we need: the p-value of a two-sided test is the probability of getting a T value at least as extreme as our t value. \(P(T \leq t) = 0.8750 \implies P(T \geq t) = 0.1250 \implies P(T \geq t) + P(T \leq -t) = 0.2500\). Thus our p-value is 0.25. A standard \(\alpha\) is 0.05, and with this alpha we would not reject the null hypothesis.
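The standard library has no t-distribution, but the whole worked example can be checked with a short numeric sketch: integrate the t density (here with Simpson's rule) to get \(P(T \leq t)\), then derive the two-sided p-value, and recover the earlier \(t^*_{14}\) by bisection.

```python
import math

def t_pdf(x, df):
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=10_000):
    # symmetry: P(T <= x) = 0.5 + integral of the pdf from 0 to x (x >= 0)
    h = x / steps
    s = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        s += t_pdf(i * h, df) * (4 if i % 2 else 2)
    return 0.5 + s * h / 3

# two-sided p-value for t = 1.2, df = 14 (matches the 0.25 above)
p_value = 2 * (1 - t_cdf(1.2, 14))

def t_star(confidence, df):
    # bisect for the t with P(T <= t) = 1 - (1 - confidence)/2
    target, lo, hi = 1 - (1 - confidence) / 2, 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if t_cdf(mid, df) < target else (lo, mid)
    return lo

t90_14 = t_star(0.90, 14)  # ≈ 1.761, matching the confidence-interval example
```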

*Note: The meaning of \(t\) and \(T\) when using the calculator are different in this context of a t-test than above when calculating a confidence interval. You just have to look at what variables the calculator allows you to plug in. This calculator allows us to specify or calculate \(t\), not \(T\). It uses \(T\) to help explain the direction of the calculation.*

**paired observations** - "each observation in one set has a special correspondence or connection with exactly one observation in the other data set." For example we may measure 10 athletes' sprint times with and without using our energy drink. Rather than comparing \(\bar{x_{none}}\) and \(\bar{x_{energy}}\), we can create a new sample which consists of 10 data points: "sprint time of subject n using the energy drink" minus "sprint time of subject n without the energy drink". We can calculate the mean of this difference sample and compare it directly to 0, the expected change in performance given the null hypothesis.

**statistical power** - "if there is a real effect, and the effect is large enough that it has practical value, then what's the probability that we detect that effect?" We can create a tiny p-value by just using a huge sample, but a drug decreasing someone's symptoms by 0.0001%, while statistically significant, may not be practically significant. Power helps us calculate the probability of achieving a practically significant result, and it helps us determine the proper sample size to help us reduce the risks/costs of running an experiment.

**effect size** - practically interesting difference in means.

As an example, let's suppose a teacher gave out two versions of a quiz, A and B. She determines that a 2 point difference on the quiz is a practically significant difference; 2 is the effect size. She wants to determine the probability of detecting a 2 point difference on the quizzes when using a z/t test, or the power.

In the picture above, the null hypothesis is in blue (no difference in quiz scores), the alternative is in red (+2 point difference in quiz scores).

*For this example we're going to assume the SE = 1 such that z == difference in means since that's what the picture shows.*

If we were doing a t test on the quiz scores, we'd determine the p-value—the probability of observing a mean greater than or equal to the measured mean assuming that the true difference in means is 0 (the null hypothesis). If the p-value was less than \(\alpha\), we'd reject the null hypothesis. To calculate power, we ask the question: "what percentage of the alternative hypothesis lies beyond the significance threshold, \(\alpha\)?" If \(H_0\) is false and \(H_1\) is true, we will detect a difference in means only for the portion of \(H_1\) that lies beyond the significance threshold.

So, assuming \(\alpha = 0.05\), first we calculate the z-score threshold on the null hypothesis, \(z = 1.645\). When performing the z-test, if we observe a difference in means with \(z \gt 1.645\), we will reject the null hypothesis. Now lets assume the alternative hypothesis is actually true. What percentage of the alternative hypothesis lies beyond this z value? Using the calculator with a mean difference in sample means of 2, we can calculate that the probability of observing a difference in means greater than or equal to 1.645 is 0.63871—the power.

Thus, if using an alpha of 0.05 and an effect size of 2, the teacher would only observe a difference big enough to reject the null hypothesis 64% of the time. To have a greater probability of detecting the effect size, or a greater power, she should increase the sample size to reduce the standard error of the mean (in this example we assumed SE = 1; with a larger n, the distributions would become narrower and more of the alternative hypothesis would lie beyond the significance threshold).
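Under the same SE = 1 assumption, the power calculation is a couple of lines with the stdlib normal distribution:

```python
from statistics import NormalDist

alpha, effect_size, se = 0.05, 2.0, 1.0

# one-sided rejection threshold under H0 (mean 0)
z_threshold = NormalDist(0, se).inv_cdf(1 - alpha)        # ≈ 1.645
# power: mass of H1 (mean = effect size) beyond that threshold
power = 1 - NormalDist(effect_size, se).cdf(z_threshold)  # ≈ 0.639
```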

Power can also be used to determine \(n\) given \(power\)—how big your sample size should be given you want to be \(power\)% likely to find a difference at the effect size. Just solve backwards :)

**data snooping/fishing** - looking at the data and only afterwards deciding which parts to test. "Naturally we would pick the groups with the large differences for the formal test, leading to an inflation in the Type 1 Error rate."

**prosecutor's fallacy** - Confusing a marginal probability with conditional probability. Concept stew explains it well.

**ANOVA conditions** - "all observations must be independent, the data in each group must be nearly normal, and the variance within each group must be approximately equal."

**ANOVA-F** - If there are many samples to compare, we can use Anova-F to test whether the samples are different, then if there is a difference, we can use multiple two-sample t-tests to determine which samples are different after applying the Bonferroni correction.

**bonferroni correction** - used when testing many pairs of groups to help control type 1 error rate. \(\alpha^* = \alpha / K\) where K is the number of comparisons being made. "If there are *k* groups, then usually all possible pairs are compared and \(K = \frac{k(k-1)}{2}\)."

As I mentioned in one of my recent posts, we're using neural nets at SentiMetrix and my familiarity with them is less than optimal. This weekend I'm going to follow the TensorFlow tutorials so I can be more effective at working with them.

I posted my work on following along with the tutorial at joshterrell805/Learning_TensorFlow.

**tensor** - n dimensional array

**one hot encoding** - replace a single dimension having n distinct values with n dimensions. Each of the new n dimensions is a binary column representing the occurrence of one of the distinct values.
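A minimal sketch of the idea:

```python
def one_hot(value, categories):
    """Replace one n-valued column with n binary columns."""
    return [1 if value == c else 0 for c in categories]

one_hot("cat", ["cat", "dog", "bird"])  # [1, 0, 0]
```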

**softmax** - function that converts predicted values in one-hot format (floating-point (non-binary) since they are predictions not truth labels) into probabilities. The probabilities add to one. \(softmax(\hat{y}) = normalize(exp(\hat{y}))\)

**cross-entropy** - is used as a cost function.

$$ cross\_entropy(y, \hat{y}) = - \sum_i{y_i log(\hat{y_i})} $$

Note: this is applied after softmax, so the cost is zero if \(\hat{y} = y\) exactly, and increases as confidence decreases in the correct class.
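Both functions are small enough to sketch in plain Python (stdlib only, not the TensorFlow versions):

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]   # exp(y_hat)
    total = sum(exps)
    return [e / total for e in exps]       # normalize -> probabilities sum to 1

def cross_entropy(y, y_hat):
    # y is a one-hot truth label, y_hat a probability vector from softmax
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi)

confident = softmax([0.0, 10.0, 0.0])  # nearly all mass on class 1
unsure = softmax([0.0, 0.0, 0.0])      # uniform: 1/3 each
# cross_entropy([0,1,0], confident) is near 0; with `unsure` it is ln(3)
```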

After doing the tutorial, I understood everything up to the `cross_entropy` step in the code to a good point. `placeholder`s are variables that the user must input, `variable`s are variables that are calculated through steps in the graph (and you can save them to disk and restore a graph using them), and you connect the `placeholder`s and `variable`s by creating a graph of mathematical operations.

What I didn't understand is some magic going on in the gradient descent optimization step. We provide the cost function (entropy) to the `GradientDescentOptimizer`, and somehow gradient descent is able to trace the graph back, starting from the cost function, to determine how it needs to update the bias and weights. The optimizers section of the docs doesn't explain everything down to the detail I'd need to make my own optimizer, but it does explain that there is a `GraphKeys.TRAINABLE_VARIABLES` list on the graph (the cost function we provide). I'd like to learn more on how the parts connect, but I think this is enough information for now so I can continue with the next tutorial.

This was cool! I think I understand the graph setup of tensorflow a bit more after working through this tutorial, and I looked up some functions along the way, so I'm developing that familiarity :)

Deep MNIST for Experts github work

The beginning of this tutorial goes back over the previous tutorial, but explains a lot of what was missing the first time, including a bit more of what's happening on the `GradientDescentOptimizer` line.

Next we go into building a convolutional neural net to increase the accuracy from 92% to 99.2%.

First, I'm taking a brief detour to understand a bit about CNNs. I'm reading parts of chapter 6 of Nielsen's book.

Nielsen explains that traditional NNs and DNNs don't take advantage of the 2D nature of images, however CNNs can and do.

**local receptive field** - the 2D region (in this problem) in the input which maps to a single hidden neuron.

**stride length** - how many pixels to move the local receptive field by when creating the hidden layer. This number affects the hidden layer's size.

**shared weights and biases** - each of the local receptive fields map to a hidden neuron. In the NN, the weights and bias are shared between all of these local receptive field to hidden neuron mappings (weird!).

Since the weights and bias are shared amongst all local receptive fields, the same feature is detected in each of the local receptive field to hidden unit mappings. Nielsen says these shared weights and biases are often called the **kernel** or **filter**. What's effectively going on is we're sliding a window around the image looking for the same feature (eg a diagonal line). Cool!
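That window-sliding can be sketched in a few lines of plain Python (stride 1, a single feature map; illustrative only):

```python
def convolve(image, kernel, bias=0.0):
    """Slide one shared kernel (weights) + bias over the whole image."""
    kh, kw = len(kernel), len(kernel[0])
    return [[bias + sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(kh) for b in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

# a 3x3 "diagonal line" detector applied everywhere in a 5x5 image
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
image = [[1 if r == c else 0 for c in range(5)] for r in range(5)]
feature_map = convolve(image, kernel)  # 3x3 output, strongest on the diagonal
```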

Also, we don't have to detect just one feature in the image. We can detect multiple features anywhere in the image by making multiple, parallel, hidden first layers in the CNN.

(a question that pops to mind: okay we can detect any feature of a static size within the image, but what about detecting a small ball or a big ball anywhere within the image?)

CNN layers appear to have far fewer weight parameters than standard, fully connected layers (since all parts of the image share the same weights).

**pooling** - downsamples the image, essentially. Max pooling takes the 2D local receptive field, say 4x4 pixels, and outputs a value equal to the max of those 16 pixels. There's also L2 pooling (which takes the L2 norm of the window instead of the max).
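Max pooling is similarly tiny to sketch (non-overlapping windows):

```python
def max_pool(image, size=2):
    """Downsample by taking the max of each size-by-size window."""
    return [[max(image[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(image[0]) - size + 1, size)]
            for i in range(0, len(image) - size + 1, size)]

max_pool([[1, 2, 5, 6],
          [3, 4, 7, 8],
          [9, 1, 1, 1],
          [2, 2, 1, 1]])  # [[4, 8], [9, 1]]
```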

The tutorial states we're going to be using ReLU (rectifier) neurons, which is a name for \(max(0, x)\). It states that we "should generally initialize weights with small amounts of noise for symmetry breaking, and to prevent 0 gradients." We should also initialize our ReLU units with a bit of positive bias to avoid dead neurons.

One very cool thing the tutorial brought up was using dropout to reduce over-fitting. Dropout does not work the way I thought it did before reading this article and the paper it linked to, and I didn't know that dropout reduced over-fitting. I've included a link to the dropout paper below.

Most of my learning was theoretical and about CNNs, but following this second tutorial was also good for practical knowledge of TensorFlow—especially learning how to create more complicated graphs and using dropout. See joshterrell805/Learning_TensorFlow for my work on following along with these first two tutorials.

- http://neuralnetworksanddeeplearning.com/chap3.html - seemingly good book on neural nets
- https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf - dropout reduces overfit!

We use a lot of neural nets at SentiMetrix. I'm still trying to develop a solid foundation in statistics before I move to my ML book and more ML papers, but I need to take a brief jump ahead and read on LSTMs so I can function better. Below are my notes on reading about the principles of LSTMs.

Long Short Term Memory networks … are a special kind of RNN. [RNNs] are networks with [feedback] loops, allowing information to persist. [colah]

[karpathy] gives an example of a simple RNN with a single hidden vector, \(h\). The network takes in an input vector, \(x\), and produces an output vector, \(y\). It looks something like: \(y = f(x)\) where \(f(x)\) is a function that multiplies \(x\) by \(h\) and updates \(h\). \(f(x)\) is stateful—the value of \(f(x)\) depends not only on the current value of the input, but on the entire history of the input.
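A toy scalar version of that statefulness (made-up weights, purely illustrative):

```python
import math

class TinyRNN:
    def __init__(self, w_x=0.5, w_h=0.8):
        self.h = 0.0                    # hidden state: the network's memory
        self.w_x, self.w_h = w_x, w_h   # illustrative input/recurrent weights

    def step(self, x):
        # output depends on the input AND on everything seen before (via h)
        self.h = math.tanh(self.w_x * x + self.w_h * self.h)
        return self.h

rnn = TinyRNN()
y1 = rnn.step(1.0)
y2 = rnn.step(1.0)  # same input, different output: f(x) is stateful
```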

[wildml] explains how a simple neural network is implemented. They set up a neural network with two inputs and two outputs, and a hidden layer of three nodes. They choose the activation function, which is a "function that transforms [the] inputs of the layer into its outputs," to be \(tanh(x)\) because it "performs quite well in many scenarios."

Brief aside: Sebastian Raschka's Quora answer to the role of activation functions in NNs explains a bit more on the purpose and different types of activation functions.

[wildml] also explains that we use the softmax function on the output to convert class values into class probabilities.

Back to colah's article: Colah does a great job explaining what makes an LSTM different from an RNN in the section titled "Step-by-step LSTM walk through". I still need to develop my understanding of LSTMs and RNNs, but this is enough, I think, to get me a bit more comfortable working with them.

The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#2.

**probability** - "proportion of times the outcome would occur if we observed the random process an infinite number of times."

**disjoint** - mutually exclusive events (impossible to flip a coin once and have it be both a heads and a tails).

**addition rule**

$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$

Handy references:

**probability distribution** - "table of all disjoint outcomes and their associated probabilities"

**marginal probability** - probability of event A occurring without regard to any other variable. \( P(A) \) (eg probability of randomly picking a smoker without regard to income, sex, race, etc). Called marginal because they used to be found in the margins of probability tables (see wikipedia, and the book said this too iirc).

**joint probability** - probability of two or more events co-occurring. \( P(A \cap B) = P(A, B) \) (eg probability of randomly picking a smoker that is also a Caucasian).

**conditional probability** - probability of an event occurring given another event has already occurred. \( P(A \mid B) \). Pipe (|) = "given". (eg probability of a randomly picked person being a smoker (A) given that the person is female (B)).

**some probability rules**

$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \iff P(A \cap B) = P(A \mid B) \cdot P(B) $$

The probability that A occurs given B is equal to the probability that both A and B occur scaled by the probability of B (left side of equivalence). Thinking about it from the equation on the right side of the equivalence makes more sense if you remember that "or" is addition in probability and "and" is multiplication. The right side of the equivalence states that the probability of "A and B occurring" is equal to the probability that "A occurs given B" and (multiplied by) the probability that "B occurs". For example, the probability that "a random person both likes hot dogs and likes horror movies" (\(P(A \cap B)\)) is equal to the probability that "a random horror-movie enthusiast likes hot dogs" (\(P(A \mid B)\)) multiplied by the probability that "a random person likes horror movies" (\(P(B)\)).

$$ P(A_a \mid B) + P(A_b \mid B) + ... + P(A_z \mid B) = 1 $$

A has many different sub events (for example, A is the weather, it can be rainy, sunny, or cloudy). The probability that any of A's sub events occur given B is equal to 1. In the example, the probability that it is either rainy, sunny, or cloudy given that it is 75 degrees Fahrenheit outside is 100%.

$$ P(A \cap B) + P(A' \cap B) = P(B) $$

The probability that "A and B occured" or "not A and B occurred" is the probability that B occurred. For example, the probability that "it is sunny outside and it is 75 degrees" or "it is not sunny outside and it is 75 degrees" is equal to "the probability that it is 75 degrees".

**tree diagrams** are cool/useful. (see tree diagram).

**bayes' theorem** - inverts probability requirements. Useful when \(P(A \mid B)\) is not known, but \(P(B \mid A)\) is known.

$$ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} $$
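A classic use: a screening test where \(P(+ \mid D)\) is known but \(P(D \mid +)\) is what we want. All numbers below are hypothetical:

```python
# hypothetical screening test
p_d = 0.01          # P(D): base rate of the disease
p_pos_d = 0.90      # P(+ | D): sensitivity
p_pos_nd = 0.05     # P(+ | not D): false positive rate

# P(+) via the rule P(A ∩ B) + P(A' ∩ B) = P(B) from above
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)

# Bayes: P(D | +) = P(+ | D) P(D) / P(+)
p_d_pos = p_pos_d * p_d / p_pos  # ≈ 0.154: most positives are false positives
```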

**sampling without replacement** - in small samples, this can lead to an invalidation of the independence requirement of many analyses. For example, if there are 5 red cars and 5 blue cars, and we want to determine the probability that our random sample of two cars without replacement is purely red cars, we must model P(select red car) * P(select red car | selected one red car already), rather than P(select red car) * P(select red car). (5/10 * 4/9 != 5/10 * 5/10). If the population is large relative to the sample, we can assume independence even when sampling without replacement.
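The 5-red/5-blue example, checked with exact fractions:

```python
from fractions import Fraction

without_replacement = Fraction(5, 10) * Fraction(4, 9)   # 2/9
with_replacement = Fraction(5, 10) * Fraction(5, 10)     # 1/4
# the two models genuinely disagree for a small population
```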

**random variable** - E.g. the amount of money Suzie makes from selling ammo at the flea market can be modeled as a random variable. The variable can take on many different values, and there's a different probability of each value occurring.

**expected value E(X)** - of a random variable is equivalent to its mean: \(E(X) = \sum_i x_i P(X = x_i)\). E.g. how much should Suzie expect to make from selling ammo at the flea market given all the different quantities of money she can make and their associated probabilities?

**variance σ²** - of a random variable: \(\sigma^2 = \sum_i (x_i - E(X))^2 P(X = x_i)\). E.g. what is the standard deviation of the amount of money Suzie should make from selling ammo? How much should she expect the amount of money she makes to vary from the expected value?

**linear combinations of random variables** - two independent random variables combined linearly—what is the combination's expected value and variance?
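For reference, the standard results for two independent random variables \(X\) and \(Y\) combined with constants \(a\) and \(b\) (note the constants get squared in the variance):

$$ E(aX + bY) = aE(X) + bE(Y) $$

$$ Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) $$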

**probability density function** - curve of the probability distribution such that the area under the curve equals one and x values are values the variable can take on. The area under the curve in some range is the probability that the variable will take on a value in that range.

*Originally I was going to release my notes to this book all in one post. On second thought, after seeing how long the post is getting only mid-way through chapter 5, I'm going to post my notes per-chapter.*

I am in the process of reading the 3rd edition of OpenIntro Statistics. As I read the book, I am taking notes by marking up the pdf on my tablet. I am also solving the intra-chapter exercises and the end-of-chapter problems as I read. After reading each chapter, I complete the corresponding lab and post my work and solutions on github at joshterrell805/OpenIntro_Statistics_Labs.

*Disclaimer: These are my notes from reading the book. I post them here for myself, so I can jog my memory, and for others, so they can get a quick refresher as well or get a better understanding of my experience. As I get further along in the book, I get better at indicating quotes, however I did not do perfectly throughout these notes. These are only notes, as a student might take when listening to a lecture at school. The actual book is free and publicly available at https://www.openintro.org/stat.*

Without further ado, here are my notes for chapter 1.

The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#1.

Stats is **collecting** data, **analyzing** data, and making **inferences** from analyses.

**non-response bias** - bias introduced by people self-selecting whether to respond. Volunteer surveys are not random and thus do not generalize to the population.

**convenience sample** - gathering data that is easy to obtain (e.g. from friends). Introduces bias; an easy-to-obtain sample is not generalizable to the population (where many cases may be difficult to obtain).

**stratified sampling** - break the population into groups (strata) where *the members within a group are similar to each other*; sample randomly from each group. For example, break the population into males and females, then sample randomly from each gender.
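A minimal sketch of stratified sampling; the people, the gender strata, and the function names are made up for illustration:

```python
import random

def stratified_sample(population, group_of, per_group, seed=0):
    """Randomly sample per_group members from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(group_of(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

people = [("alice", "F"), ("bob", "M"), ("carol", "F"), ("dan", "M"), ("eve", "F")]
# one random person per gender, instead of two random people overall
sample = stratified_sample(people, group_of=lambda p: p[1], per_group=1)
```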

**cluster sampling** - break the population into groups where *the groups are similar to each other*; choose a few groups (clusters) to represent the population (e.g. all employees from X McDonald's restaurants in California rather than sampling randomly from every McDonald's restaurant, which would be expensive).

**multistage sample** - same as cluster but select randomly from selected clusters rather than selecting the entire cluster.

The text notes that simple random sampling is best when possible. Extra steps need to be taken when analyzing and making inferences from these other sampling techniques. TODO: What are those other steps?

**blocking** - the population has subgroups which may be confounders (e.g. sex or health). Distribute the groups proportionally into control and treatment to control for each confounder. For example, if whether or not a person exercises influences our dependent variable (exercise is a confounder), we could split those who exercise (e.g. 20% of the sample) proportionally between the control and treatment groups. Sampling completely randomly may, by way of variance, leave either group with a disproportionate number of subjects who exercise.

**skew** - right skew = longer tail on right and mean typically > median. left skew = longer tail on left and mean typically < median.

**modes** - unimodal (one peak), bimodal (two peaks), and multimodal (multiple peaks) may be important in describing distribution.

Median and interquartile range are much more robust against outliers than mean and standard deviation. Whiskers of box plot are 1.5 * IQR away from Q1 and Q3. Any data beyond whiskers are outliers.
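A sketch of the quartile and whisker arithmetic. Note the quartile convention here (halves exclude the median for odd n) is one common textbook method; conventions vary:

```python
def median(xs):
    """Median of a sorted list."""
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(data):
    """Return (Q1, median, Q3)."""
    s = sorted(data)
    mid = len(s) // 2
    lower = s[:mid]
    upper = s[mid:] if len(s) % 2 == 0 else s[mid + 1:]
    return median(lower), median(s), median(upper)

data = [1, 3, 4, 5, 7, 8, 9, 100]
q1, med, q3 = quartiles(data)
iqr = q3 - q1
# whiskers reach at most 1.5 * IQR beyond Q1 and Q3; anything further is an outlier
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
```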

Segmented bar plots and mosaic plots are cool.

I'm interested in reading more on simulation (Monte Carlo?) and other means of determining significance in differences.

Back in high school, I taught myself to program by reading tutorials and books. A few weeks ago, I completed my B.S. in Software Engineering. I'm finally done with formal education, and I'm excited to continue learning on my own.

In this post, I summarize some of my thoughts on undergraduate college (henceforth termed "college") and explain why I am so excited to continue with self-education rather than with higher, formal education.

Because education was my primary goal with college, I judge college by what and how much I have learned. My overall impression of (undergraduate) college is this: **college is good at broadening knowledge and decent at deepening knowledge.**

College is good at introducing you to subjects you don't particularly want to learn. If you've ever seen a degree course list, you know that there are a lot of classes you're not particularly interested in. What people are interested in varies. At the time, I wasn't excited to take literature or economics. However, I did learn something from these classes. For instance, economics taught me about the sunk cost fallacy, and anthropology and literature increased my understanding of the environmental factors that shape human behavior. These classes gave me a shallow understanding of subjects I had very little experience with.

I am very glad to have taken many support classes in subjects I don't think I would have been motivated to learn about outside of college (Calculus, Physics, Statistics, Statics, Dynamics, and Combinatorics). I am happy to have learned in these subjects, as they strengthened my mathematical foundation. Without this foundation, I wouldn't have the confidence in mathematics that I now need to be a data scientist.

I went to college to become a Software Engineer. I took several classes in Software Engineering (the design, process, requirements, … of building software), several classes on programming (systems programming, intro programming 1-3, computer architecture/assembly, individual design and development), and other more application-specific classes (operating systems, databases, knowledge discovery from data, graphics, networking). These classes helped develop my understanding within the field of software engineering, and they help me build better software. I by no means feel like an expert from my education, but I do have a solid basis of knowledge in Software Engineering to move forward with.

In college you get to meet and build relationships with professors who see your efforts and abilities. These professors have connections, and some of them want to see you succeed. I am very thankful for the professors who helped me move forward outside of Cal Poly. PhD David Janzen helped me land a sweet summer research internship with PhD Emerson Murphy-Hill at NCSU. PhD Alex Dekhtyar recommended me for my awesome Data Scientist job at SentiMetrix and advised me through the beginning. Through these connections, my opportunities continue to grow. I am thankful for the professors who were my professional references in job applications, and I am also thankful for the few great teachers who have inspired me both in career and life.

College is expensive in terms of time, money, and effort. My experience with college has been a lot of work with, to be fair, less than optimal results. I spent most of college as a hard-working sponge—wanting to learn, grow, and get something more than a degree out of all my time and effort and my parents' money. Putting my best into my schoolwork was extremely demanding. It was stressful and unhealthy; I sacrificed sleep, nutrition, and exercise. That'd be fine if I reaped as much as I sowed, but I don't think that I did. There was a lot of wasted effort in college. There was a lot of useless/busy work, a lot of teachers not preparing and doing *their* best, and a lot of wasted time in lecture (stupid questions, irrelevant banter, repeating the book, …).

Overall, undergraduate college was valuable and worth the costs. Without college, I may have developed a deeper knowledge of subjects in less time, remained more healthy, and learned more in industry. However, if I taught myself for the last 5 years, I almost certainly would have been less broadly developed, I may not have discovered my passion for data science, and I may not have built the same quality and quantity of professional connections.

I am going to become an expert in data science, but I don't believe continuing for a Master's or PhD is the most effective route. From this point forward I am continuing my professional development by reading books and research papers, attending conferences, and learning in the industry.

Original paper: Integrating Classification and Association Rule Mining.

Bing Liu, Wynne Hsu, and Yiming Ma from the National University of Singapore

This research paper contributes two algorithms: one to gather *all* of the class association rules of a dataset, and another to build a classifier from a subset of the class association rules.

The authors define class association rules (CARs) as association rules where the right hand side of the rule is the class/label, and the left side is a set of feature/attribute items (1).

Unlike databases of transactions, classification datasets tend to have a huge number of association rules. Since the purpose is to find CARs, the algorithm skips association rules that are not CARs. This eliminates a huge amount of computation while still calculating the full set of CARs (1).

The CAR-mining algorithm only requires *k* passes over the dataset, where *k* is the size of the largest itemset in a CAR, so it is possible to efficiently implement the algorithm with the dataset stored on disk rather than in memory (2,3).

Some classification association rule miners mine a subset of the rules to form an accurate classifier, but the rules may not be understandable, interesting, or useful in the domain. The contributed algorithm mines all the CARs so desirable rules can be picked from the full set (1-2).

The paper mentions that the rule generation is based on Apriori. It is actually very similar to the frequent itemset generation step of Apriori, but not the association rule generation step.

Recall that a class association rule (CAR) has one or more items on the left and a single class/label on the right. CBA-RG boils down to frequent itemset generation where the itemsets must contain at least one feature and exactly one label. Given a frequent itemset with this composition, the CAR is `{feature0, feature1, .. featureN} -> {label}`.

There are some differences though, particularly:

- On each iteration (one iteration per itemset length), Apriori makes a single pass over each row in the database. For each row, Apriori iterates over all the candidate itemsets, and increments a support counter for the itemset if the itemset is a subset of the row. After passing over the entire database, Apriori promotes the candidate itemsets that meet the minimum support requirement to frequent itemsets, and discards the remaining candidate itemsets. CBA-RG turns the one counter per itemset (CAR) into two counters: one counter for just the features, and one counter for the entire rule (features + label). By adding the second counter, CBA-RG and CBA-CB have all the information needed to efficiently compute confidence.
- If two CARs have the same features but different labels, pick the CAR with the higher confidence (2).
- The paper adds an optional pruning step based on the "pessimistic error rate" which "can cut down the number of rules generated substantially" (3).

The paper contributes two algorithms for building a classifier from the set of CARs. Both algorithms build the same classifier, but with differing levels of efficiency. The first algorithm, M1, is a simple and intuitive algorithm that makes (worst case) as many passes over the database as there are rules (4). The second algorithm, M2, adds a lot of state and complexity, but reduces the number of passes over the dataset to one to two (4, 5).

Both algorithms build a classifier made up of CARs and a default class. To classify a record, the classifier serially iterates through the rules (CARs) until finding the first rule whose itemset is a subset of the record. This first rule to match labels the record with its label. If no rules match, the record is labeled by the default class.

Both algorithms are heuristic algorithms that greedily select rules using the following rule precedence: *r_{1}* precedes *r_{2}* if *r_{1}.confidence > r_{2}.confidence*, or the confidences are equal and *r_{1}.support > r_{2}.support*, or both are equal and *r_{1}* was generated before *r_{2}*. The algorithms trim away any rules that do not correctly classify at least one record, and they stop (or trim back the rules) such that each rule in the classifier strictly decreases the error of the classifier on the dataset. If adding one more rule would cause more error than simply labeling the remaining unclassified data with the default class, the algorithms stop.

M1 is *similar to* the following:

- classifier.rules = empty list
- classifier.default_class = most frequent class in the dataset
- errors = number of errors if only the default class were used
- sort rules according to rule precedence
- for each rule in sorted rules:
    - classified_records = ∅
    - for each record in dataset:
        - if rule.itemset is a subset of record:
            - classified_records = classified_records ∪ {record}
            - if rule.class = record.class:
                - mark rule
    - if rule is marked:
        - remove classified_records from dataset
        - default_class = most frequent class in the remaining dataset
        - rule_errors = total errors if classifier.rules ∪ {rule} and default_class were used
        - if rule_errors >= errors:
            - return classifier
        - append rule to classifier.rules
        - classifier.default_class = default_class
        - errors = rule_errors
- return classifier
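That pseudocode can be turned into a runnable Python sketch. This reflects my reading of M1, not the authors' code; the rule and record representations (dicts with `itemset`, `label`, `confidence`, `support`, `order`; records as (itemset, label) pairs) are my own:

```python
from collections import Counter

def rule_precedence(rule):
    # higher confidence first, then higher support, then earlier generation
    return (-rule["confidence"], -rule["support"], rule["order"])

def majority_class(records):
    return Counter(label for _, label in records).most_common(1)[0][0]

def count_errors(rules, default_class, records):
    """Errors of the serial rule list + default class on records."""
    errors = 0
    for items, label in records:
        predicted = default_class
        for rule in rules:  # first matching rule labels the record
            if rule["itemset"] <= items:
                predicted = rule["label"]
                break
        errors += predicted != label
    return errors

def build_classifier_m1(rules, dataset):
    records = list(dataset)
    classifier = {"rules": [], "default_class": majority_class(records)}
    errors = count_errors([], classifier["default_class"], records)
    for rule in sorted(rules, key=rule_precedence):
        matched = [r for r in records if rule["itemset"] <= r[0]]
        # keep the rule only if it correctly classifies at least one record
        if not any(rule["label"] == label for _, label in matched):
            continue
        remaining = [r for r in records if r not in matched]
        default = majority_class(remaining) if remaining else classifier["default_class"]
        candidate_rules = classifier["rules"] + [rule]
        rule_errors = count_errors(candidate_rules, default, dataset)
        if rule_errors >= errors:  # adding the rule no longer reduces error
            return classifier
        classifier["rules"] = candidate_rules
        classifier["default_class"] = default
        errors = rule_errors
        records = remaining
    return classifier
```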

M1 iterates over the database for every rule. This can be horrendously inefficient, which is why the authors made M2. M2 adds a lot of complexity to reduce the dataset iterations to one to two. However I'm not going to cover it in this post since it doesn't add much to concepts I'm interested in talking about. I do think this algorithm is a profound contribution of this paper, and anyone interested should definitely check it out (4)!

CBA requires discretizing the dataset before building the classifier. Therefore, the authors compared both C4.5 using discretized data and C4.5 using continuous data to CBA. On average, C4.5 discretized had higher error than C4.5 continuous, and CBA (discretized) had a lower error than C4.5 continuous (CBA performed better than C4.5). In the results, the authors halted rule-generation after 80,000 CARs, and they also compared M1 times vs M2 times (6).

A colleague at SentiMetrix recommended this paper to me. I've recently worked with using association rules for classification, and I've realized there are many ways to build a classifier out of association rules. Some decisions I encountered included whether to build a voting ensemble with the CARs, to remove training records when building the classifier, and/or to remove training records when building the rules. This paper is interesting because it precisely defines what a good classifier is by using precedence rules. It then contributes two algorithms to build a classifier using these rules, and evaluates their performance.

I think CARs have great potential for classification. In my limited experience with them, they performed just as well as (and sometimes better than) the standard classifiers. Decision trees (like the ones C4.5 builds) are related to CARs. A decision tree can be seen as many CARs, where each path from the root to a node is a CAR. However, CARs can stick to one meaningful association and leave other associations to other CARs. In a tree, each CAR shares at least one item (the root), so the classifier is restricted if the number of trees is restricted. Also, you can apply one CAR before another in a CAR ensemble, whereas with a tree ensemble you can't (easily) apply one branch of a tree before a branch in a different tree.

Association rules have an advantage over other classifiers in that they have high explanatory power in the domain. (If x, y, and z then label). I look forward to working with association rules more. There are other papers on classification with association rules which I plan to read and discuss in the future.

This post is a collection of my notes and thoughts on the research paper. I may inaccurately summarize and/or infer based on my understanding. I have likely left out important concepts in the paper. Before leaving with your impressions, please verify your ideas with the source by reading the relevant parts of the paper for yourself. I provide page numbers in parentheses. These *are not* citations, but pointers into the paper so you can find relevant sections more easily.

We recently added obstacles to Chicken Catcher—game objects which players and chickens must navigate around. In doing so, our display-object sorting algorithm broke.

In Chicken Catcher, we render images to represent game objects. In order for the game to look physically correct, if two objects overlap, the object that is closer to the camera (in game coordinates) must be drawn after the one that is further away. If one draws the object that is closer first, the game looks very odd.

To draw the images in the correct order, we needed to sort the images. The sorting isn't so simple though. We first tried to sort the images by the object's distance from the camera. However, as the next figure shows, sorting by the object's distance doesn't work out too well.

Intuitively, if we were standing at the camera position looking towards the objects, the magenta square should appear in front of the black rectangle, which should appear in front of the cyan square.

We can see that the distance between the camera and each of the quadrilaterals' centers is equal, so we can't sort by the objects' center points. We can also see that the rectangle has both the closest point and the furthest point from the camera. In any sort order using just the furthest or closest point of the objects, the rectangle would not be the middle object to be drawn.

It turns out there's already a working and intuitive algorithm to determine the order of any two non-intersecting *(objects do not pierce each other)* and overlapping *(objects occupy at least one shared point on either x axis or y axis)* objects, detailed here. This comparison algorithm takes any two objects and determines whether one should be in front of the other, or if the order doesn't matter.

This comparison algorithm looked very similar to the compare function which Array.sort expects. We implemented the comparison algorithm and sorted our array of objects with it, however our objects were still not sorted correctly. This puzzled us.

We could not find any problems with our implementation of the comparison algorithm. After some time debugging, we re-read Shaun LeBron's algorithm and found out it explicitly called for *topological* sort. After implementing a simplified version of topological sort for our objects, all objects were sorted correctly!

After getting things to work properly, even among all the other important things I had to do, my mind anxiously pondered why Array.sort didn't work but topological sort did. The first thing I did was think up the simplest set of objects where sort would not work.

Standard sorting won't work on these objects under some conditions. Using the isometric display-order algorithm we can calculate:

**magenta > black** — magenta is in front of (greater than) black
**black > cyan** — black is in front of (greater than) cyan
**cyan = magenta** — the sort-order of magenta and cyan is irrelevant (equal)

I figured out that if we tried using quicksort, the sort wouldn't work in multiple cases:

E.g. magenta as pivot: [magenta, cyan, black] -> [black, magenta, cyan]

E.g. cyan as pivot: [magenta, cyan, black] -> [magenta, cyan, black]

Black as pivot (always correct order): [magenta, cyan, black] -> [cyan, black, magenta]

After more pondering, I figured out that the transitive property does not hold for our comparison algorithm. If the transitive property held for these objects we could say:

black **≥** cyan **&** cyan **≥** magenta **->** black **≥** magenta

…but the implied *black ≥ magenta* is wrong. Black is not greater than nor equal to magenta, black is less than (behind) magenta.

It turns out that the transitive property must hold for comparison sort (which is what JavaScript's Array.sort is) to work. Our comparison algorithm did not obey the transitive property, therefore comparison sort did not work.

The solution was to use topological sort. Our topological sort created a graph where objects are nodes and an edge from a to b means a is in front of b. Then we traversed the graph using a post-order depth-first traversal to sort the array such that the objects displayed behind other objects came first in the array.
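A minimal sketch of that topological sort (the function and variable names are mine, not from our codebase, and it assumes hashable objects and no ordering cycles):

```python
def draw_order(objects, in_front_of):
    """Sort objects back to front. in_front_of(a, b) is True when a
    must be drawn after (in front of) b."""
    # build the graph: behind[a] lists the objects drawn behind a
    behind = {obj: [] for obj in objects}
    for a in objects:
        for b in objects:
            if a is not b and in_front_of(a, b):
                behind[a].append(b)
    order, visited = [], set()
    def visit(obj):
        # post-order DFS: everything behind obj is emitted before obj
        if obj in visited:
            return
        visited.add(obj)
        for other in behind[obj]:
            visit(other)
        order.append(obj)
    for obj in objects:
        visit(obj)
    return order

# the three quadrilaterals from the example above:
# magenta in front of black, black in front of cyan, cyan/magenta unordered
pairs = {("magenta", "black"), ("black", "cyan")}
order = draw_order(["magenta", "cyan", "black"], lambda a, b: (a, b) in pairs)
# order is ["cyan", "black", "magenta"]: back to front, as intuition says
```

Unlike a comparison sort, this only ever asks "must a be in front of b?" for pairs that actually constrain each other, so the non-transitive comparisons never get chained.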

Even though the solution already existed, it was a useful exercise to troubleshoot why comparison sort didn't work. I experienced first-hand why sort can only work with comparison functions that comply with the transitive property. This exercise was also a reminder to have a good understanding of the algorithm before jumping into implementation.

I am transitioning my blog from my homemade website and themes to a static site generated by Hexo.

My old blog was dynamic. If a logged out user was visiting my site, they'd be denied from viewing private posts. If I was logged in, I'd be able to see all posts. I would see the actions bar allowing me to edit posts, make them private or public, and create new ones.

This new blog is static. Anyone who views this website, including me, sees the same thing and has the same functionality available to them in their browser. If I want to add new content, I have to edit files and regenerate the website.

So, why did I replace the old dynamic website with this static one? The answer is: user interface. I wanted a better document index, the ability to search for documents, and a mobile-friendly user interface. However, I'm not very skilled at writing user interfaces. I don't want to re-invent the wheel, nor spend tons of time integrating my old website with new themes. I want to spend my time on what I care most about and use a blogging solution that does what I want out of the box.

Writing my old blog was not wasted time. I gained experience with several technologies and protocols by writing the dynamic website. Before writing it, I hadn't dealt with nginx. I gained experience creating and installing SSL certificates and maintaining security using OAuth2 and CSRF tokens. Even though my focus is on data science now, learning these technologies aids me in understanding what colleagues expect and need. It gives me the skills to tinker with websites, create servers, and help others.

My girlfriend and I are creating a mini game for her online community, Windlyn.

The game is called *Chicken Catcher*. The objective, as you may have so cleverly inferred, is to catch chickens. If you catch all the chickens before the time runs out, you proceed to the next level.

We've been putting in a few hours here and there over the last couple of months. You can see our progress at chicken-catcher.joshterrell.com.

This summer I worked as a research intern at North Carolina State University. I researched under Dr Emerson Murphy-Hill for the Developer Liberation Front. My goal was to get some experience with research and decide whether to commit to a PhD.

The internship was awesome! I spent most of my time writing software and building databases to answer unanswered questions. However, after lots of reading, conversing, and thinking, I decided against doing a PhD.

This delineated my choice: Research is about inventing new ways to do things and discovering new knowledge for the purpose of extending human knowledge. Engineering is about applying tools and knowledge to build things that serve some human purpose (e.g.: entertainment, security, health).

Both research and engineering are constructive, engaging, and rewarding professions. I've seen some drawn towards one, some drawn towards the other, and people on both sides who believe their profession is superior. Making the choice was difficult because I see both research and engineering as fulfilling paths.

In my internship, I was both an engineer and a researcher. I built software and databases using existing methods (engineering), and I used this software to contribute new knowledge to the field (research). For most of the rest of my profession I've been an engineer. I've built software to help people and written tests to increase the reliability of that software.

I was originally interested in doing a PhD because I saw it as a way to become an expert. It's true, PhD graduates do become experts at something, but that is not the degree's purpose. The purpose of a PhD is to conduct research, and that is not my goal.

I want to be an expert at what I do. I want to build great software—to apply the research for the good. I don't need a PhD for that, and I don't think a PhD is the best way to achieve that goal. I can more effectively become an expert by learning from colleagues, reading papers and articles, and building software.

I'm also interested in working part-time, so I can spend lots of time with my family and friends, get ample sleep, work out, and homeschool future children. According to my observations, research requires a lot more time than I want to dedicate to my career. My investigation has led me to believe that not pursuing a PhD will most likely bring me and my loved ones the most happiness.

Here is a collection of some of my favorite quotes. I update this post periodically.

Even though we don't cause 100% of our circumstances, we are responsible for them, and we involuntarily experience their effects. We have the ability to change almost all of our circumstances—we can experience the effects of different circumstances. — Unknown

If the only tool you have is a hammer, you tend to see every problem as a nail. — Abraham Maslow

The good life is one inspired by love and guided by knowledge. — Bertrand Russell

A decision made for life is not made once. A decision made for life is a decision made every day. — Unknown

There are two sides to every story. — Unknown

(context: drugs, junk food) Sure, I can do it in moderation, but why? Why would I want to ruin my body in moderation? Why not treat my body as best I can? — Unknown

People are taught, believe, and perpetuate so much bullshit as fact. — Unknown

We are what we repeatedly do. Excellence, then, is not an act, but a habit. — Aristotle

Being the best is not about perfection; being the best is about incessant improvement. — Unknown

I'd rather be hated for who I am, than loved for who I am not. — Kurt Cobain

The greatest of all weaknesses is the fear of appearing weak. — Jacques Benigne Bossuet

We don't choose what to believe; our experiences dictate our beliefs. — Unknown

If it will hurt now and it will hurt more later, you're better off doing it now. — Pragmatic Programmer (p187) (reworded)

The grass is always greener on the other side because it is fertilized with bullshit. — Unknown

Don't be so goddamn afraid of wasting your time. Walk for the sake of walking. Read for the sake of reading. Lift for the sake of lifting. Because, in the end, what else is life than a collection of wasted times? — Dr. Bojan Kostevski