Z-Scores in Python, Part 1

Ready to get your nerd on?

I’ve decided to start playing around with programming. I took a couple computer science classes in college, and I’ve always wanted to get better at it, plus I’m about to start work at a software company, so I guess it’s a good thing to be good at. Obviously, I’m building baseball-related programs, and ideally, I’ll get good enough at this that I can start doing some real research, or at least build some handy tools for fantasy baseball.

Full disclosure before I start: I’m not very good at this, so I’m going to basically be walking through some of the things I’m doing and what I’m struggling with. If you know nothing about programming, then this will probably be very uninteresting for you. If you are an expert software developer, then I encourage you to laugh at how badly I’m screwing up the simplest code. If you’re a nice expert software developer, then after you are done laughing, I would appreciate any advice you might have about my particular code, or any other baseball-related projects that you think would be good practice for me.

Ok, so for the past few days, I’ve been fiddling with two different projects. The first is fantasy baseball related. Usually when I’m making my preseason rankings, I download an Excel projections spreadsheet, then I make a bunch of columns with z-Scores for the categories I want, total them, and adjust for position and the like. It’s a painfully long and arduous process, so I want to create a program that does it for me.

There are a lot of ways I could do this, and I probably chose the worst one. I decided to use mySQL, and after hours and hours of figuring out how to install the mySQL for Python module thing on my computer, I finally got it working and started coding. First I had to write a script to move the data (ZiPS hitting projections for 2012) from the .csv file to an SQL table, with columns for every stat in the projection sheet. Easy enough.

Then came the fun part. My end goal: have a .csv file with all the projections for the categories of my choice, plus z-scores for each category, total of the z-scores, somehow normalized and adjusted for position. So how the hell would I do that?

Here’s what I did. I’m realizing now that I went about this project in probably the entirely wrong way, so right now I’m at the point where I can either keep moving forward or start over using a different method.

  1. Got data from the SQL table based on inputted stat categories, and created a dictionary with the keys being player names and the values being lists containing their projections in the order that the stats were inputted.
  2. Iterated through the dictionary to create a “statPopulations” list of lists, each list containing every value for a certain category.
  3. Iterated through the dictionary again, this time creating z-scores (using statlib) for each category for each player and appending that to the dictionary entry. For batting average, since it is a rate stat, added another entry that was the z-score times number of AB (called zBATimesAB)
  4. Iterated through the dictionary a third time (you see why I think this was a bad way to do it now) to find z-scores for zBATimesAB and added that to each dictionary entry.
  5. Used csv.writer to create a header row with stat names and a row for each player, containing each of the original projections plus z-scores for each one and the extra two calculations for batting average.

After a lot of fiddling and error handling (banging my head on the computer), it somehow worked! I created a .csv file, which I opened in Excel, that looked just how I wanted it to look. Just a few huge problems though.

First of all, it is a horribly inefficient program, and takes about 20 seconds to run.

Secondly, I don’t have the z-score totals there, and I really don’t want to have to iterate through the dictionary again to make it. This makes me think that building a dictionary wasn’t the best course of action.

I also still need to adjust for position, and figure out what my population is. I have about 1000 players so far, but obviously not all of them are getting drafted. I need to figure out who is getting drafted, and then calculate z-scores based on that. But that’s going to depend on position and the z-scores themselves! Yeah, it’s a pain.

So that’s where I’m at with this. I’m happy that I’ve managed to actually write a program that works to some extent, but I have a feeling that if I really want it to be useful, I’m going to have to rewrite it completely. Using mySQL might be completely pointless, and building dictionaries instead of lists might be wrong too. I have a feeling that I either need to use SQL the whole time, adding z-scores to the SQL table instead of making dictionaries, or scrap SQL altogether and figure out a better way to transfer the original data to Python form.

Wow, I can’t believe I just wrote all that. Well, if you got through it all and have any suggestions, questions, or comments for me, let me know. Sometime later, I’ll talk about the other project I’ve been messing with: simulations (also related to fantasy baseball – I have no life)!

Advertisements
Tagged , , ,

5 thoughts on “Z-Scores in Python, Part 1

  1. Tommy Hunter says:

    Unfortunately, true z-score is an inefficient function. But the samples are small enough that it’s fine to still use it. However, I would suggest using MySQL to help you calculate them. All you need for the z-score is mean and std-dev. You can run 2 simple query:

    SELECT STDDEV_POP(column1), STDDEV_POP(column2), (…) FROM STAT_TABLE;
    SELECT AVG(column1), AVG(column2), (…) FROM STAT_TABLE;

    Then you only have to do one iteration in Python to calculate the player’s individual z-score.

    And I agree, one of the tough questions I had to think about was “who should I calculate?”. Overall, it doesn’t matter that much for the results since the ordering should still be the same, it’ll just be skewed. But in terms of efficiency, that’s a real problem. You could use a random sample to calculate the std_dev and mean, and that way you aren’t doing multiple iterations through the entire 1000 people. Search around on google, there are probably efficient MySQL scripts for efficient std_dev samples.

    And I think SQL is the way to go. It’s a good thing to learn, and it’s all more extensible for future applications or features.

    One last thing, I’m wondering if zBATimesAB is even necessary. I’m just trying to think through it, and it seems like having Hits z-score and AVG z-score is enough in most cases. But I’m not really sure, so what were you hoping to do with it?

    • hunterm726 says:

      Thanks for the advice Tommy. Yeah, the more I think about it, the more SQL makes the most sense.

      Well, the ordering wouldn’t necessarily be the same for different sized sample. Say players 300-1000 all had relatively average numbers of SBs, but very low HR totals. Then if you calculate z-scores with all 1000 players, those with a lot of SBs will get penalized, and those with a lot of HRs get rewarded, even though those lsat 700 players weren’t actually part of the sample. On Excel I fixed this by only using like 300 players as the population and just sorting the zTotal until the ordering stopped changing, but like you said, that would be pretty inefficient to do on Python/SQL probably.

      I had just read that for rate stats, since a .330 AVG with 600 ABs is more valuable then the same AVG with 300 ABs, you need to multiply the AVG z-score by the number of ABs, then take the z-score of that. The alternative is to convert AVG to a counting stat like this: (Hits – (Mean Hits/Mean AB) * AB), then finding the z-score of that. Maybe that would be easier since you only need to calculate one z-score instead of two.

      • Tommy Hunter says:

        Ah, yeah. I was thinking in terms of individual stats it shouldn’t change the order, but in a more complete application, you’re right.

        I still don’t know the best way for a “rare” stat, but I was thinking, could you do something like this. zBAandAB = (zBA+zAB)/2 ? Since both parameters are z-indices (which are hopefully “normally” distributed about 0), then the result should be on the same scale as z-scores, and it would combine the two. Maybe it wouldn’t be a super precise combination, but it would pull back people who have low scores in zAB. You’re the philosophy major, is there a flaw in my logic? Cause that actually seems like a pretty decent compromise to me.

      • Matt Hunter says:

        That sounds ok, but I’m not sure it would actually be accurate. For one thing, zAB isn’t normally distributed, since most starters will have similar amounts of ABs, but guys that might not start all the time will have way fewer ABs, the distribution is left-skewed (left? right? can’t remember which to use). Also, I would think that zBA should be weighted more than zAB, though maybe not. I think I’d rather go with the more complicated solution for accuracy.

  2. […] Last week, I talked about my experimenting with creating a z-score spreadsheet for fantasy baseball projections. The other project that I’ve been working on is, or at least will be, a little more theoretical. Well, I’m not sure theoretical is the right word for it. Let me explain. […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: