What data do you use, and where does it come from?
MLB releases detailed data for every pitch of every game. Each morning, my program grabs all of this pitch-by-pitch data from the previous day’s contests. Within the data, each pitch is assigned 89 attributes, from the pitcher's release position to the pitch’s horizontal acceleration. We care about 5 of those 89 values. Two are the pitch’s horizontal (plate_x) and vertical (plate_z) position as it crosses the plate. Two are measures of the bottom and the top of the strike zone (sz_bot and sz_top), values that reflect the size of the zone once adjusted for batter height and stance. The fifth is the call the umpire made on the pitch. Together, these 5 values tell us whether a pitch was a strike or a ball, and whether or not it was called correctly.
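To make the idea concrete, here is a minimal sketch of how those 5 values could be combined. It is an illustration under stated assumptions, not the exact method used for the scorecards: the units (feet), the 17-inch plate width, the ball radius, the rule that a pitch touching the edge of the zone counts as a strike, and the call labels "called_strike" and "ball" are all assumptions, and the names Pitch, in_zone, and call_is_correct are hypothetical.

```python
from dataclasses import dataclass

# Assumptions, not taken from the text above: positions are in feet,
# home plate is 17 inches wide, and a pitch counts as in the zone when
# the ball touches it. That rule is approximated here by widening the
# zone by one ball radius (roughly 1.45 inches) on every side.
PLATE_HALF_WIDTH_FT = 17 / 2 / 12
BALL_RADIUS_FT = 1.45 / 12


@dataclass
class Pitch:
    plate_x: float  # horizontal position at the plate, 0 = center of plate
    plate_z: float  # height of the pitch as it crosses the plate
    sz_bot: float   # bottom of the strike zone for this pitch
    sz_top: float   # top of the strike zone for this pitch
    call: str       # "called_strike" or "ball" (label strings are assumed)


def in_zone(p: Pitch) -> bool:
    """Return True if the pitch location falls inside the widened zone."""
    inside_horizontally = abs(p.plate_x) <= PLATE_HALF_WIDTH_FT + BALL_RADIUS_FT
    inside_vertically = (
        p.sz_bot - BALL_RADIUS_FT <= p.plate_z <= p.sz_top + BALL_RADIUS_FT
    )
    return inside_horizontally and inside_vertically


def call_is_correct(p: Pitch) -> bool:
    """Return True if the umpire's call agrees with the pitch location."""
    return (p.call == "called_strike") == in_zone(p)


# Example pitches: one down the middle, one far outside called a strike.
middle = Pitch(plate_x=0.0, plate_z=2.5, sz_bot=1.5, sz_top=3.5, call="called_strike")
outside = Pitch(plate_x=1.5, plate_z=2.5, sz_bot=1.5, sz_top=3.5, call="called_strike")
```

Because sz_bot and sz_top are read from each pitch rather than fixed per batter, the same location can be a strike on one pitch and a ball on another.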
Why doesn’t your data match with what I saw on TV?
There are three reasons. First, the on-screen box is not exact. Camera angles and shake can alter the position of the TV strike zone relative to the true strike zone. Second, TV boxes often don’t correctly represent the ball itself. The dot that the TBS electronic strike zone uses to represent the ball, for example, is much smaller than an actual baseball. Finally, it is unclear which, if any, TV broadcasts adjust the virtual strike zone for batter stance on a pitch-by-pitch basis.
Why doesn’t your data match with what I saw on this other online graphic?
Strike zone graphics from Baseball Savant, ESPN, and other sources do not adjust the top and bottom of the strike zone between at-bats. MLB's Gameday feature adjusts between at-bats but (seemingly) not between pitches. In contrast, the @UmpScorecards strike zone plots use the strike zone recorded for each individual pitch, resulting in differences relative to the graphics of other providers. For more information on how this works, click here.
Why doesn’t the archive data always match with the Twitter graphics?
Unfortunately, MLB does a small amount of post-processing of each game's data, which can have a small (but noticeable) impact on results. The Twitter graphics are not updated to reflect those changes, so each time the data on the site is fully refreshed, the archive may no longer be fully aligned with the graphics.
Why is some data on the archive marked with an * or marked as ND?
As of v3.0.0, the @UmpScorecards platform tracks games that have erroneous or missing pitch data to ensure that game counts (the number of games each team has played, for example) are correct across the site. On the Games tab, such games are marked with an * if 5 or fewer pitches are missing data, and with ND otherwise.
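The marking rule can be written as a short function. This is only a sketch: the function name archive_marker is hypothetical, and the assumption that a game with no missing pitches gets no marker at all is mine rather than something stated above.

```python
def archive_marker(missing_pitches: int) -> str:
    """Return the Games tab marker for a game, given its count of pitches with missing data."""
    if missing_pitches <= 0:
        # Assumption: complete games carry no marker.
        return ""
    if missing_pitches <= 5:
        # 5 or fewer pitches missing data: flagged with an asterisk.
        return "*"
    # More than 5 pitches missing data.
    return "ND"
```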