Highlights Thus Far

Posted by jeremy on 27 April, 2010 13:38

A slide show presentation is available on Google Docs [http://docs.google.com/present/view?id=d7jdr6z_418f34gnbss]


Part 7: Data, Translation: the revival of almost dead perl skills, GIS, and more.

Posted by jeremy on 26 April, 2010 23:43

This week, I continue working with the outWord data set.

The goal for the week is to finally realize the long-held goal of distinguishing between users who walk and users who drive or ride.  While it is impossible to fully discover the ground truth of the matter, we can glean a considerable amount from the data.  I am working with the submitted words, which carry both a coordinate pair (of the place where they were created) and a timestamp.  The basic idea is that to check whether individuals were walking, I need to calculate the time, distance, and speed between their consecutive words. If the speed or distance is too high, walking is unlikely. After attempting to construct a join to rule on joins but struggling with the proper syntax, I sought help with R's enumerator class.  This proved to be more trouble than it was worth, so I broke down and went with our old friend, perl.  I took the initial data set, and with some relatively simple looping and hash walking, I was able to create a data set that I could work with, containing the timestamp and lat/long coordinate pairs for both the current word and the previous word.

perl script
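For comparison, the same pairing of each word with the same user's previous word can be sketched in base R. This is only a sketch, not the perl script itself: the column names (user_id, ts, lat, lon) and the toy values are assumptions made for illustration.

```r
# Toy data; the column names (user_id, ts, lat, lon) are assumptions.
words <- data.frame(
  user_id = c(1, 1, 1, 2, 2),
  ts  = as.POSIXct(c("2010-04-01 10:00:00", "2010-04-01 10:05:00",
                     "2010-04-01 10:20:00", "2010-04-02 09:00:00",
                     "2010-04-02 09:30:00"), tz = "UTC"),
  lat = c(42.280, 42.281, 42.290, 40.710, 40.720),
  lon = c(-83.740, -83.741, -83.750, -74.000, -74.010)
)

# Sort by user and time so that "previous row" means "previous word".
words <- words[order(words$user_id, words$ts), ]

# Shift a vector down by one position (used within each user's rows).
lag1 <- function(x) c(NA, head(x, -1))

words$prev_lat <- ave(words$lat, words$user_id, FUN = lag1)
words$prev_lon <- ave(words$lon, words$user_id, FUN = lag1)

# Seconds since the same user's previous word (NA for a user's first word).
secs <- as.numeric(words$ts)
words$dt_secs <- secs - ave(secs, words$user_id, FUN = lag1)
words
```

Each user's first word has no predecessor, so its previous coordinates and time difference come out as NA and can be dropped before computing speeds.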

From here, I needed to calculate the distance (in time and space) between the points.  I knew that it was a bit more complicated than rise over run, as we do not live on a flat plane.  I looked at a couple of Perl modules that calculated distance between two geographic points, and found them lacking either in documentation or support, so I asked for a bit of help.

This beautiful picture shows the words (the red dots) all over the world.  Compare it to the one that R created earlier in the term and you very quickly recognize that tools that are good at lots of things are very easily trumped by tools that specialize.  This image was generated with ArcGIS, using a WGS 1984 projection.  Justin proved a fantastic help here: I casually asked for advice on how to best calculate the distance between two geographic points when I ran into him at Ambrosia, and he told me to swing by his desk at the GIS Reference desk in the library.  Because the earth is kind of a deformed spheroid (egg shaped with bulges), it's important to translate the lat/long coordinates to take that shape into account. Beyond making this pretty picture, ArcGIS, Justin told me, took the distance of each point from a given point in space!  From this number, I was able to calculate the distance between the two points by calculating the hypotenuse of the right triangle formed from the three points.  Notably, because the points are all over the world, the point in space is estimated and introduces some error in the calculations.  However, because the distance between the points would generally be small (save when our intrepid outWorders fly between words), the error rate shouldn't be so high as to matter much.  (I hoped.)
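A common alternative for short hops like these is the haversine great-circle formula. To be clear, this is not the ArcGIS approach described above but a swapped-in technique, sketched here in base R. It assumes a perfectly spherical earth with a mean radius of 6,371,000 m, so it ignores exactly the bulges that the GIS tools account for.

```r
# Great-circle distance in meters using the haversine formula.
# Assumes a spherical earth of radius 6,371,000 m, which ignores the
# flattening that a GIS tool accounts for.
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius * asin(pmin(1, sqrt(a)))
}

# One degree of longitude along the equator is roughly 111 km.
haversine_m(0, 0, 0, 1)
```

The function is vectorized, so it can take whole columns of current and previous coordinates at once.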

 

Once I had these distances in meters, I could combine them with the difference in time between one word and the word the same user made immediately before it to calculate the land speed between words for a user.  As is true of most of the dataset, the speeds were skewed, and a few outliers{1} made the data set difficult to analyze as it stood. I then took a stab at heuristically associating different speeds with possible modes of transport.  People generally walk at 2-4 mph, ride bikes at about 15 mph, drive at up to about 80 mph, and fly beyond that.  These are really just estimates, though, and further analysis is necessary to draw any conclusions. First and most importantly, I did not cut out entries where too much time passed between words.  This will lower the computed speeds, probably considerably, and skew the numbers: an individual who drives to a location but waits long enough will fall into the bike or walk category.  Second, and relatedly, there is a great deal of overlap in the categories: in urban areas, it's difficult to drive much faster than walking.  Third, the categories themselves are ambiguous: driving may actually be riding the bus, commuter rail, or similar.  With these caveats noted, the following histogram shows a first cut at the counts for each type of travel.  I'll update it when I have a bit of time.
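The speed heuristic can be sketched with cut(). The break points below are hypothetical, chosen only to match the rough figures in the paragraph above; they are not the categories actually used for the plots, and the function names are made up for this example.

```r
# Convert meters and seconds between consecutive words to miles per hour.
mph <- function(dist_m, dt_secs) (dist_m / dt_secs) * 2.236936

# Hypothetical cut points based on the rough figures in the text.
speed_breaks <- c(-Inf, 4, 15, 80, Inf)
speed_labels <- c("walk", "bike", "drive", "fly")

classify_speed <- function(dist_m, dt_secs) {
  cut(mph(dist_m, dt_secs), breaks = speed_breaks, labels = speed_labels)
}

# 400 m in 5 minutes, 5 km in 10 minutes, 800 km in 1 hour
classify_speed(c(400, 5000, 800000), c(300, 600, 3600))

# The same 5 km with a two-hour gap between words now looks like walking,
# which is the time-gap caveat in action.
classify_speed(5000, 7200)
```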

 [todo: make sequential, re-order plot, make coloring sequential]

 

The next graphic explores one particular user's movements.  Each of the words is represented by a hollow point, colored by the guessed type of movement, same as above.  Notably, I also shifted their location.

And here's the same graphic, but I took the additional step of removing the driving data.

Here, we see the same player's location graphs, but faceted by week (x-axis, ascending) and word value (y-axis, descending).  So the hardest words to make are at the bottom, and as we can see they don't really appear until the later weeks.  The points are again colored by type of locomotion, with green representing probable walking.  

> shiswords <- ggplot(data=player1033c, aes(player1033c$nlat, player1033c$nlon, colour=player1033c$spd2f))

> shiswords + geom_jitter(position=jit) + scale_colour_manual(values= c("grey", "red", "orange", "green", "purple", "purple", "grey" )) + facet_grid(valf ~ weeks)

> ggsave(filename="latlong-smallmults-xWeeks-yValue.png")

 

My next steps will be to re-examine the script (see {1}) and re-evaluate the categories, as well as including the time consideration (factoring out things that take too long) in determining speed.

 

 

{1} Notably, one of the outliers reached a speed of Mach 659. That is approximately the velocity at which the solar system revolves around the center of the Milky Way galaxy.  Clearly, the script wasn't perfect, and part of my future work will be determining what's going on there.
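A quick check of that comparison, assuming a speed of sound of about 343 m/s (air at roughly 20 C); the Sun's orbital speed around the galactic center is commonly quoted at somewhere around 220 to 230 km/s.

```r
# Assumption: speed of sound in air at about 20 C, in meters per second.
speed_of_sound <- 343
mach <- 659

outlier_kms <- mach * speed_of_sound / 1000   # kilometers per second
outlier_kms
```

That works out to about 226 km/s, so the comparison holds up, and the script clearly does not.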

 

Part 6: Polishing a graphic

Posted by jeremy on 26 April, 2010 19:43

This week, I continue working with the outWord data set.  I hope to create a more polished version of a graphic this week.  I'll work with the boxplot from week 5.  

 

> notNewbies$letf <- factor(notNewbies$letterCount)

> bxplus <- ggplot(data=notNewbies, aes(notNewbies$letf, notNewbies$game_secs, fill=notNewbies$letf, colour=notNewbies$letf)) 

> bxplus + geom_boxplot() + scale_y_log10() 

It's a good candidate because it is actually hiding data.  As you can see, the boxes are largely without medians.  While I'm pretty sure that's simply because the data isn't normally distributed and they are jammed up against the lower hinges, this is a good opportunity to prove it.  I'm going to borrow heavily from Mick and Clint's experiments in cleaning up the boxplots, with a couple small changes.

 

The polishing script follows.  Let's take a closer look. 

> bxplus + geom_boxplot() + scale_y_log10()  

#same

+ scale_fill_brewer(palette=3)  

 # the fill color is the third ColorBrewer palette.  it's divergent, and relatively pleasing to the eye

+ scale_colour_brewer(palette=3)

# this sets the color of the outline the same as the fill.  note that if you aren't using bars, this is the command for ColorBrewer 

+ stat_summary(fun.y = median, fun.ymax = median, fun.ymin = median, geom = "crossbar", colour = "white")

#here we're setting the medians to white crossbars.  this will make them distinguishable from the background. 

+ opts(plot.title = theme_text(size = 20,face = "bold",family = "Lucida Grande"), title="Letter Counts for Time in Game") + ylab("time\n(secs)") + xlab("Letters Held")

 #titles and such are all embedded in the opts command.  we can even set the text (Thanks Kevin).  Note the \n in the ylab string.  that adds a line break which is handy. 

> ggsave(filename="lettercountsfortimeInGameRevised.png")

#same as it ever was.

Part 5: the most dangerous profession, or how i learned to stop worrying and continue counting letters

Posted by jeremy on 13 April, 2010 08:39

To reacquaint:

 We're continuing to deal with the users data, to divine problems with the game.

One thing to note is that I've taken out part of the data, to deal with the cemetery problem.  This is a variant of the Princeton cemetery problem, referenced in this article by Wainer, Palmer, and Bradlow, '98.  It talks about selection anomalies, and discusses the finding that one R. R. Madden, a statistician, calculated lifespans of various professions by looking at the gravestones in a cemetery.  He found that the most dangerous profession was: Student.  Below is a graph of age at death plotted by year of birth for that same cemetery, which shows a disconcerting drop in age of death as we get to the present date.

 

 Disconcerting until one considers what the limitations of a headstone survey might be.  While it's likely clear, I like the way that the article phrases it:

"Obviously, the reason for the decline is nonrandom sampling.  People can not be buried in the cemetery if they are not already dead." Wainer et al, 98

At any rate, this is a long discussion to say that I cut out the last few weeks of play for that same reason. 
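The effect is easy to reproduce with a made-up population. In this sketch every birth year has exactly the same lifespans (20, 50, or 80 years, all hypothetical), yet a survey of headstones still shows age at death falling for recent birth years.

```r
# Hypothetical population: in every birth year, people die at 20, 50, or 80.
cohorts <- expand.grid(birth_year = c(1900, 1940, 1970),
                       lifespan = c(20, 50, 80))
cohorts$death_year <- cohorts$birth_year + cohorts$lifespan

# A headstone survey in 2000 can only see people who have already died.
survey_year <- 2000
buried <- cohorts[cohorts$death_year <= survey_year, ]

# Mean age at death per birth year, as the survey would report it.
observed <- tapply(buried$lifespan, buried$birth_year, mean)
observed
```

The true mean lifespan is 50 for every cohort; the apparent decline comes entirely from who is available to be counted. Players who only just started are the game's equivalent of the people not yet in the cemetery.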

> letts = read.delim(file="hw4/letters-gametime-middleSat.txt", sep="\t", header=TRUE) 

> notNewbies <- letts[letts$last_refresh<(as.Date("2010-01-15")),]  

letterdata1-30-10.txt  and letterdata1-30-10-latest-removed.txt 

I'm not completely satisfied with the last part's foray into the behaviors of users who quit.  I think the conclusion is valid, but I don't like having to cut the data to really see it.  I'm working with letter count, a factor indicating the number of letters a user has at the time of the snapshot, and game_secs, a measure of the time the user has spent in the game.  So we have a continuous variable (game_secs) conditioned by a discrete variable.  This is really where box and whisker plots and frequency plots should show the difference, but because the data is skewed (and in a way that log transforms aren't helping) the conclusion doesn't really pop.  This week, I'll play with stats to see if I can figure it out. 

Before I work on boxplots and density, I'll start with histograms.  To put it in graphics, my concern with last week's findings is this:

> notNewbies$lcount <- notNewbies$count.objects.object_id

> mine <- ggplot(data=notNewbies, aes(x=notNewbies$lcount))

 > mine + stat_bin(aes(x=notNewbies$lcount, y= ..density..), binwidth=0.1, geom="bar", position="identity")

> ggsave(filename="histdensity-all-have-higher-dist-at7.png")

 

While it's true that the lower play-time users appear to exhibit the character discussed, it also appears true of the whole (but as the last histogram last week showed, if we only look at the data for users above a week's play, the pattern disappears).  By applying statistics, specifically density, the pattern becomes clearer without cutting.  The pattern begins to emerge in this heat map, which, through the use of scale gradients, shows that while the bulk of the action is toward the lower play times, there does seem to be more around 1 and 7 letter counts. But let's keep going.

 

> m2 <- ggplot(data=notNewbies, aes(notNewbies$lcount, notNewbies$game_secs))

> m2 + stat_density2d(geom="tile", aes(fill= ..density..), contour=F)

> last_plot() + scale_fill_gradient(limits=c(1.5e-08,2.6e-06))

> ggsave(filename="note-this-stratified-gradient.png") 

Here, using a 2d hex binned histogram, the pattern is much clearer.  We can see that while there is more data under 500,000 seconds (about 6 days), it's clear that what's there has significantly more density at 1 and 7.

 

> m2 + stat_binhex() + xlab("Letter Count") + ylab("time played (seconds)") 

> ggsave(filename="itshexed-hist-of-intensitys-letts-game-secs.png")

 

 On to boxplots.  This blog has a jitter over boxplots, which is a jump start.

 

 > qp <- qplot(factor(notNewbies$letterCount), notNewbies$game_secs, geom=c("boxplot", "jitter"))

> qp

> ggsave(filename="jitterbox-lettercountVgamesecs.png")


 
 
 > bx <- ggplot(data=notNewbies, aes(factor(notNewbies$letterCount), notNewbies$game_secs))
> bx + geom_boxplot() + scale_y_log10() 
> ggsave(filename="bigboxesofconfusion.png")
 
more to come... 
 

Part 4: Building it up (and a long data preamble)

Posted by jeremy on 08 April, 2010 18:58

In this part, I will methodically build up one plot through multiple iterations.  Some of these will be shown.

 

We'll still be looking at the user data from outWord, attempting to glean what we can about our users from the words they build and the way they do so.  This week, I will explore the characteristics of users who did not stick with the game long.

A Data Preamble 

This week, after seeing Ben's explanation of using sqlite3 as a db, I felt inspired to do the same.  R has a SQLite package too, which is loaded with:

library(RSQLite)

which loads the DBI package for R as well.  
 
Ben's instructions were for python, though, and I am a bit more familiar with perl.  Lo and behold, the internet has provided those too.  I have had mixed results with installing from CPAN, perl's home for modules, but it actually worked like a charm this time.

at terminal: 

sudo perl -MCPAN -e shell
cpan> install DBI

cpan> install DBD::SQLite 

cpan> install SQL::Translator 

 Then to test, I ran their sample script, which I've turned into a little pl.  It worked beautifully.  Then I went to my data.  I pulled a backup from a day that I selected as being after a lot of people had downloaded the game, but while the game was still very early (looking at the histogram of first log-ins from last time).  I discovered, much to my displeasure, that the conversion from MySQL to SQLite is not clean.  One solution involved shell scripting, which I considered trying.  Then, through the SQLite docs, I discovered SQL Fairy, a group of perl modules that (in theory) do the same.  

However, even after much trying, I was not able to get the SQL translator to read the MySQL dump and translate it to something that SQLite didn't choke on.  Instead of throwing good time after bad, I chose to explore a different route.  

So we're back to playing with flat files.  On the upside, I was able to get the data from the dump file into a MySQL database outside of production, on a server that I have full rights to, that I could monkey with.  This was also not without difficulty.  Not wanting to go through the process of setting up and configuring a server on my computer, I decided to try my hand with my existing hosted account with GoDaddy.  GoDaddy provides reasonable hosting at a reasonable price, though it definitely has its limitations.  One major benefit of this was the presence of phpmyadmin, a piece of software that allows SQL queries with a bit more feedback and reference than the command line.  While phpmyadmin won't win me any nerd points, it is incredibly useful.  Besides, the time and patience devoted to dealing with this particular part of the project had stretched to consume other, potentially more valuable tasks.  Notably, my dump files were bigger (6-12 MB) than the size phpmyadmin allows for imports (which cuts off at about 2 MB), but thankfully I was able to find and follow the instructions at technosophos that made the process a relative breeze by using GoDaddy's database restore functionality. 

I actually took three database dumps, from a Saturday near the launch of the game, a Saturday near the middle, and a recent Saturday.  I chose the same day of the week, Saturday, as this would allow me to compare week scores as well, should I choose to.  From here, I set up three DBs on GoDaddy, did the aforementioned restore process, and went to work with SQL and phpmyadmin.  At this point, I decided that I'd continue to work in flat files generated from exports, as I was running out of time and wanted to get to the fun part.

 

My Question

I wanted to see if I could find out more about the people who played for less than a day.  Specifically, one question I had concerned a game feature in outWord: in order to get rid of letters, you must shake the phone.  It wasn't easy to document this feature and it was never clear that people understood it.  Therefore, I wanted to see if there were a lot more people who quit playing with 6 or 7 letters.  While there are at least a few reasons that high letter counts might correlate with a player quitting, one is that they could be jammed up with 6 or 7 letters that they weren't able to construct into a word.  
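Stated as a computation, the question is whether the share of players holding 6 or 7 letters is higher among those who stopped within a day than among everyone else. A minimal sketch with made-up values follows; letterCount and game_secs mirror the column names used later in this post, while quit_early and full_hand are names invented for the example.

```r
# Toy snapshot; letterCount and game_secs mirror the column names used
# in this post, but the values are made up for illustration.
snapshot <- data.frame(
  letterCount = c(7, 7, 6, 1, 3, 7, 2, 4, 5, 1),
  game_secs   = c(600, 3600, 40000, 50000, 80000,
                  200000, 900000, 1500000, 2000000, 3000000)
)

one_day <- 60 * 60 * 24
snapshot$quit_early <- snapshot$game_secs < one_day
snapshot$full_hand  <- snapshot$letterCount >= 6

# Proportion of full hands among early quitters versus everyone else.
share <- tapply(snapshot$full_hand, snapshot$quit_early, mean)
share
```

In the toy data the early group has the higher share, but that is by construction; whether the real data shows the same is what the plots below try to find out.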

  

almost to R

Finally to R where my data issues were not at an end.  Dealing with R and date functions proved to be once again problematic.  Sometimes R seems fine converting strings to dates, sometimes not.  I think the problem was this time I was interested not just in days, but also more granular time than just days.  In researching the problem, I found that it was probably possible to do the necessary time conversions in R, but more trouble than it was worth.  After reading eSawdust's explanation of the R Time Conversion process, I noted their conclusion:

If you can do your date conversion in the data outside of “R” you are probably better off to convert the date in the data before import. The “R” date/time classes are clunky and produce unexpected results in most cases except the simplest (pure GMT or local times, but arbitrary timezones are not well handled in “R”.) 

That has certainly been my experience.  Thus, I chose to do these conversions in MySQL, which has a nice library of date functions that work mostly as advertised.  So I reworked the query a number of times, ending up with a flat file that I was able to easily export using phpmyadmin.  In the interest of time, I decided to hold off on implementing the R DBI mysql to a remote server until later.
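For reference, here is a sketch of how the same seconds-in-game figure could be computed inside R. as.Date() keeps only the calendar date, which is likely why sub-day granularity was a struggle; as.POSIXct() keeps the time of day. The timestamps below are hypothetical, and the "YYYY-MM-DD HH:MM:SS" format is an assumption about how the export looks.

```r
# Hypothetical timestamps in the "YYYY-MM-DD HH:MM:SS" form MySQL exports.
first_login  <- c("2010-01-02 08:15:00", "2010-01-09 23:50:00")
last_refresh <- c("2010-01-02 09:45:30", "2010-01-16 00:10:00")

# Pinning the time zone keeps the parse independent of the local machine.
fmt <- "%Y-%m-%d %H:%M:%S"
t0 <- as.POSIXct(first_login,  format = fmt, tz = "UTC")
t1 <- as.POSIXct(last_refresh, format = fmt, tz = "UTC")

# Similar in spirit to TIME_TO_SEC(TIMEDIFF(last_refresh, first_login)).
game_secs <- as.numeric(difftime(t1, t0, units = "secs"))
game_secs
```

One thing worth checking on the MySQL side: the TIME type only ranges up to 838:59:59, so TIMEDIFF may saturate for gaps longer than about 35 days, which difftime() does not do.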

Query Used:

SELECT users.username, users.week_score, users.total_words, users.user_id, TIME_TO_SEC( TIMEDIFF( users.last_refresh, users.first_login ) ) AS game_secs, WEEKOFYEAR( users.first_login ) AS game_week, COUNT( objects.object_id ) , users.last_refresh, users.first_login

FROM  `objects` ,  `users` 

WHERE users.user_id = objects.user_id

GROUP BY objects.user_id

 Then, at the bottom of phpmyadmin there is an export button.  I specified CSV, then set the options thus:

 

making sure to check the "Save as file" button and giving it a name.   

 

To R & ggplot2 (finally) 

Whew.  Finally.  In R, we can begin constructing a plot.  Bit by bit, we will construct a plot to see if I can find out what's going on with early users.  

 

 

#read in the file and convert the strings to dates, compute the other variables

> letts = read.delim(file="hw4/letters-gametime-middleSat.txt", sep="\t", header=TRUE)

> letts$first_login <- as.Date(letts$first_login)

> letts$last_refresh <- as.Date(letts$last_refresh)

> letts$game_days <- (letts$game_secs / (60*60*24))

 #because we want to look at players who quit after playing for a short time, let's filter off those who have just started.  We don't actually know if anyone quit or not, so we're just going to set those players who have only been at it a little while aside.  Let's just take a look at the section of the data without them.  This is a variant of the cemetery problem (more on that later).

> notNewbies <- letts[letts$last_refresh<(as.Date("2010-01-15")),] 

 

We want to compare the game time with the letter count for our theory.

> j <- ggplot(data=notNewbies)

> j <- j + aes(x=notNewbies$letterCount, y=notNewbies$game_secs) 

> j <- j + aes(x=factor(notNewbies$letterCount), y=notNewbies$game_secs)

Let's start with a box plot.  Discrete variables mapped to continuous is boxplot territory. 

> j + geom_boxplot() + coord_trans(y = "log10")

> ggsave(filename="badbox.png")

The results are less than fantastic.  Most of our data is toward 0, with some rare ones up top, and log transforming helps barely if at all. 

 

Let's try a histogram. 

> m <- ggplot(data=notNewbies) + geom_histogram(aes(x=game_secs, y=..density.., fill=factor(letterCount)))


> m <- ggplot(data=notNewbies) + geom_histogram(aes(x=game_secs, y=..density.., fill=factor(letterCount)), position="dodge", binwidth=500000)


> m <- ggplot(data=notNewbies) + geom_histogram(aes(x=game_secs, y=..density.., fill=factor(letterCount)), position="fill", binwidth=500000)

> m

> ggsave(filename="filled-hist-dens-secs.png")

Saving 7" x 7" image

So this one sort of hints at what the boxplot didn't: that there seem to be more 7's at the bottom of the spectrum.  But it's really noisy.  The problem is that most of the data is down there, as the dodged version shows. 

> m <- ggplot(data=notNewbies) + geom_histogram(aes(x=game_secs, y=..density.., fill=factor(letterCount)), position="stack", binwidth=500000)

> m

> ggsave(filename="lame-stack-hist-dens-secs.png")

 

Well, those are interesting.  You can see the others in resources. 

But for my money, here's the big tell.  A simple histogram of the players who have played under 500,000 seconds, a bit under 6 days (after filtering out new players).  There seems to be more at the ends, which corresponds to the theory that players are either downloading and not doing anything with the game (which one must expect since it's free), or else downloading, filling up with letters they can't use, and never figuring out how to get rid of them.  

 

 

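> shortTimers <- notNewbies[(notNewbies$game_secs < 500000),]

# n is not defined anywhere above; assuming it was a plot of letterCount for the short timers
> n <- ggplot(data=shortTimers, aes(x=letterCount))

> n + aes(fill=factor(shortTimers$letterCount)) + geom_histogram()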

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

> 500000 / (60 * 60 * 24)

[1] 5.787037

> ggsave(filename="under5daysHistLetterCount.png") 

One reasonable question you might have is: maybe the overall distribution looks the same.  Well, it sort of does.  But if you look at the inverse of this, the letter counts are much more mundane. Observe:

 

Part 3: Factors

Posted by jeremy on 01 April, 2010 22:31

I'm now going to take a look at factors in ggplot, as they might apply to the outWord data set.  Sorry for the delay in posting.

 So first off, I took a couple looks at the data.   I finally got the date functions that were screwy last time to work, and here's a histogram of first logins.  outWord got most of its users right away, and it's been slowly tailing off ever since.

  

This week, I was mainly interested in how player scores and words were different.   The following graphic shows the cuts where I drew the factors in number of words made, by color.  Number of words, as we affirmed last week, is pretty linearly related to scoring, but is far less long-tailed.
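A small sketch of how cut() assigns values to those bins, using the same break points and labels as the word factors in the scripts for this part. The example counts are made up.

```r
# Same break points and labels as the word factors in this post.
wordCuts <- c(-1, 0, 1, 10, 300, 1000, 2000, 9000)
wordCutLabels <- c("none", "one", "under10", "10-300",
                   "300-1000", "1000-2000", "over2000")

# cut() builds right-closed intervals by default: (-1,0], (0,1], (1,10], ...
# Starting the breaks at -1 is what lets a count of 0 land in "none".
example_counts <- c(0, 1, 10, 11, 300, 8500)
as.character(cut(example_counts, breaks = wordCuts, labels = wordCutLabels))
```

Because the intervals are closed on the right, exactly 10 words counts as "under10" and exactly 300 counts as "10-300". Anything above the last break (9000) would become NA.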

 

 

 

So using those cuts,  I wanted to see if the time people had been playing (in days) had any relationship to the number of words they made, and their scores.  And basically, there isn't much there.  The very few insanely high scorers are pretty evenly distributed over lengths of time, but they do all share the fact they make lots of words (though even that is spread out considerably).

 

 

To further tease things apart, I wondered if it was also true that high scorers were making more words.  Here the factor is scores, with a histogram of score per words.  

 

 

and here are the scripts:

# I had a lot of trouble with dates last time, (and this time) and I don't know exactly what i was doing wrong, but 

#it turns out to be pretty simple

> usrs$first_login_date <- as.Date(usrs$first_login)

> class(usrs$first_login_date)

# [1] "Date"

> usrs$last_refresh_date <- as.Date(usrs$last_refresh)


#if we take the first_login away from the last_refresh, we have the time in game.

> usrs$play_length <- (usrs$last_refresh_date - usrs$first_login_date)


> qplot(usrs$first_login_date, geom="histogram")

> ggsave(filename="histOfFirstLogin.png")


#here are the factors for words

> wordCuts = c(-1,0,1,10,300,1000,2000,9000)

> wordCutLabels = c("none","one","under10", "10-300", "300-1000", "1000-2000", "over2000" )

> wordFactors <- cut(usrs$total_words, breaks=wordCuts, labels=wordCutLabels)

> table(wordFactors)


#make a scatterplot of the words

> p <- ggplot(usrs, aes(as.numeric(play_length), score, color=wordFactors))

> p + geom_point()

 

#here are the factors for scores

> score_cuts <- c(-1,1,32,128,5000,10000,50000,100000,464601)

> score_labels <- c("none","almost none","little","medium","medium-high","high","really high", "insanely high")

> usrs$scoreFactor <- cut(usrs$score, breaks=score_cuts, labels=score_labels, ordered_result=TRUE)

> table(usrs$scoreFactor)

           

#this makes the histogram showing the score per word, factored by score            

> p <- ggplot(usrs, aes(scoreperword, fill=scoreFactor))

> p + geom_bar()

> ggsave(filename="histScorePerWordFactoredByScore.png")


#wordFactors

     none       one   under10    10-300  300-1000 1000-2000  over2000 

      587       110       270       269        41        11        13 


#Thanks go to:

#http://www.stat.berkeley.edu/classes/s133/factors.html

#http://dnfehrenbach.com/data_analysis/part3.html

#also here are some that I didn't run:

> qplot(as.numeric(play_length), data=usrs, geom="histogram", binwidth=10) 

Part 2: Quick Overview of the data

Posted by jeremy on 25 March, 2010 12:30

This week, I wanted to just show a few graphics that quickly give an overview of the kinds of data that I have available.

 

First, let's take a look at where the users are.  

 

 This graphic is a simple histogram of users per state.  I was actually unaware that there were so many users in California, almost as many as are in Michigan. 
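The per-state counts depend on pulling a state code out of a free-text location. A minimal sketch of that step follows, assuming US locations are stored in a "City, ST" shape; the strings here are made up, and the pattern is a close variant of the one in the scripts below, with [[:upper:]] in place of [A-Z] to avoid locale-dependent character ranges.

```r
# Made-up location strings; the "City, ST" shape is an assumption about
# how US locations are stored.
location <- c("Ann Arbor, MI", "Oakland, CA", "Berlin, Germany",
              "Detroit, MI", "London")

# Keep entries that end in a comma, a space, and two capital letters.
is_us <- grepl(",\\s[[:upper:]]{2}$", location)

# Strip everything up to the last ", " to leave the state code.
state <- sub("^.*,\\s", "", location[is_us])
table(state)
```

Anything that does not end in a two-letter code is silently treated as non-US, so a check of what gets excluded is worthwhile on the real data.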

 

 

This graphic shows a map of the US, with users' locations plotted.  There are actually a significant number of players around the world, especially in Europe.  I will create an infographic of their locations soon as well, but this shows that most of the players are around the coasts, and a lot seem to be in Michigan.  The bubble size indicates score.

 

 

This graphic shows a scatterplot of the words and scores.  Obviously there's a relationship between the two, as words are the only way players can score in the game.  However, take a look at the two top scorers.  One made over 8k words, the other has around 7k, but has a higher score.  I can infer from this that the player has a much higher score/word average, meaning they are on average making longer, more multiplied (using red letters) words.  

 

 Update: 4/13 That one is wrong.  here's the right one:

 

This final graphic shows a histogram of the time the players have spent in game.  (It needs work, as the date formatting functions seem to be giving me the wrong output, hence the negative numbers.)

 

And here are the scripts needed.  The data file is forthcoming, but I want to make sure I anonymize it first. 

#read in the data

usrs = read.delim(file="data/users.txt", sep='\t')

head(usrs)



#scatterplot of the words vs score

qplot(total_words, score, data=usrs, geom="jitter") + stat_smooth()



#map the US high scores, bubble size indicating score

justUS <- usrs$location %in% grep(",\\s[A-Z]{2}$", usrs$location, value=TRUE)

USusrs <- subset(usrs, justUS)

ggplot(USusrs, aes(USusrs$last_long, USusrs$last_lat)) + borders("state") + geom_point(aes(size=USusrs$score))



#histogram of the number of players by state

locUS <- usrs[grep(",\\s[A-Z]{2}$", usrs$location), ]

locUS$locstate <- sub("^.*,\\s", '', locUS$location)

qplot(locstate, data=locUS, geom="histogram")

ggsave(filename="playersBYstate.png") 

  

 

usrs$timeInGame <- as.numeric(usrs$last_refresh) - as.numeric(usrs$first_login)

qplot(usrs$timeInGame, data=usrs, geom="histogram")

Exploring ggplot part 1: a simple pie

Posted by jeremy on 17 March, 2010 19:07

 

This week, I will start with a simple pie chart.  Although these are frowned upon by the data elite, they can nonetheless be good at getting a quick glance at where the data lies, and are almost universally understood.  

For data, I wanted to see the distribution of word lengths created in outWord.  Over 100,000 words have been created now, so the data is there.  This will be a good demonstration of a couple important R and ggPlot concepts.  I want to start simple, so I'm going to try to build a pie chart of word length distributions. I know how to get this number from the db:

mysql> select count(word_id) from words where word REGEXP '^.{5}$';  

 The only clever thing I learned here is that you can in fact use regular expressions in mysql.  This is pretty close to the bee's knees.  This simple regular expression just makes sure that there are exactly 5 ("{5}") somethings (".").  I could also have done "\w" for word characters, but I know the data, and only words appear in this particular field.  I actually did this 6 times, once for each character count.  I know this is a bit more manual and doesn't scale, but we've only just begun.  I created a wee csv file for the compiled stats. To the R!
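If the words themselves were exported instead of the counts, all six numbers could come out of R in one pass. This is a sketch on a made-up vector standing in for the word column, not what was actually run.

```r
# A made-up sample standing in for the word column.
words <- c("cat", "dogs", "tree", "house", "planet", "letters", "maps")

# nchar() gives each word's length; table() counts words per length.
length_counts <- table(nchar(words))
length_counts

# The single-length count that the REGEXP query returns, for 5 letters.
sum(grepl("^.{5}$", words))
```

The table gives one count per word length, which is the same shape as the hand-compiled csv.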

#load ggplot2

library(ggplot2) 

#read it in to a file, using read.delim, specifying the separator as ",".  could also use read.csv

ezWords = read.delim("http://jeremycanfield.com/leverage/projectdata/wordcounts.csv", sep=",")

#check out the first part of the data with head 

head(ezWords) 

#assign the ggplot object to variable simplePlot

#ggPlot takes parameters here for the data (note that it must be a dataframe, no vectors)

#then aes, meaning aesthetics, meaning what you want to plot.  here its

#x=factor(1) i actually don't know what this means, but its needed to make the pie be continuous. (see below) 

#, number.of.words (a column in our data set) 

# fill = factor(letters) fills in the bars with our factor, the letters variable (number of letters)

simplePlot <- ggplot(ezWords, aes(x=factor(1),number.of.words,fill=factor(letters))) + xlab(NULL) + ylab(NULL)

#at this point, we've assigned a bunch of data and even talked a bit about how we want that data

#displayed, but we have to give it a representation.   I give it a bar, with the stat transformation of 

#identity (no transformation!) meaning just use the data in the variable. 

simplePlot = simplePlot + geom_bar(stat="identity",width=1) 

#so funny thing, a pie chart is a bar chart, with polar coordinates.  (see below for more on this) 

simplePlot = simplePlot + coord_polar(theta="y") 

simplePlot = simplePlot + opts(title ="Word Length Distribution") 

#let's see what we've got

simplePlot

word distribution pie with title

#to output it to your working directory: 

ggsave(filename="wordpie.png") 

Okay, that's not half bad!  Thanks to Hadley Wickham for the documentation.  Also check out his pac-man pie, which is old news, but still fun.

Thoughts, comments:  Well, it's tricky to see, but it looks like 4 letter words are the most popular, followed by 3 letter words, and so on and so forth.  That's interesting from my perspective, because 3 and 4 letter words are often fairly easy to make without actually moving (the proposed point of the game).  I'll have to dig a bit deeper to find out for sure.

 More weird word pies:

 

> simplePlot <- ggplot(ezWords, aes(x=factor(letters),number.of.words,fill=factor(letters))) + xlab(NULL) + ylab(NULL)

> simplePlot = simplePlot + geom_bar(stat="identity",width=1)

> simplePlot

 more pies and bars and wedges of the letter length distribution

> simplePlot + coord_polar()

more pies and bars and wedges of the letter length distribution

> simplePlot + coord_polar(theta="y")

more pies and bars and wedges of the letter length distribution

> simplePlot + coord_polar(theta="x")

> ggsave(filename="wordwedgesthetaX.png")

more pies and bars and wedges of the letter length distribution 

Well, of those experiments, I have to say that the standard bar chart seems to shed the most light on the subject.  It's interesting to note that 7 letter words outnumber 6 letter words, which do pretty poorly.  If people have 6, they go ahead and try to get that 7th.  Fun.  For the most part, they stick to the short ones, though.  I think next I'll see if the high scorers have a different word length distribution than the low scorers.  I bet they do.

It's interesting to note that setting theta to x produces the same output as not setting it at all.  Must be the default.  It's also interesting to note that without that "x=factor(1)", the pie won't form a circle.  The race track diagram is also kind of interesting to look at, but not immediately useful.  It might be good for progress indication of several factors at once, but the fact that the inner circle has a considerably shorter arc length to represent the same amount seems problematic.
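The arc length concern is simple geometry: arc length is radius times angle, so a bar on an inner ring is drawn shorter than a bar showing the same share on an outer ring. A quick worked example, with the share and the radii picked arbitrarily for illustration:

```r
# Share of the total shown by one bar, and the angle it sweeps.
share <- 0.25
theta <- 2 * pi * share

# Arc length is radius times angle, so the same share is drawn shorter
# on an inner ring than on an outer one.
radius <- c(inner = 1, outer = 3)
arc_length <- radius * theta
arc_length
```

With these radii the outer arc is three times as long as the inner one for an identical value, which is why comparing rings by eye is misleading.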

 

an introduction.

Posted by jeremy on 16 March, 2010 18:20

This blog will act as a repository for my interest in ggplot.  I will, over the course of the next few weeks, generate a number of data graphics.  At least at first (I reserve the right to change), I will begin this exploration using data from outWord, a game I helped develop last summer.  It is a geolocation-based iPhone game that is rather like a mix of Scrabble and a scavenger hunt: letters are spread out all over the world, and players pick them up and make words with them to score points.  More about the game can be found at phonagle.com.  

 The data will mostly come from the dumps of the outWord database, so fortunately it shouldn't require too much in the way of clean-up, but what is necessary I will try to document here.  I'm a little skittish about posting the mysql code, but I'll at least try and keep a copy of whatever data I use handy, save that which I think might contain personally identifiable information about our users.  I may try to obfuscate it in some way, especially the location data.