I recently created a sizable human-coded dataset (5,000 items) of international news using the Prolific Academic service. Prolific Academic is an Oxford-based academic alternative to Amazon Mechanical Turk. The advantage of using this service is that researchers only have to pay for work that they approve. The potential drawback is its relatively high cost: the service requires researchers to offer ‘ethical rewards’ to participants, with a minimum rate of £5 per hour. Most of the participants on Prolific Academic are university students, although the same may be true of Mechanical Turk.
One of the reasons I chose Prolific Academic over Amazon Mechanical Turk was that classification of international news stories by Turkers may not be very accurate, since Americans are infamous for their lack of knowledge about foreign events.
The classification accuracy of the Prolific Academic participants in my project is shown below by country. Locations of participants (based on IP addresses) are concentrated in three countries, the UK, the US and India, and the estimated accuracy (0-10) of the coding by the participants seems to support my hypothesis: Americans are not good at analyzing international news stories…
| Country        | Accuracy (0-10) | n   | Percent |
|----------------|-----------------|-----|---------|
| Austria        | 7.00            | 1   | 0.3     |
| Thailand       | 6.00            | 4   | 1.3     |
| Viet Nam       | 6.00            | 4   | 1.3     |
| United Kingdom | 5.93            | 160 | 51.8    |
| Canada         | 5.67            | 3   | 1.0     |
| Spain          | 5.67            | 9   | 2.9     |
| Romania        | 5.60            | 5   | 1.6     |
| United States  | 5.19            | 26  | 8.4     |
| India          | 4.96            | 93  | 30.1    |
| Czech Republic | 4.50            | 2   | 0.6     |
| Portugal       | 4.00            | 1   | 0.3     |
| Philippines    | 3.00            | 1   | 0.3     |
The estimated accuracy of the US participants is much lower than that of their UK counterparts. The low accuracy of the Indian participants seems to be due to their limited English language skills. Despite the prerequisite that English be participants’ first language, the high hourly rate, which is very close to the minimum wage in the UK, attracted many less qualified people. Indians account for only 2% of registrants to the service, but they made up 30% of the participants in this project.
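A per-country summary of this kind can be produced with a simple groupby. The sketch below uses pandas with hypothetical column names and a tiny illustrative dataset, not the original coding records:

```python
import pandas as pd

# Hypothetical coding records: one row per completed task, with the
# participant's country and a 0-10 accuracy score for that task.
# (Column names and values are illustrative, not the project's data.)
df = pd.DataFrame({
    "country": ["United Kingdom", "United Kingdom", "United States",
                "India", "India"],
    "accuracy": [6.0, 5.8, 5.2, 5.0, 4.9],
})

# Mean accuracy, record count, and share of the sample per country,
# sorted from most to least accurate.
summary = (
    df.groupby("country")["accuracy"]
      .agg(accuracy="mean", n="size")
      .assign(percent=lambda t: (100 * t["n"] / t["n"].sum()).round(1))
      .sort_values("accuracy", ascending=False)
)
print(summary)
```

The `percent` column here is the share of records, mirroring the table above; in practice one might instead count unique participants per country.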
I was expecting that participants’ classification accuracy would increase as they performed more tasks, but quite the opposite was the case. Some participants did really good jobs initially, but their classification accuracy usually decreased and sometimes dropped below 70%. The declining tendency in performance can be explained by participants’ attempts at cost minimization: once approved, they have an incentive to spend less effort per task.
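One way to catch this decline during a running project is to track a rolling accuracy per participant and flag anyone who drops below a threshold. A minimal sketch, assuming a task log with invented field names and toy data:

```python
import pandas as pd

# Hypothetical task log: tasks listed in the order each participant
# completed them, with 1 for a correct classification and 0 otherwise.
# (Names and data are illustrative, not the original records.)
log = pd.DataFrame({
    "participant": ["p1"] * 6,
    "correct":     [1, 1, 1, 0, 1, 0],
})

# Rolling accuracy over each participant's last 3 tasks; a sustained
# drop below 0.7 could trigger a manual review or exclusion.
log["rolling_acc"] = (
    log.groupby("participant")["correct"]
       .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
flagged = log[log["rolling_acc"] < 0.7]
print(flagged)
```

The window size and the 0.7 cutoff are arbitrary choices here; they would need tuning against the project's quality requirements.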
These observations raise questions about crowd-sourced content analysis:
- Is Amazon Mechanical Turk always the best crowd-sourcing platform?
- Should we offer different amounts of reward to participants according to country of residence?
- How can we maintain or improve the performance of participants over the course of a project?