2011-07-03

Crowdsourcing

Or: the GEO database and stupid Turkish people.

Crowdsourcing is an interesting idea. It’s based on the assumption that a large and difficult (costly) task can be divided (in a reality where we have computers and the Internet) into smaller tasks (sub-tasks, chunks) and performed by a crowd of people who happen to have just the basic required knowledge and a bit of time, each doing a small part of the bigger task. It is a task that would take one person hours or days to complete, and one person probably wouldn’t have all the data or expertise needed.

I decided to use this approach myself to tackle (and possibly complete) two rather difficult tasks:

1. Translating a program interface (labels, dialogs, messages, etc.) into some foreign languages (I only know Polish, English, and basic German and Russian). In this particular case it’s my freeware Button Generator for Windows (I called it the Button Generator Translation Project).

2. Collecting and updating geolocation data for all Polish IP addresses (geo.4u.pl) for use with my STAT4U system (the first Polish web statistics service).


Here are some of my observations on the subject after a few years in the process.

In general, people are helpful. They seem to want to contribute out of their own good will. That really keeps my belief in humanity up.

Translating Button Generator into Spanish and Italian wouldn’t have been possible without this great external help. Google Translate is still not an option, as it often tends to produce real crap. Collecting geolocation data is not possible without such help either. Some things can be learned from whois or DNS, but a lot can’t. In Poland we have some ISPs whose whois data is a total mess. For example, Neostrada (by TP S.A.) is one big mess. Big IP ranges are described as Warszawa or Poznań or Lublin, while in fact they are totally fragmented across the whole country. Some other ISPs, like UPC or NETIA, have no city descriptions in whois at all. So the only way is to rely on what people tell you. And they usually give honest and accurate info (I always try to verify it to the maximum extent possible).

These were the positive aspects of crowdsourcing. The negative ones are really annoying. I understand they are impossible to avoid, but ...

Why the hell are people from Turkey so stupid? They may not understand Polish. That’s OK. They may not even understand English. But why do they have to fill in a web form they don’t understand? Why do they select the city Turek (which is a CITY and is in Poland) and submit it as their location, when the form clearly says “We don't collect GEO information from outside Poland. If you are not in Poland please do not submit this form since your data will be discarded anyway.”??? This happens every time I update the database (which is only semi-automatic). And there are always some IPs from other countries entered as Polish cities. WTF?
What I mean is that I always process the data, so why do people bother to enter such crap anyway?

Button Generator Translation Project: here another kind of crap often happens. Someone (or somebot) enters a lot of random text with some spam-like URLs in it. I had to fix the form to be immune to (or ignorant of) HTML tags, URLs and e-mail addresses. This is something that should probably have been done from the beginning, but I didn’t realize someone would actually do bad things here. There is no reason to: again, I will process everything manually when it’s time (when 100% is translated). UPDATE: I need to fix it more, it’s still not 100% immune!
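To illustrate what "immune to HTML tags, URLs and e-mails" could look like, here is a minimal Python sketch. It is not the code behind the actual form. The function names (`sanitize_translation`, `looks_like_spam`), the regular expressions and the 50% threshold are all assumptions made up for this example, and simple patterns like these will certainly miss some tricks.

```python
import re

# Deliberately simple patterns (assumptions for this sketch, not the real form's rules).
TAG_RE = re.compile(r"<[^>]*>")                            # anything that looks like an HTML tag
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def sanitize_translation(text: str) -> str:
    """Remove HTML tags, URLs and e-mail addresses, then collapse whitespace."""
    for pattern in (TAG_RE, URL_RE, EMAIL_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def looks_like_spam(raw: str, max_removed_ratio: float = 0.5) -> bool:
    """Flag a submission when it is empty or when most of it was stripped away."""
    stripped = raw.strip()
    if not stripped:
        return True
    cleaned = sanitize_translation(stripped)
    return len(cleaned) < len(stripped) * (1 - max_removed_ratio)
```

A real label such as `Save file` passes through unchanged, while a submission that consists mostly of links shrinks to almost nothing and gets flagged for manual review instead of landing in the translation table.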

Conclusion

I would guess that when you use crowdsourcing to accomplish something, the quality distribution of the contributions will be more or less Gaussian. One has to apply some filters to the incoming data anyway: as usual, always validate user input (any input). So the standard "do not trust" rule (good old-school paranoia) is ALWAYS required. Then, and only then, will there be an overall benefit in the process.
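For the GEO form, such a filter could be sketched roughly like this in Python. This is only an illustration under stated assumptions: the network list uses placeholder documentation ranges (not real Polish allocations), and the city list and the `validate_submission` function are hypothetical, not part of geo.4u.pl.

```python
import ipaddress
from typing import Tuple

# Placeholder ranges (RFC 5737 documentation blocks), NOT real Polish allocations.
POLISH_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Hypothetical list of cities the form offers.
KNOWN_CITIES = {"Warszawa", "Lublin", "Turek"}


def validate_submission(ip: str, city: str) -> Tuple[bool, str]:
    """Return (accepted, reason) for one crowdsourced IP-to-city report."""
    try:
        address = ipaddress.ip_address(ip.strip())
    except ValueError:
        return False, "invalid IP address"
    # Discard reports coming from outside the known Polish ranges.
    if not any(address in network for network in POLISH_NETWORKS):
        return False, "IP outside known Polish ranges"
    if city.strip() not in KNOWN_CITIES:
        return False, "unknown city"
    return True, "ok"
```

With a check like this, a report of "Turek" from an address outside the listed ranges is rejected automatically before it reaches the semi-automatic update, and only the plausible reports are left for manual verification.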

p.s.
It seems to take a lot of time to complete a crowdsourced task, unless some marketing is done around it.