Counting word frequencies is a common task in text analysis. In this post, I describe how to count word frequencies using a Java HashMap, a Python dictionary, and Spark.
Use Java HashMap to Count Word Frequency
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JavaHashMapExample {

    // Builds a word -> count histogram from the given list of words.
    public Map<String, Integer> countFreq(List<String> list) {
        Map<String, Integer> histogram = new HashMap<>();
        for (String word : list) {
            if (!histogram.containsKey(word)) {
                // First occurrence of this word.
                histogram.put(word, 1);
            } else {
                histogram.put(word, histogram.get(word) + 1);
            }
        }
        return histogram;
    }

    public static void main(String[] args) {
        String[] a = {"a", "b", "c", "d", "a", "b", "a", "d", "a", "c", "c", "d", "a", "c", "c", "c"};
        List<String> wordList = Arrays.asList(a);
        System.out.println(new JavaHashMapExample().countFreq(wordList));
    }
}
The output:
{a=5, b=2, c=6, d=3}
Use Python Dict to Count Word Frequency
words = ["a", "b", "c", "d", "a", "b", "a", "d", "a", "c", "c", "d", "a", "c", "c", "c"]

freq = {}
for w in words:
    if w in freq:
        freq[w] += 1
    else:
        # First occurrence of this word.
        freq[w] = 1

# print() with parentheses works in both Python 2 and Python 3.
print(freq)
The output (the key order may differ depending on the Python version):
{'a': 5, 'c': 6, 'b': 2, 'd': 3}
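As a shorter alternative to the explicit loop, the standard library's collections.Counter builds the same histogram in one call. This is a minimal sketch of that variant, not a replacement for the dictionary version above; the word list is the same one used in the example.

```python
from collections import Counter

words = ["a", "b", "c", "d", "a", "b", "a", "d",
         "a", "c", "c", "d", "a", "c", "c", "c"]

# Counter is a dict subclass that counts hashable items.
freq = Counter(words)

# most_common(n) returns the n most frequent items with their counts.
print(freq.most_common(2))  # [('c', 6), ('a', 5)]
```

Because Counter is a dict subclass, lookups such as freq["a"] work just as they do with a plain dictionary.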
Use Spark to Count Word Frequency
The methods above work well for small datasets. However, if the dataset is too large to fit in the memory of a single machine, the hash-table-based approach will not work. You will need a distributed program to accomplish this task.
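The idea behind a distributed word count can be sketched in plain Python without a cluster: count each partition of the data independently, then merge the partial counts. The sketch below is not Spark code and only simulates the pattern on one machine. The helper names count_partition and merge_counts and the partition size of 4 are assumptions made for illustration. In Spark's RDD API, the same pattern is commonly expressed as a map to (word, 1) pairs followed by reduceByKey.

```python
from collections import Counter
from functools import reduce

def count_partition(partition):
    """Map step: count the words of one partition on its own."""
    return Counter(partition)

def merge_counts(left, right):
    """Reduce step: combine two partial counts by summing per word."""
    return left + right

words = ["a", "b", "c", "d", "a", "b", "a", "d",
         "a", "c", "c", "d", "a", "c", "c", "c"]

# Hypothetical split into chunks of 4 words; on a real cluster each
# partition would live on a different worker.
partitions = [words[i:i + 4] for i in range(0, len(words), 4)]

partial_counts = [count_partition(p) for p in partitions]
total = reduce(merge_counts, partial_counts, Counter())

print(sorted(total.items()))  # [('a', 5), ('b', 2), ('c', 6), ('d', 3)]
```

Each partial count only needs to hold the words of its own partition, which is what lets the work be spread across machines when no single one can hold the whole dataset.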