A blog about designing and developing Business Intelligence solutions using Microsoft SQL Server
Header

T-SQL: Get a List of Most Frequent Words in a String

October 21st, 2011 | Posted by David Stewart in T-SQL

I saw a thread on MSDN Forums recently asking for T-SQL to get a list of the top 10 most frequent words in a given string. While this could be done using cursors, I have a general dislike of them! It got me interested in whether or not this could be achieved using a common table expression…

So I had a think and put together some T-SQL, with some success (it didn’t have a syntax error or run forever). After a few changes I had broken the string up into a table of words, with a bit to indicate whether or not each record is a valid word. In this case, I just defined a word as a string of characters containing only a-z or A-Z, followed by any other character (or no character). This could easily be adjusted to include other characters though, for example ‘ to allow we’re, haven’t, etc.

Essentially the logic behind the CTE is…

  1. You have two integers a and b – one indicating the start of the current word (a), the other the possible length of the word (b).
  2. If the character at a isn’t an alphabetic character, then this isn’t the start of a word. So we need to move a to the next character.
  3. If the character at a + b (one character after the end of the possible current word) isn’t an alphabetic character then we’ve found a word. So we should mark the string starting at a of length b as a word. In this case we set a = a + b + 1, and reset b = 1. If a + b is an alphabetic character, then we haven’t reached the end of the word yet and we need to increment b by 1.

Below is the T-SQL I used, applied over this post.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
DECLARE @Text var char(max) = ''; -- remove space from var char

-- A - Z = 65 = 90
-- a - z = 97 = 122

WITH
Chars AS (

SELECT
   1 AS a_in
,  1 AS b_in
,  CASE
      WHEN
         ASCII(SUBSTRING(@Text, 1, 1)) NOT BETWEEN 65 AND 90
      AND   ASCII(SUBSTRING(@Text, 1, 1)) NOT BETWEEN 97 AND 122
      THEN
         2
      ELSE
         1
   END AS a
,  1 AS b
,  SUBSTRING(@Text, 1, 0) AS Word
,  0 AS IsWord

UNION ALL

SELECT
   a AS a_in
,  b AS b_in
,  CASE
      WHEN
         ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 65 AND 90
      AND   ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 97 AND 122
      THEN
         a + 1
      WHEN
         ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 65 AND 90
      OR ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 97 AND 122
      THEN
         a
      ELSE
         a + 1 + b
   END AS a
,  CASE
      WHEN
         ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 65 AND 90
      AND   ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 97 AND 122
      THEN
         b
      WHEN
         ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 65 AND 90
      OR ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 97 AND 122
      THEN
         b + 1
      ELSE
         1
   END AS b
,  SUBSTRING(@Text, a, b) AS Word
,  CASE
      WHEN
         ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 65 AND 90
      OR ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 97 AND 122
      OR (
            ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 65 AND 90
         AND   ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 97 AND 122
         )
      THEN
         0
      ELSE
         1
   END AS IsWord
FROM
   Chars
WHERE
   a < LEN(@Text)

)

SELECT TOP 10
   Word
,  COUNT(*) AS NumberOfTimesUsed
FROM
   Chars
WHERE
   IsWord = 1
GROUP BY
   Word
ORDER BY
   2 DESC
OPTION
   (MAXRECURSION 0)
…and a screen shot of the ten most frequently used words. Apparently I like short words…
Top 10 Most Frequent Words

You can follow any responses to this entry through the RSS 2.0 You can leave a response, or trackback.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>