T-SQL: Get a List of Most Frequent Words in a String
October 21st, 2011 | Posted by in T-SQLI saw a thread on MSDN Forums recently asking for T-SQL to get a list of the top 10 most frequent words in a given string. While this could be done using cursors, I have a general dislike of them! It got me interested in whether or not this could be achieved using a common table expression…
So I had a think and put together some T-SQL, with some success (it didn’t have a syntax error or run forever). After a few changes I had broken the string up into a table of words, with a bit to indicate whether or not each record is a valid word. In this case, I just defined a word as a string of characters containing only a-z or A-Z, followed by any other character (or no character). This could easily be adjusted to include other characters though, for example ‘ to allow we’re, haven’t, etc.
Essentially the logic behind the CTE is…
- You have two integers a and b – one indicating the start of the current word (a), the other the possible length of the word (b).
- If the character at a isn’t an alphabetic character, then this isn’t the start of a word. So we need to move a to the next character.
- If the character at a + b (one character after the end of the possible current word) isn’t an alphabetic character then we’ve found a word. So we should mark the string starting at a of length b as a word. In this case we set a = a + b + 1, and reset b = 1. If a + b is an alphabetic character, then we haven’t reached the end of the word yet and we need to increment b by 1.
Below is the T-SQL I used, applied over this post.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 | DECLARE @Text var char(max) = ''; -- remove space from var char -- A - Z = 65 = 90 -- a - z = 97 = 122 WITH Chars AS ( SELECT 1 AS a_in , 1 AS b_in , CASE WHEN ASCII(SUBSTRING(@Text, 1, 1)) NOT BETWEEN 65 AND 90 AND ASCII(SUBSTRING(@Text, 1, 1)) NOT BETWEEN 97 AND 122 THEN 2 ELSE 1 END AS a , 1 AS b , SUBSTRING(@Text, 1, 0) AS Word , 0 AS IsWord UNION ALL SELECT a AS a_in , b AS b_in , CASE WHEN ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 65 AND 90 AND ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 97 AND 122 THEN a + 1 WHEN ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 65 AND 90 OR ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 97 AND 122 THEN a ELSE a + 1 + b END AS a , CASE WHEN ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 65 AND 90 AND ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 97 AND 122 THEN b WHEN ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 65 AND 90 OR ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 97 AND 122 THEN b + 1 ELSE 1 END AS b , SUBSTRING(@Text, a, b) AS Word , CASE WHEN ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 65 AND 90 OR ASCII(SUBSTRING(@Text, a + b, 1)) BETWEEN 97 AND 122 OR ( ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 65 AND 90 AND ASCII(SUBSTRING(@Text, a, 1)) NOT BETWEEN 97 AND 122 ) THEN 0 ELSE 1 END AS IsWord FROM Chars WHERE a < LEN(@Text) ) SELECT TOP 10 Word , COUNT(*) AS NumberOfTimesUsed FROM Chars WHERE IsWord = 1 GROUP BY Word ORDER BY 2 DESC OPTION (MAXRECURSION 0) |
You can follow any responses to this entry through the RSS 2.0 You can leave a response, or trackback.
