Monday, March 14, 2011

Unix Tail-like Functionality in PowerShell Revisited

My first attempt to replicate tail for PowerShell, which I wrote about in "Unix Tail-like Functionality in PowerShell", was horribly inefficient once you got past a couple dozen lines. That makes sense given the method I was using -- a byte-by-byte reverse read of the text file, converting one byte at a time to ASCII. I knew the solution was to "wolf down" large byte chunks and process them as a whole. I am now doing exactly that: reading multiple bytes into memory with System.IO.FileStream.Read and then decoding the whole chunk with System.Text.ASCIIEncoding.GetString. With this change in methodology, I am getting to within 3% of the speed of tail in UNIX in my tests. The largest test I've performed was returning 1,000,000 lines from an 850MB log file. A Mac OS X 10.6.6 workstation performed the task in 16 seconds using tail, and a Windows Server 2003 server returned in 17 seconds using this method. Good enough for me. Most of my needs are in the thousands of lines, which I am able to return in hundreds of milliseconds -- perfect for my monitoring scripts in Nagios. Compared to my previous attempt, this is a Lockheed SR-71 vs. a Wright Brothers Flyer. A small 5,000-line tail using the old code took 5 1/2 minutes to return, while this code returned the same lines in 200 milliseconds. Huge difference!
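The core of the change is just a backwards seek followed by one bulk read and decode. Here is a minimal sketch of that pattern, using a hypothetical log path; the full function below builds its loop around these same calls:

$path = "C:\Logs\sample.log" # hypothetical log file
$fileStream = New-Object System.IO.FileStream($path,[System.IO.FileMode]::Open,[System.IO.FileAccess]::Read,[System.IO.FileShare]::ReadWrite)
$byteChunk = [Math]::Min(10240, $fileStream.Length) # never seek back past the start of the file
$fileStream.Seek(-$byteChunk, [System.IO.SeekOrigin]::End) | Out-Null # jump near the end of the file
[byte[]] $buffer = New-Object byte[] $byteChunk
$bytesRead = $fileStream.Read($buffer, 0, $byteChunk) # one bulk read instead of byte-by-byte
$text = (New-Object System.Text.ASCIIEncoding).GetString($buffer, 0, $bytesRead) # decode the whole chunk at once
$fileStream.Close()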

In the code sample below, I am using 10 kilobytes for that chunking; I found that number suited most of my needs. However, you can greatly increase it when a large number of lines is to be returned (I used 4MB for my million-line test). You can also do a little automatic tuning by scaling the byte count to the number of lines you are seeking. One thing to be aware of when passing files to this code: if you pass a file name to System.IO.File/FileStream without a full path, it will not assume the file is located in the path of the executing script, so Test-Path is not a valid existence check. You can see which directory .NET will actually resolve against by running the following in PowerShell:
[System.IO.Directory]::GetCurrentDirectory()
More than likely, it will point to the home directory of the profile the shell is executed under.
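One way to avoid the mismatch is to resolve the name to a full path within the PowerShell session before handing it to the function. A minimal sketch, assuming a relative path (Resolve-Path errors out on missing files, hence the guard):

$fileName = ".\really-huge.log" # relative path, hypothetical file
if(Test-Path $fileName) {
 # Resolve-Path resolves against the shell's current location, not .NET's current directory
 $fileName = (Resolve-Path $fileName).Path
}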

Also be aware that this tail-like function does not handle Unicode log files. The method I am using to decode the bytes is ASCII-dependent; I am not yet using System.Text.UnicodeEncoding in the code. Currently ASCII meets all my needs for reading log files, but I am still interested in adding that compatibility to this function. I am also assuming that every log file denotes the end of a line with a carriage return and line feed (CHR 13 + CHR 10), which is how the majority of text files are written in Windows. UNIX and old-style Macintosh text files will not work properly with this code; you will need to change the "\r\n" delimiter in the regex split for those formats, as sketched below.
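If you want one pattern that covers all three conventions, a regex alternation like the following should work as the split delimiter (a sketch on my part; I have only tested the CR+LF path). Listing \r\n first matters -- otherwise a Windows line ending would match as two separate delimiters and produce empty lines:

$chunkOfText.AddRange(([System.Text.RegularExpressions.Regex]::Split($asciiEncoding.GetString($bytesRead),"\r\n|\r|\n")))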

UPDATE: I have now finished an update that provides the "tail -f" functionality for continuously reading updates to a text file. Read about it in my blog post, Replicating UNIX "tail -f" in PowerShell.

UPDATE: I have updated the code to handle Unicode text files and non-Windows newlines. You can review the code here.
Function Read-EndOfFileByByteChunk($fileName,$totalNumberOfLines,$byteChunk) {
 # Sanitize arguments: always return at least one line, default to 10KB chunks
 if($totalNumberOfLines -lt 1) { $totalNumberOfLines = 1 }
 if($byteChunk -le 0) { $byteChunk = 10240 }
 $linesOfText = New-Object System.Collections.ArrayList
 if([System.IO.File]::Exists($fileName)) {
  # Open read-only with shared access so a process appending to the log is not blocked
  $fileStream = New-Object System.IO.FileStream($fileName,[System.IO.FileMode]::Open,[System.IO.FileAccess]::Read,[System.IO.FileShare]::ReadWrite)
  $asciiEncoding = New-Object System.Text.ASCIIEncoding
  $fileSize = $fileStream.Length
  $byteOffset = $byteChunk
  [byte[]] $bytesRead = New-Object byte[] $byteChunk
  $totalBytesProcessed = 0
  $lastReadAttempt = $false
  do {
   # If the next seek would run past the start of the file, shrink the chunk
   # to the unread remainder and flag this pass as the final read
   if($byteOffset -ge $fileSize) {
    $byteChunk = $fileSize - $totalBytesProcessed
    [byte[]] $bytesRead = New-Object byte[] $byteChunk
    $byteOffset = $fileSize
    $lastReadAttempt = $true
   }
   # Seek backwards from the end of the file and read one chunk
   $fileStream.Seek((-$byteOffset), [System.IO.SeekOrigin]::End) | Out-Null
   $fileStream.Read($bytesRead, 0, $byteChunk) | Out-Null
   # Decode the chunk as ASCII and split it into lines on CR+LF
   $chunkOfText = New-Object System.Collections.ArrayList
   $chunkOfText.AddRange(([System.Text.RegularExpressions.Regex]::Split($asciiEncoding.GetString($bytesRead),"\r\n")))
   # The first "line" of a chunk is usually partial, so back the next seek
   # offset up by its length to re-read that line in full next pass
   $firstLineLength = ($chunkOfText[0].Length)
   $byteOffset = ($byteOffset + $byteChunk) - ($firstLineLength)
   if($lastReadAttempt -eq $false -and $chunkOfText.count -lt $totalNumberOfLines) {
    $chunkOfText.RemoveAt(0)
   }
   $totalBytesProcessed += ($byteChunk - $firstLineLength)
   # Prepend this chunk's lines, since the file is being read backwards
   $linesOfText.InsertRange(0, $chunkOfText)
  } while($totalNumberOfLines -ge $linesOfText.count -and $lastReadAttempt -eq $false -and $totalBytesProcessed -lt $fileSize)
  $fileStream.Close()
  # Drop the empty element left behind by the file's trailing CR+LF
  if($linesOfText.count -gt 1) {
   $linesOfText.RemoveAt($linesOfText.count-1)
  }
  # Trim any surplus lines from the front so exactly the requested count remains
  $deltaLines = ($linesOfText.count - $totalNumberOfLines)
  if($deltaLines -gt 0) {
   $linesOfText.RemoveRange(0, $deltaLines)
  }
 } else {
  $linesOfText.Add("[ERROR] $fileName not found") | Out-Null
 }
 return $linesOfText
}
#--------------------------------------------------------------------------------------------------#
$fileName = "C:\Logs\really-huge.log" # Your really big log file
$numberOfLines = 100 # Number of lines from the end of the really big log file to return
$byteChunk = 10240 # Size of bytes read per seek during the search for lines to return
####################################################################################################
## This is a possible self-tuning method you can use, but it will blow up memory when an
## enormous number of lines is requested
## $byteChunk = $numberOfLines * 256 
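## A bounded variant (a sketch I have not benchmarked) caps the chunk size so memory stays fixed:
## $byteChunk = [Math]::Min($numberOfLines * 256, 4MB)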
####################################################################################################
$lastLines = @()

$lastLines = Read-EndOfFileByByteChunk $fileName $numberOfLines $byteChunk
foreach($lineOfText in $lastLines) {
 Write-Output $lineOfText
}
