Feb 10, 2010

Easy way of determining the number of lines/records in a given large file using R

Dear Readers,

Today I would like to share an easy way of determining the number of lines/records in any given large file using R.

Straight to the point:

1) If the data set is small, say less than 50MB or so, one can read it in R with ease using:
length(readLines("xyzfile.csv"))

2) But if the data set is too large, say more than 1GB, then reading it this way runs into R's memory limit, since readLines() pulls all the records into memory before the count is taken.

3) So, how does one determine the number of lines for a large data set without running into memory problems?

a) First, for a file of, let's say, about half a GB or one million records/observations (assuming you have 2GB of RAM on your PC), the code below determines the number of records with no memory-related errors, because it reads the file through a connection in chunks of 20,000 lines, so only one chunk is in memory at a time:

testcon <- file("xyzfile.csv", open = "r")
readsizeof <- 20000
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0)
    nooflines <- nooflines + linesread
close(testcon)
nooflines
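As a quick self-check of the chunked approach (a minimal sketch; the temporary file and its contents are made up for illustration), the loop's count should agree with the naive length(readLines()) count on a file small enough to read whole:

```r
# Write a small throwaway file (illustrative only), then count its
# lines both ways and compare.
tmp <- tempfile(fileext = ".csv")
writeLines(as.character(1:55000), tmp)  # 55,000 one-number records

testcon <- file(tmp, open = "r")
readsizeof <- 20000                     # lines per chunk
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0)
    nooflines <- nooflines + linesread  # add each chunk's line count
close(testcon)

nooflines               # chunked count: 55000
length(readLines(tmp))  # naive count gives the same answer
unlink(tmp)
```

Note that 55,000 lines span three chunks (20,000 + 20,000 + 15,000), so the loop body genuinely runs more than once before readLines() returns an empty chunk.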

b) Next, even for files larger than half a GB, one can determine the number of records by bzip2-compressing the file and running the same code on the compressed copy:
testcon <- file("xyzfile.csv.bz2", open = "r")
readsizeof <- 20000
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0)
    nooflines <- nooflines + linesread
close(testcon)
nooflines

The second method has the advantage of disk-space efficiency: from version 2.10 onwards, R can read compressed files directly.
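To make the compression handling explicit (a minimal sketch; the files here are created on the fly for illustration), one can open the compressed copy with bzfile() instead of relying on file()'s transparent decompression, and the same counting loop works unchanged:

```r
# Create a small file and a bzip2-compressed copy of it (illustrative only).
tmp  <- tempfile(fileext = ".csv")
tmpz <- paste0(tmp, ".bz2")
writeLines(as.character(1:1000), tmp)

zcon <- bzfile(tmpz, open = "w")
writeLines(readLines(tmp), zcon)
close(zcon)

# Count lines of the compressed file through a bzfile() connection;
# only one 200-line chunk is decompressed into memory at a time.
testcon <- bzfile(tmpz, open = "r")
nooflines <- 0
while ((linesread <- length(readLines(testcon, 200))) > 0)
    nooflines <- nooflines + linesread
close(testcon)
nooflines  # 1000

unlink(c(tmp, tmpz))
```

The explicit bzfile() (and its siblings gzfile() and xzfile()) also works on R versions where a plain file() connection would not auto-detect the compression.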

Thus, I hope readers will find these easy methods useful next time.

Happy programming with R. The author can be reached at mavuluri.pradeep@gmail.com.

2 comments:

Denis said...

... or on Unix
as.integer(system("wc -l xyzfile.csv | awk '{print $1}'",intern=TRUE))

Alex Zolot said...

On win+cygwin:
shell('c:/cygwin/bin/wc.exe -l filename', wait=T, intern=T)