FFTW-3.3.2: changelog

FFTW 3.3.2

* Removed an archaic stack-alignment hack that was failing with
  gcc-4.7/i386.

* Added stack-alignment hack necessary for gcc on Windows/i386.  We
  will regret this in ten years (see previous change).

* Fix incompatibility with Intel icc which pretends to be gcc
  but does not support quad precision.

* make libfftw{threads,mpi} depend upon libfftw when using libtool;
  this is consistent with most other libraries and simplifies the life
  of various distributors of GNU/Linux.

FFTW 3.3.1

* Changes since 3.3.1-beta1:

  - Reduced planning time in estimate mode for sizes with large
    prime factors.

  - Added AVX autodetection under Visual Studio.  Thanks Carsten
    Steger for submitting the necessary code.

  - Modern Fortran interface now uses a separate fftw3l.f03 interface
    file for the long double interface, which is not supported by
    some Fortran compilers.  Provided new fftw3q.f03 interface file
    to access the quadruple-precision FFTW routines with recent
    versions of gcc/gfortran.

* Added support for the NEON extensions to the ARM ISA.  (Note to beta
  users: an ARM cycle counter is not yet implemented; please contact
  fftw@fftw.org if you know how to do it right.)

* MPI code now compiles even if mpicc is a C++ compiler; thanks to
  Kyle Spyksma for the bug report.

FFTW 3.3

* Changes since 3.3-beta1:

  - Compiling OpenMP support (--enable-openmp) now installs a
    fftw3_omp library, instead of fftw3_threads, so that OpenMP
    and POSIX threads (--enable-threads) libraries can be built
    and installed at the same time.

  - Various minor compilation fixes, corrections of manual typos, and
    improvements to the benchmark test program.

* Add support for the AVX extensions to x86 and x86-64.  The AVX code
  works with 16-byte alignment (as opposed to 32-byte alignment),
  so there is no ABI change compared to FFTW 3.2.2.

* Added Fortran 2003 interface, which should be usable on most modern
  Fortran compilers (e.g. gfortran) and provides type-checked access
  to the the C FFTW interface.  (The legacy Fortran-77 interface is
  still included also.)

* Added MPI distributed-memory transforms.  Compared to 3.3alpha,
  the major changes in the MPI transforms are:
    - Fixed some deadlock and crashing bugs.
    - Added Fortran 2003 interface.
    - Added new-array execute functions for MPI plans.
    - Eliminated use of large MPI tags, since Cray MPI requires tags < 2^24;
      thanks to Jonathan Bentz for the bug report.
    - Expanded documentation.
    - 'make check' now runs MPI tests
    - Some ABI changes - not binary-compatible with 3.3alpha MPI.

* Add support for quad-precision __float128 in gcc 4.6 or later (on x86.
  x86-64, and Itanium).  The new routines use the fftwq_ prefix.

* Removed support for MIPS paired-single instructions due to lack of
  available hardware for testing.  Users who want this functionality
  should continue using FFTW 3.2.x.  (Note that FFTW 3.3 still works
  on MIPS; this only concerns special instructions available on some
  MIPS chips.)

* Removed support for the Cell Broadband Engine.  Cell users should
  use FFTW 3.2.x.

* New convenience functions fftw_alloc_real and fftw_alloc_complex
  to use fftw_malloc for real and complex arrays without typecasts
  or sizeof.

* New convenience functions fftw_export_wisdom_to_filename and
  fftw_import_wisdom_from_filename that export/import wisdom
  to a file, which don't require you to open/close the file yourself.

* New function fftw_cost to return FFTW's internal cost metric for
  a given plan; thanks to Rhys Ulerich and Nathanael Schaeffer for the
  suggestion.

* The --enable-sse2 configure flag now works in both double and single
  precision (and is equivalent to --enable-sse in the latter case).

* Remove --enable-portable-binary flag: we new produce portable binaries
  by default.

* Remove the automatic detection of native architecture flag for gcc
  which was introduced in fftw-3.1, since new gcc supports -mtune=native.
  Remove the --with-gcc-arch flag; if you want to specify a particlar
  arch to configure, use ./configure CC="gcc -mtune=...".

* --with-our-malloc16 configure flag is now renamed --with-our-malloc.

* Fixed build problem failure when srand48 declaration is missing;
  thanks to Ralf Wildenhues for the bug report.

* Fixed bug in fftw_set_timelimit: ensure that a negative timelimit
  is equivalent to no timelimit in all cases.  Thanks to William Andrew
  Burnson for the bug report.

* Fixed stack-overflow problem on OpenBSD caused by using alloca with
  too large a buffer.

FFTW 3.2.2

* Improve performance of some copy operations of complex arrays on
  x86 machines.

* Add configure flag to disable alloca(), which is broken in mingw64.

* Planning in FFTW_ESTIMATE mode for r2r transforms became slower
  between fftw-3.1.3 and 3.2.  This regression has now been fixed.

FFTW 3.2.1

* Performance improvements for some multidimensional r2c/c2r transforms;
  thanks to Eugene Miloslavsky for his benchmark reports.

* Compile with icc on MacOS X, use better icc compiler flags.

* Compilation fixes for systems where snprintf is defined as a macro;
  thanks to Marcus Mae for the bug report.

* Fortran documentation now recommends not using dfftw_execute,
  because of reports of problems with various Fortran compilers;
  it is better to use dfftw_execute_dft etcetera.

* Some documentation clarifications, e.g. of fact that --enable-openmp
  and --enable-threads are mutually exclusive (thanks to Long To),
  and document slightly odd behavior of plan_guru_r2r in Fortran
  (thanks to Alexander Pozdneev).

* FAQ was accidentally omitted from 3.2 tarball.

* Remove some extraneous (harmless) files accidentally included in
  a subdirectory of the 3.2 tarball.

FFTW 3.2

* Worked around apparent glibc bug that leads to rare hangs when freeing
  semaphores.

* Fixed segfault due to unaligned access in certain obscure problems
  that use SSE and multiple threads.

* MPI transforms not included, as they are still in alpha; the alpha
  versions of the MPI transforms have been moved to FFTW 3.3alpha1.

FFTW 3.2alpha3

* Performance improvements for sizes with factors of 5 and 10.

* Documented FFTW_WISDOM_ONLY flag, at the suggestion of Mario
  Emmenlauer and Phil Dumont.

* Port Cell code to SDK2.1 (libspe2), as opposed to the old libspe1 code.

* Performance improvements in Cell code for N < 32k, thanks to Jan Wagner
  for the suggestions.

* Cycle counter for Sun x86_64 compiler, and compilation fix in cycle
  counter for AIX/xlc (thanks to Jeff Haferman for the bug report).

* Fixed incorrect type prefix in MPI code that prevented wisdom routines
  from working in single precision (thanks to Eric A. Borisch for the report).

* Added 'make check' for MPI code (which still fails in a couple corner
  cases, but should be much better than in alpha2).

* Many other small fixes.

FFTW 3.2alpha2

* Support for the Cell processor, donated by IBM Research; see README.Cell
  and the Cell section of the manual.

* New 64-bit API: for every "plan_guru" function there is a new "plan_guru64"
  function with the same semantics, but which takes fftw_iodim64 instead of
  fftw_iodim.  fftw_iodim64 is the same as fftw_iodim, except that it takes
  ptrdiff_t integer types as parameters, which is a 64-bit type on
  64-bit machines.  This is only useful for specifying very large transforms
  on 64-bit machines.  (Internally, FFTW uses ptrdiff_t everywhere
  regardless of what API you choose.)

* Experimental MPI support.  Complex one- and multi-dimensional FFTs,
  multi-dimensional r2r, multi-dimensional r2c/c2r transforms, and
  distributed transpose operations, with 1d block distributions.
  (This is an alpha preview: routines have not been exhaustively
  tested, documentation is incomplete, and some functionality is
  missing, e.g. Fortran support.)  See mpi/README and also the MPI
  section of the manual.

* Significantly faster r2c/c2r transforms, especially on machines with SIMD.

* Rewritten multi-threaded support for better performance by
  re-using a fixed pool of threads rather than continually
  respawning and joining (which nowadays is much slower).

* Support for MIPS paired-single SIMD instructions, donated by
  Codesourcery.

* FFTW_WISDOM_ONLY planner flag, to create plan only if wisdom is
  available and return NULL otherwise.

* Removed k7 support, which only worked in 32-bit mode and is
  becoming obsolete.  Use --enable-sse instead.

* Added --with-g77-wrappers configure option to force inclusion
  of g77 wrappers, in addition to whatever is needed for the
  detected Fortran compilers.  This is mainly intended for GNU/Linux
  distros switching to gfortran that wish to include both
  gfortran and g77 support in FFTW.

* In manual, renamed "guru execute" functions to "new-array execute"
  functions, to reduce confusion with the guru planner interface.
  (The programming interface is unchanged.)

* Add missing __declspec attribute to threads API functions when compiling
  for Windows; thanks to Robert O. Morris for the bug report.

* Fixed missing return value from dfftw_init_threads in Fortran;
  thanks to Markus Wetzstein for the bug report.

FFTW 3.1.1

* Performance improvements for Intel EMT64.

* Performance improvements for large-size transforms with SIMD.

* Cycle counter support for Intel icc and Visual C++ on x86-64.

* In fftw-wisdom tool, replaced obsolete --impatient with --measure.

* Fixed compilation failure with AIX/xlc; thanks to Joseph Thomas.

* Windows DLL support for Fortran API (added missing __declspec(dllexport)).

* SSE/SSE2 code works properly (i.e. disables itself) on older 386 and 486
  CPUs lacking a CPUID instruction; thanks to Eric Korpela.

FFTW 3.1

* Faster FFTW_ESTIMATE planner.

* New (faster) algorithm for REDFT00/RODFT00 (type-I DCT/DST) of odd size.

* "4-step" algorithm for faster FFTs of very large sizes (> 2^18).

* Faster in-place real-data DFTs (for R2HC and HC2R r2r formats).

* Faster in-place non-square transpositions (FFTW uses these internally
  for in-place FFTs, and you can also perform them explicitly using
  the guru interface).

* Faster prime-size DFTs: implemented Bluestein's algorithm, as well
  as a zero-padded Rader variant to limit recursive use of Rader's algorithm.

* SIMD support for split complex arrays.

* Much faster Altivec/VMX performance.

* New fftw_set_timelimit function to specify a (rough) upper bound to the
  planning time (does not affect ESTIMATE mode).

* Removed --enable-3dnow support; use --enable-k7 instead.

* FMA (fused multiply-add) version is now included in "standard" FFTW,
  and is enabled with --enable-fma (the default on PowerPC and Itanium).

* Automatic detection of native architecture flag for gcc.  New
  configure options: --enable-portable-binary and --with-gcc-arch=<arch>,
  for people distributing compiled binaries of FFTW (see manual).

* Automatic detection of Altivec under Linux with gcc 3.4 (so that
  same binary should work on both Altivec and non-Altivec PowerPCs).

* Compiler-specific tweaks/flags/workarounds for gcc 3.4, xlc, HP/UX,
  Solaris/Intel.

* Various documentation clarifications.

* 64-bit clean.  (Fixes a bug affecting the split guru planner on
  64-bit machines, reported by David Necas.)

* Fixed Debian bug #259612: inadvertent use of SSE instructions on
  non-SSE machines (causing a crash) for --enable-sse binaries.

* Fixed bug that caused HC2R transforms to destroy the input in
  certain cases, even if the user specified FFTW_PRESERVE_INPUT.

* Fixed bug where wisdom would be lost under rare circumstances,
  causing excessive planning time.

* FAQ notes bug in gcc-3.4.[1-3] that causes FFTW to crash with SSE/SSE2.

* Fixed accidentally exported symbol that prohibited simultaneous
  linking to double/single multithreaded FFTW (thanks to Alessio Massaro).

* Support Win32 threads under MinGW (thanks to Alessio Massaro).

* Fixed problem with building DLL under Cygwin; thanks to Stephane Fillod.

* Fix build failure if no Fortran compiler is found (thanks to Charles
  Radley for the bug report).

* Fixed compilation failure with icc 8.0 and SSE/SSE2.  Automatic
  detection of icc architecture flag (e.g. -xW).

* Fixed compilation with OpenMP on AIX (thanks to Greg Bauer).

* Fixed compilation failure on x86-64 with gcc (thanks to Orion Poplawski).

* Incorporated patch from FreeBSD ports (FreeBSD does not have memalign,
  but its malloc is 16-byte aligned).

* Cycle-counter compilation fixes for Itanium, Alpha, x86-64, Sparc,
  MacOS (thanks to Matt Boman, John Bowman, and James A. Treacy for
  reports/fixes).  Added x86-64 cycle counter for PGI compilers,
  courtesy Cristiano Calonaci.

* Fix compilation problem in test program due to C99 conflict.

* Portability fix for import_system_wisdom with djgpp (thanks to Juan
  Manuel Guerrero).

* Fixed compilation failure on MacOS 10.3 due to getopt conflict.

* Work around Visual C++ (version 6/7) bug in SSE compilation;
  thanks to Eddie Yee for his detailed report.

Changes from FFTW 3.1 beta 2:

* Several minor compilation fixes.

* Eliminate FFTW_TIMELIMIT flag and replace fftw_timelimit global with
  fftw_set_timelimit function.  Make wisdom work with time-limited plans.

Changes from FFTW 3.1 beta 1:

* Fixes for creating DLLs under Windows; thanks to John Pavel for his feedback.

* Fixed more 64-bit problems, thanks to John Pavel for the bug report.

* Further speed improvements for Altivec/VMX.

* Further speed improvements for non-square transpositions.

* Many minor tweaks.

FFTW 3.0.1

* Some speed improvements in SIMD code.

* --without-cycle-counter option is removed.  If no cycle counter is found,
  then the estimator is always used.  A --with-slow-timer option is provided
  to force the use of lower-resolution timers.

* Several fixes for compilation under Visual C++, with help from Stefane Ruel.

* Added x86 cycle counter for Visual C++, with help from Morten Nissov.

* Added S390 cycle counter, courtesy of James Treacy.

* Added missing static keyword that prevented simultaneous linkage
  of different-precision versions; thanks to Rasmus Larsen for the bug report.

* Corrected accidental omission of f77_wisdom.f file; thanks to Alan Watson.

* Support -xopenmp flag for SunOS; thanks to John Lou for the bug report.

* Compilation with HP/UX cc requires -Wp,-H128000 flag to increase
  preprocessor limits; thanks to Peter Vouras for the bug report.

* Removed non-portable use of 'tempfile' in fftw-wisdom-to-conf script;
  thanks to Nicolas Decoster for the patch.

* Added 'make smallcheck' target in tests/ directory, at the request of
  James Treacy.

FFTW 3.0

Major goals of this release:

* Speed: often 20% or more faster than FFTW 2.x, even without SIMD (see below).

* Complete rewrite, to make it easier to add new algorithms and transforms.

* New API, to support more general semantics.

Other enhancements:

* SIMD acceleration on supporting CPUs (SSE, SSE2, 3DNow!, and AltiVec).
(With special thanks to Franz Franchetti for many experimental prototypes
  and to Stefan Kral for the vectorizing generator from fftwgel.)

* True in-place 1d transforms of large sizes (as well as compressed
  twiddle tables for additional memory/cache savings).

* More arbitrary placement of real & imaginary data, e.g. including
  interleaved (as in FFTW 2.x) as well as separate real/imag arrays.

* Efficient prime-size transforms of real data.

* Multidimensional transforms can operate on a subset of a larger matrix,
  and/or transform selected dimensions of a multidimensional array.

* By popular demand, simultaneous linking to double precision (fftw),
  single precision (fftwf), and long-double precision (fftwl) versions
  of FFTW is now supported.

* Cycle counters (on all modern CPUs) are exploited to speed planning.

* Efficient transforms of real even/odd arrays, a.k.a. discrete
  cosine/sine transforms (types I-IV).  (Currently work via pre/post
  processing of real transforms, ala FFTPACK, so are not optimal.)

* DHTs (Discrete Hartley Transforms), again via post-processing
  of real transforms (and thus suboptimal, for now).

* Support for linking to just those parts of FFTW that you need,
  greatly reducing the size of statically linked programs when
  only a limited set of transform sizes/types are required.

* Canonical global wisdom file (/etc/fftw/wisdom) on Unix, along
  with a command-line tool (fftw-wisdom) to generate/update it.

* Fortran API can be used with both g77 and non-g77 compilers
  simultaneously.

* Multi-threaded version has optional OpenMP support.

* Authors' good looks have greatly improved with age.

Changes from 3.0beta3:

* Separate FMA distribution to better exploit fused multiply-add instructions
  on PowerPC (and possibly other) architectures.

* Performance improvements via some inlining tweaks.

* fftw_flops now returns double arguments, not int, to avoid overflows
  for large sizes.

* Workarounds for automake bugs.

Changes from 3.0beta2:

* The standard REDFT00/RODFT00 (DCT-I/DST-I) algorithm (used in
  FFTPACK, NR, etcetera) turns out to have poor numerical accuracy, so
  we replaced it with a slower routine that is more accurate.

* The guru planner and execute functions now have two variants, one that
  takes complex arguments and one that takes separate real/imag pointers.

* Execute and planner routines now automatically align the stack on x86,
  in case the calling program is misaligned.

* README file for test program.

* Fixed bugs in the combination of SIMD with multi-threaded transforms.

* Eliminated internal fftw_threads_init function, which some people were
  calling accidentally instead of the fftw_init_threads API function.

* Check for -openmp flag (Intel C compiler) when --enable-openmp is used.

* Support AMD x86-64 SIMD and cycle counter.

* Support SSE2 intrinsics in forthcoming gcc 3.3.

Changes from 3.0beta1:

* Faster in-place 1d transforms of non-power-of-two sizes.

* SIMD improvements for in-place, multi-dimensional, and/or non-FFTW_PATIENT
  transforms.

* Added support for hard-coded DCT/DST/DHT codelets of small sizes; the
  default distribution only includes hard-coded size-8 DCT-II/III, however.

* Many minor improvements to the manual.  Added section on using the
  codelet generator to customize and enhance FFTW.

* The default 'make check' should now only take a few minutes; for more
  strenuous tests (which may take a day or so), do 'cd tests; make bigcheck'.

* fftw_print_plan is split into fftw_fprint_plan and fftw_print_plan, where
  the latter uses stdout.

* Fixed ability to compile with a C++ compiler.

* Fixed support for C99 complex type under glibc.

* Fixed problems with alloca under MinGW, AIX.

* Workaround for gcc/SPARC bug.

* Fixed multi-threaded initialization failure on IRIX due to lack of
  user-accessible PTHREAD_SCOPE_SYSTEM there.